The US's entire economy depends on tech. They won't do anything that would compromise the integrity and viability of the international tech industrial complex.
In the US you also are not arrested for social media posts like you are in the UK or other parts of Europe.
The general conceit of this article, which is something that many frontier labs seem to be beginning to realize, is that the average human is no longer smart enough to provide sufficient signal to improve AI models.
No, it's that the average unpaid human doesn't care to read closely enough to provide signal to improve AI models. Not that they couldn't if they put in even the slightest amount of effort.
Firstly, paying is not at all the correct incentive for the desired outcome. When the incentive is payment, people will optimize for maximum payout not for the quality goals of the system.
Secondly, it doesn't fix stupidity. A participant who earnestly takes the quality goals of the system to heart instead of focusing on maximizing their take (and is thus, obviously, stupid) will still make bad classifications for exactly that reason.
> Firstly, paying is not at all the correct incentive for the desired outcome. When the incentive is payment, people will optimize for maximum payout not for the quality goals of the system.
1. I would expect any paid arrangement to include a quality-control mechanism, unless it was designed from scratch by complete ignoramuses.
1. Goodhart's law suggests that you will end up with quality-control mechanisms that ensure the measure is being met, but not that the measure captures anything useful
2. Criticism of a method does not require that there is a viable alternative. Perhaps the better idea is just to not incentivize people to do tasks they are not qualified for.
I don't think there is any correct incentive for "do unpaid labour for someone's proprietary model but please be diligent about it"
edit: ugh. it's even worse, lmarena itself is a proprietary system, so the users presumably don't even get the benefit of an open dataset out of all this
I'm being (mostly) serious: suppose you're a stuffed shirt trying to boost your valuation, how can you work out who's smart enough to train your LLM? (Never mind how to get them to work for you!)
I do a lot of human evaluations. Lots of Bayesian / statistical models that can infer rater quality without ground truth labels. The other thing about preference data you have to worry about (which this article gets at) is: preferences of _who_? Human raters are a significantly biased population of people, different ages, genders, religions, cultures, etc all inform preferences. Lots of work being done to leverage and model this.
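For readers curious what "inferring rater quality without ground truth" can look like, here is a toy sketch of one such approach: a simplified Dawid-Skene-style EM over binary votes. The votes and the setup are made up for illustration; this is not any lab's actual pipeline.

```python
import numpy as np

# votes[i][r] = binary label given by rater r to item i.
# Raters 0 and 1 mostly agree; rater 2 frequently disagrees.
votes = np.array([
    [1, 1, 0],
    [1, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 1],
])
n_items, n_raters = votes.shape

# Initialize per-item probability of class 1 from the raw majority vote
p = votes.mean(axis=1).astype(float)

for _ in range(20):
    # M-step: each rater's accuracy = expected agreement with consensus
    acc = np.array([
        (p * votes[:, r] + (1 - p) * (1 - votes[:, r])).mean()
        for r in range(n_raters)
    ])
    # E-step: re-estimate item labels, weighting raters by accuracy
    log_odds = np.zeros(n_items)
    for r in range(n_raters):
        a = np.clip(acc[r], 1e-6, 1 - 1e-6)
        log_odds += np.where(votes[:, r] == 1, 1, -1) * np.log(a / (1 - a))
    p = 1 / (1 + np.exp(-log_odds))

print(acc)  # rater 2, who keeps disagreeing with consensus, scores lowest
```

The point is just that agreement structure alone is enough signal to separate careful raters from noisy ones, no gold labels required; real systems layer priors for culture, demographics, etc. on top of this.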
Then for LMArena there is the host of other biases / construct validity issues: people are easily fooled, even PhD experts; in many cases it’s easier for a model to learn how to persuade than to actually learn the right answers.
But there are a lot of dismissive comments here, as if frontier labs don’t know this; they have some of the best talent in the world. They aren’t perfect, but they by and large know what they’re doing and what the tradeoffs of various approaches are.
Human annotations are an absolute nightmare for quality, which is why coding agents are so nice: they’re verifiable, and so you can train them in a way closer to e.g. AlphaGo, without the ceiling of human performance.
Sure, on the surface judging the judge is just as hard as being the judge
But at least the two examples of judging AI provided in the article can be solved by any moron by expending enough effort. Any moron can tell you what Dorothy says to Toto when entering Oz by just watching the first thirty minutes of the movie. And while validating answer B in the pan question takes some ninth-grade math (or a short trip to Wikipedia), figuring out that a nine inch diameter circle is in fact not the same area as a 9x13 inch rectangle is not rocket science. And with a bit of craft paper you could evaluate both answers even without math knowledge
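For what it's worth, the pan comparison really is one line of ninth-grade math (dimensions taken from the comment above, not from the article itself):

```python
import math

# Area of a 9-inch-diameter round pan vs. a 9x13-inch rectangular pan
round_area = math.pi * (9 / 2) ** 2  # pi * r^2, about 63.6 sq in
rect_area = 9 * 13                   # 117 sq in

# The rectangle is nearly twice the area; the two are not interchangeable
print(round_area, rect_area)
```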
So the short answer is: with effort. You spend lots of effort on finding a good evaluator, so the evaluator can judge the LLM for you. Or take "average humans" and force them to spend more effort on evaluating each answer
Popularity has never been a meaningful signal of quality, no matter how many tech companies try to make it so, with their star ratings, up/down voting, and crowdsourcing schemes.
Different strokes for different folks: I mean who is to say if Bleach or Backstabbed in a Backwater Dungeon: My Trusted Companions Tried to Kill Me, but Thanks to the Gift of an Unlimited Gacha I Got LVL 9999 Friends and Am Out for Revenge on My Former Party Members and the World is better?
Yep, it's like getting a commoner from the street to evaluate a literature PhD in their native language. Sure, both know the language, but the depth difference between a specialist and a generalist is too large. And we can't use AI to automatically evaluate this literature genius either, because real AI doesn't exist (yet), hence the programs can't understand the contents of the text they output or input. Whoops. :)
The average human is a moron you wouldn't trust to watch your hamster. If you watched them outside of the narrow range of tasks they have been trained to perform by rote you would probably conclude they should qualify for benefits by virtue of mental disability.
We give them WAY too much credit by watching mostly the things they have been trained specifically to do and pretending this indicates a general mental competence that just doesn't exist.
The inevitable outcome of regulation on building data centers in the US is that they will be built in the Gulf states, China, or wherever else it is cheaper and better.
They should do a 95% and 99% version of the graphs, otherwise it's hard to ascertain whether the failure cases will remain in the elusive category of "stuff humans can do easily but LLMs trip up on despite scaling"
Those are all expensive because of artificial barriers meant to keep their prices high. Go to any Asian country and houses, healthcare and cars are priced like commodities, not luxuries.
Tech and AI have taken off in the US partially because they’re in the domain of software, which hasn’t been regulated to the point of deliberate inefficiency like other industries in the US.
If we had less regulation of insurance companies, do you think they’d be cheaper?
(I pick this example because our regulation of insurance companies has (unintuitively) incentivized them to pay more for care. So it’s an example of poor regulation imo)
Health care is the more complicated one of the examples cited, but housing definitely is an 'own goal' in how we made it too difficult to build in too many places - especially "up and in" rather than outward expansion.
Health care is complicated, but I don't think it would be hard to understand how less regulation could lower prices. More insurers could enter markets, compete across state lines, and compliance costs could be lowered.
However regulation is helpful for those already sick or with pre-existing conditions. Developed countries with well-regulated systems also have better health outcomes than the US does.
Well, they'd be more functional as insurance, at least! The way insurance is supposed to work is that your insurance premium is proportional to the risk. You can't go uninsured and then after discovering that your house is on fire and about to burn down, sign up for an insurance plan and expect it to be covered.
We've blundered into a system that has the worst parts of socialized health care and private health insurance without any of the benefits.
> Go to any Asian country and houses, healthcare and cars are priced like commodities, not luxuries.
What do you mean? Several Asian cities have housing crises far worse than the US in local purchasing power, and I'd even argue that a "cheap" home in many Asian countries is going to be of a far lower quality than a "cheap" home in the US.
You mean the same Asia that has the same problem? The USA enjoying arbitrage is not actually a solution, nor is it sustainable. Not to mention that if you control for certain things, like house size relative to inflation-adjusted income, it isn't actually much different, despite popular belief.
It honestly reminds me of the opinion pieces put out by encyclopedia companies about the many ways Wikipedia was inferior.
I read an article that pretended to objectively compare them. It noted that Wikipedia (at that time) had more articles, but not way more... A brief "sampling test" suggested EB was marginally more accurate than Wikipedia - marginally!
The article concluded that EB was superior. Which is what the author was paid to conclude, obviously. "This free tool is marginally better in some ways, and slightly worse in others, than this expensive tool - so fork over your cash!"
Boomers in the manager class love AI because it sells the promise of what they've longed for for decades: a perfect servant that produces value with no salary, no need for breaks, no pushback, no workers comp suits, etc.
The thing is, AI did suck in 2023, and even in 2024, but recently the best AI models are veering into not-sucking territory. From a distance that makes sense: throw the smartest researchers on the planet and billions of dollars at a problem, and eventually something will give and the wheels will start turning.
There is a strange blindness many people have on here, a steadfast belief that AI will just never end up working, or will always be a scam. But the massive capex on AI now is predicated on fledgling LLMs eventually turning into self-adaptive systems that can manage any cognitive task better than a human, and I don't see how the improvements we've seen over the past few years aren't heading in that direction.
It still kinda sucks though. You can make it work, but you can also easily end up wasting a huge amount of time trying to make it do something that it's just incapable of. And it's impossible to know upfront if it will work. It's more like gambling.
I personally think we have reached some kind of local maximum. I work 8 hours per day with claude code, so I'm very much aware of even subtle changes in the model. Taking into account how much money was thrown at it, I can't see much progress in the last few model iterations. Only the "benchmarks" are improving, but the results I'm getting are not. If I care about some work, I almost never use AI. I also watch a lot of people streaming online to pick up new workflows and often they say something like "I don't care much about the UI, so I let it just do its thing". I think this tells you more about the current state of AI for coding than anything else. Far from _not sucking_ territory.
> recently the best AI models are veering into not sucking territory
I agree with your assessment.
I find it absolutely wild that 'it almost doesn't entirely suck, if you squint' is suddenly an acceptable benchmark for a technology to be unleashed upon the public.
We have standards for cars, speakers, clothing, furniture, make up, even literature.
Someone can't just type up a few pages of dross and put it through 100 letterboxes without being liable for littering and nuisance. The EU and UK don't allow someone to sell phones with a pre-installed app that almost performs a function that some users might theoretically want. The public domain has quality standards.
Or rather, it had quality standards. But if it's apparently legal to put semi-functioning data-collectors in technologies where nobody asked for them, why isn't it legal to sell chairs that collapse unless you hold them a specific way, clothes that don't actually function as clothes but could be used to make actual clothes by a competent tailor, or headphones that can be coaxed into sporadically producing sound for minutes at a time?
Either something works to a professional standard or it doesn't.
If it doesn't, it is/was not legal to include it in consumer products.
This is why people are more angry than is justified by a single unreliable program.
I don't care that much whether LLMs perform the functions that are advertised (and they don't, half the time).
I care that after many decades of living in a first world country with consumer protection and minimum standards, all of that seems to have been washed away in the AI wave. When it recedes, we will be left paying first world prices for third world engineering, now that the acceptable quality standard for everything seems to have dropped to 'it can almost certainly be used for its intended purpose at least some of the time, by some people, with a little effort'.