More

simonw · 2026-02-18T00:05:38 1771373138

Took me a while to create the pelican because I was busy adding Opus/Sonnet 4.6 support to my plugin for https://llm.datasette.io/ - pelican now available here, it's not quite as good as the Opus 4.6 one but does look equivalent to the Opus 4.5 one - and it has a snazzy top hat. https://simonwillison.net/2026/Feb/17/claude-sonnet-46/

mohsen1 · 2026-02-18T00:06:52 1771373212

top hat was there in another attempt I saw in the comments here.

simonw · 2026-02-17T20:12:32 1771359152

One that lets me use it in my open source projects without then preventing other people from using my open source projects in their closed source projects.

Using your library currently completely disrupts the licensing situation for my own work.

simonw · 2026-02-17T18:01:34 1771351294

Jimmy Carter put his peanut farm in a blind trust.

simonw · 2026-02-17T16:45:40 1771346740

I'd like to see some concrete examples that illustrate this - as it stands this feels like an opinion piece that doesn't attempt to back up its claims.

(Not necessarily disagreeing with those claims, but I'd like to see a more robust exploration of them.)

barrkel · 2026-02-17T17:02:18 1771347738

Have you not seen it any time you put any substantial bit of your own writing through an LLM, for advice?

I disagree pretty strongly with most of what an LLM suggests by way of rewriting. They're absolutely appalling writers. If you're looking for something beyond corporate safespeak or stylistic pastiche, they drain the blood out of everything.

The skin of their prose lacks the luminous translucency, the subsurface scattering, that separates the dead from the living.

simonw · 2026-02-17T19:01:54 1771354914

The prompt I use for proof-reading has worked great for me so far:

  You are a proof reader for posts
  about to be published.

  1. Identify for spelling mistakes
  and typos
  2. Identify grammar mistakes
  3. Watch out for repeated terms like
  "It was interesting that X, and it
  was interesting that Y"
  4. Spot any logical errors or
  factual mistakes
  5. Highlight weak arguments that
  could be strengthened
  6. Make sure there are no empty or
  placeholder links

matwood · 2026-02-17T18:02:42 1771351362

> If you're looking for something beyond corporate safespeak

AI has been great for removing this stress. "Tell Joe no f'n way" in a professional tone and I can move on with my day.

ziml77 · 2026-02-17T20:52:17 1771361537

If you tell me "no fucking way" by running it through an LLM, I will be far more pissed than if you had just sent me "no fucking way". At least in that case I know a human read and responded rather than thinking my email was just being processed by a damned robot.

dsf2d · 2026-02-17T18:26:58 1771352818

Yeah but does it make sense to have invested all this money for this?

Lol no. Might be great for you as a consumer who is using these products for free. But expand the picture more.

matwood · 2026-02-17T19:12:58 1771355578

> Yeah but does it make sense to have invested all this money for this?

No, but it's here. Why wouldn't I use it?

Terretta · 2026-02-17T18:26:27 1771352787

> If you're looking for something beyond corporate safespeak or stylistic pastiche, they drain the blood out of everything.

Strong agree, which is why I disagree with this OP point:

“Stage 2: Lexical flattening. Domain-specific jargon and high-precision technical terms are sacrificed for "accessibility." The model performs a statistical substitution, replacing a 1-of-10,000 token with a 1-of-100 synonym, effectively diluting the semantic density and specific gravity of the argument.”

I see enough jargon in everyday business email that in the office zero-shot LLM unspoolings can feel refreshing.

I have "avoid jargon and buzzwords" as one of very tiny tuners in my LLM prefs. I've found LLMs can shed corporate safespeak, or even add a touch of sparkle back to a corporate memo.

Otherwise very bright writers have been "polished" to remove all interestingness by pre-LLM corporate homogenization. Give them a prompt to yell at them for using 1-in-10 words instead of 1-in-10,000 "perplexity" and they can tune themselves back to conveying more with the same word count. Results… scintillate.

furyofantares · 2026-02-17T18:05:53 1771351553

Look through my comment history at all the posts where I complain the author might have had something interesting to say but it's been erased by the LLM and you can no longer tell what the author cared about because the entire post is a an oversold monotone advertising voice.

https://news.ycombinator.com/item?id=46583410#46584336

https://news.ycombinator.com/item?id=46605716#46609480

https://news.ycombinator.com/item?id=46617456#46619136

https://news.ycombinator.com/item?id=46658345#46662218

https://news.ycombinator.com/item?id=46630869#46663276

https://news.ycombinator.com/item?id=46656759#46663322

https://news.ycombinator.com/item?id=46661936#46663362

https://news.ycombinator.com/item?id=46748077#46749699

internet_points · 2026-02-17T20:03:07 1771358587

I just sent TFA to a colleague of mine who was experimenting with llm's for auto-correcting human-written text, since she noticed the same phenomenon where it would correct not only mistakes, but slightly nudge words towards more common synonyms. It would often lose important nuances, so "shun" would be corrected to "avoid", and "divulge" would become "disclose" etc.

gdulli · 2026-02-17T17:10:03 1771348203

Kaffee: Corporal, would you turn to the page in this book that says where the mess hall is, please?

Cpl. Barnes: Well, Lt. Kaffee, that's not in the book, sir.

Kaffee: You mean to say in all your time at Gitmo, you've never had a meal?

Cpl. Barnes: No, sir. Three squares a day, sir.

Kaffee: I don't understand. How did you know where the mess hall was if it's not in this book?

Cpl. Barnes: Well, I guess I just followed the crowd at chow time, sir.

Kaffee: No more questions.

NitpickLawyer · 2026-02-17T16:51:17 1771347077

It is an opinion piece. By a dude working as a "Professor of Pharmaceutical Technology and Biomaterials at the University of Ferrara".

It has all the tropes of not understanding the underlying mechanisms, but repeating the common tropes. Quite ironic, considering what the author's intended "message" is. Jpeg -> jpeg -> jpeg bad. So llm -> llm -> llm must be bad, right?

It reminds me of the media reception of that paper on model collapse. "Training on llm generated data leads to collapse". That was in 23 or 24? Yet we're not seeing any collapse, despite models being trained mainly on synthetic data for the past 2 years. That's not how any of it works. Yet everyone has an opinion on how bad it works. Jesus.

It's insane how these kinds of opinion pieces get so upvoted here, while worth-while research, cool positive examples and so on linger in new with one or two upvotes. This has ceased to be a technical subject, and has moved to muh identity.

simonw · 2026-02-17T16:56:54 1771347414

Yeah, reading the other comments on this thread this is a classic example of that Hacker News (and online forums in general) thing where people jump on the chance to talk about a topic driven purely by the headline without engaging with the actual content.

(I'm frequently guilty of that too.)

ghywertelling · 2026-02-17T17:02:44 1771347764

Even if that isn't the case, isn't it the fact the AI labs don't want their models to be edgy in any creative way, choose a middle way (buddhism) so to speak. Are there AI labs who are training their models to be maximally creative?

PurpleRamen · 2026-02-17T17:10:28 1771348228

> Yet we're not seeing any collapse, despite models being trained mainly on synthetic data for the past 2 years.

Maybe because researchers learned from the paper to avoid the collapse? Just awareness alone often helps to sidestep a problem.

NitpickLawyer · 2026-02-17T17:21:08 1771348868

No one did what the paper actually proposed. It was a nothing burger in the industry. Yet it was insanely popular on social media.

Same with the "llms don't reason" from "Apple" (two interns working at Apple, but anyway). The media went nuts over it, even though it was littered with implementation mistakes and not worth the paper it was(n't) printed on.

dsf2d · 2026-02-17T18:25:18 1771352718

Who cares? This is a place where you should be putting forth your own perspective based on your own experience. Not parotting what someone else already wrote.

simonw · 2026-02-16T21:15:07 1771276507

Two more recent articles by this author:

https://0byte.io/articles/neuron.html

https://0byte.io/articles/helloml.html

He also publishes to YouTube where he has clear explanations and high production values that deserve more views.

https://www.youtube.com/watch?v=dES5Cen0q-Y (part 2 https://www.youtube.com/watch?v=-HhE-8JChHA) is the video to accompany https://0byte.io/articles/helloml.html

knickerbockeroo · 2026-02-16T22:54:41 1771282481

Very nice. Thanks for sharing.

simonw · 2026-02-16T20:10:51 1771272651

TIL a new shorthand for "the real problem is capitalism", thanks!

simonw · 2026-02-16T19:51:37 1771271497

I know it's popular comparing coding agents to slot machines right now, but the comparison doesn't entirely hold for me.

It's more like being hooked on a slot machine which pays out 95% of the time because you know how to trick it.

(I saw "no actual evidence pointing to these improvements" with a footnote and didn't even need to click that footnote to know it was the METR thing. I wish AI holdouts would find a few more studies.)

Steve Yegge of all people published something the other day that has similar conclusions to this piece - that the productivity boost for coding agents can lead to burnout, especially if companies use it to drive their employees to work in unsustainable ways: https://steve-yegge.medium.com/the-ai-vampire-eda6e4f07163

saulpw · 2026-02-16T20:46:07 1771274767

Yeah I'm finding that there's "clock time" (hours) and "calendar time" (days/weeks/months) and pushing people to work 'more' is based on the fallacy that our productivity is based on clock time (like it is in a factory pumping out widgets) rather than calendar time (like it is in art and other creative endeavors). I'm finding that even if the LLM can crank out my requested code in an hour, I'll still need a few days to process how it feels to use. The temptation is to pull the lever 10 times in a row because it was so easy, but now I'll need a few weeks to process the changes as a human. This is just for my own personal projects, and it makes sense that the business incentives would be even more intense. But you can't get around the fact that, no matter how brilliant your software or interface, customers are not going to start paying in a few hours.

simonw · 2026-02-16T20:55:59 1771275359

> The temptation is to pull the lever 10 times in a row because it was so easy, but now I'll need a few weeks to process the changes as a human

Yeah I really feel that!

I recently learned the term "cognitive debt" for this from https://margaretstorey.com/blog/2026/02/09/cognitive-debt/ and I think it's a great way to capture this effect.

I can churn out features faster, but that means I don't get time to fully absorb each feature and think through its consequences and relationships to other existing or future features.

mrbungie · 2026-02-16T19:59:30 1771271970

If you are really good and fast validating/fixing code output or you are actually not validating it more than just making sure it runs (no judging), I can see it paying out 95% of the time.

But for what I've seen both validating my and others coding agents outputs I'd estimate a much lower percentage (Data Engineering/Science work). And, oh boy, some colleages are hooked to generating no matter the quality. Workslop is a very real phenomenon.

biophysboy · 2026-02-16T20:19:00 1771273140

This matches my experience using LLMs for science. Out of curiosity, I downloaded a randomized study and the CONSORT checklist, and asked Claude code to do a review using the checklist.

I was really impressed with how it parsed the structured checklist. I was not at all impressed by how it digested the paper. Lots of disguised errors.

baq · 2026-02-16T20:59:34 1771275574

try codex 5.3. it's dry and very obviously AI; if you allow a bit of anthropomorphisation, it's kind of high-functioning autistic. it isn't an oracle, it'll still be wrong, but it's a powerful, completely different from claude tool.

biophysboy · 2026-02-16T21:03:23 1771275803

Does it get numbers right? One of the mistakes it made in reading the paper was swapping sets of numbers from the primary/secondary outcomes.

baq · 2026-02-16T21:11:30 1771276290

it does get screenshots right for me, but obviously I haven't tried on your specific paper. I can only recommend trying it out, it's also has a much more generous limits in the $20 tier than opus.

biophysboy · 2026-02-16T21:44:42 1771278282

I see. To clarify, it parsed numbers in the pdf correct, but assigned them the wrong meaning. I was wondering if codex is better at interpreting non text data

enraged_camel · 2026-02-17T10:05:35 1771322735

Every time someone suggests Codex I give it a shot. And every time it disappoints.

After I read your comment, I gave Codex 5.3 the task of setting up an E2E testing skeleton for one of my repos, using Playwright. It worked for probably 45 minutes and in the end failed miserably: out of the five smoke tests it created, only two of them passed. It gave up on the other three and said they will need “further investigation”.

I then stashed all do that code and gave the exact same task to Opus 4.5 (not even 4.6), with the same prompt. After 15 mins it was done. Then I popped Codex’s code from the stash and asked Opus to look at it to see why the three m of the five tests Codex wrote didn’t pass. It looked at them and found four critical issues that Codex had missed. For example, it had failed to detect that my localhost uses https, so the the E2E suite’s API calls from the Vue app kept failing. Opus also found that the two passing tests were actually invalid: they checked for the existence of a div with #app and simply assumed it meant the Vue app booted successfully.

This is probably the dozenth comparison I’ve done between Codex and Opus. I think there was only one scenario where Codex performed equally well. Opus is just a much better model in my experience.

baq · 2026-02-17T11:38:41 1771328321

moral of the story is use both (or more) and pick the one that works - or even merge the best ideas from generated solutions. independent agentic harnesses support multi-model workflows.

enraged_camel · 2026-02-17T12:33:27 1771331607

I don't think that's the moral of the story at all. It's already challenging enough to review the output from one model. Having to review two, and then comparing and contrasting them, would more than double the cognitive load. It would also cost more.

I think it's much more preferable to pick the most reliable one and use it as the primary model, and think of others as fallbacks for situations where it struggles.

baq · 2026-02-17T12:50:16 1771332616

you should always benchmark your use cases and you obviously don't review multiple outputs; you only review the consensus.

see how perplexity does it: https://www.perplexity.ai/hub/blog/introducing-model-council

r00tanon · 2026-02-16T21:40:46 1771278046

I was going to mention Yegge's recent blog posts mirroring this phenomena.

There's also this article on hbr.org https://hbr.org/2026/02/ai-doesnt-reduce-work-it-intensifies...

This is a real thing, and it looks like classic addiction.

fdefitte · 2026-02-16T20:12:11 1771272731

That 95% payout only works if you already know what good looks like. The sketchy part is when you can't tell the diff between correct and almost-correct. That's where stuff goes sideways.

energy123 · 2026-02-17T07:40:17 1771314017

Being on a $200 plan is a weird motivator. Seeing the unused weekly limit for codex and the clock ticking down, and knowing I can spam GPT 5.2 Pro "for free" because I already paid for it.

Retr0id · 2026-02-16T20:15:34 1771272934

It's 95% if you're using it for the stuff it's good at. People inevitably try to push it further than that (which is only natural!), and if you're operating at/beyond the capability frontier then the success rate eventually drops.

Kiro · 2026-02-16T20:26:22 1771273582

Just need to point out that the payout is often above 95% at online casinos. As long as it's below 100 the house still wins.

mikkupikku · 2026-02-16T21:00:21 1771275621

He means a slot machine that pays you 95% of the time, not a slot machine that pays out 95% of what you put in.

Claude Code wasting my time with nonsense output one in twenty times seems roughly correct. The rest of the time it's hitting jackpots.

fy20 · 2026-02-16T21:01:07 1771275667

> It's more like being hooked on a slot machine which pays out 95% of the time because you know how to trick it

Right but the <100% chance is actually why slot machines are addictive. If it pays out continuously the behaviour does not persist as long. It's called the partial reinforcement extinction effect.

jrflowers · 2026-02-16T20:15:41 1771272941

> It's more like being hooked on a slot machine which pays out 95% of the time because you know how to trick it.

“It’s not like a slot machine, it’s like… a slot machine… that I feel good using”

That aside if a slot machine is doing your job correctly 95% of the time it seems like either you aren’t noticing when it’s doing your job poorly or you’ve shifted the way that you work to only allow yourself to do work that the slot machine is good at.

globular-toast · 2026-02-17T16:08:40 1771344520

> It's more like being hooked on a slot machine which pays out 95% of the time because you know how to trick it.

I think you are mistaken on what the "payout" is. There's only one reason someone is working all hours and during a party and whatnot: it's to become rich and powerful. The payout is not "more code", it's a big house, fast cars, beautiful women etc. Nobody can trick it into paying out even 1% of the time, let alone 95%.

zem · 2026-02-17T08:06:09 1771315569

thanks, that steve yegge piece was a very good read.

simonw · 2026-02-16T13:12:49 1771247569

Have it maintain and run a test suite.

simonw · 2026-02-16T13:09:29 1771247369

"Implements + tests against sqlite3 as oracle"

That's the real unlock in my opinion. It's effectively an automated reverse engineering of how SQLite behaves, which is something agents are really good at.

I did a similar but smaller project a couple of weeks ago to build a Python library that could parse a SQLite SELECT query into an AST - same trick, I ran the SQLite C code as an oracle for how those ASTs should work: https://github.com/simonw/sqlite-ast

Question: you mention the OpenAI and Anthropic Pro plans, was the total cost of this project in the order of $40 ($20 for OpenAI and $20 for Anthropic)? What did you pay for Gemini?

kyars · 2026-02-16T21:34:10 1771277650

yes, in the order of $50 let's say, although with api I believe it would be in the hundreds

Gemini is free, I don't even know if they have a paid plan?

simonw · 2026-02-16T12:58:09 1771246689

Pelican is OK, not a good bicycle: https://gist.github.com/simonw/67c754bbc0bc609a6caedee16fef8...

oidar · 2026-02-16T16:08:41 1771258121

How much more do you know about pelicans now than when you first started doing this?

simonw · 2026-02-16T19:20:35 1771269635

Lots more but not because of the benchmark - I live in Half Moon Bay, CA which turns out to have the second largest mega-roost of the California Brown Pelican (at certain times of year) and my wife and I befriended our local pelican rescue expert and helped on a few rescues.

nikhilsimha · 2026-02-17T04:28:09 1771302489

easily the most memorable comment i have ever seen on hackernews so far. kudos good sir!

tarruda · 2026-02-16T13:22:22 1771248142

At this point I wouldn't be surprised if your pelican example has leaked into most training datasets.

I suggest to start using a new SVG challenge, hopefully one that makes even Gemini 3 Deep Think fail ;D

jon-wood · 2026-02-16T14:08:04 1771250884

I think we’re now at the point where saying the pelican example is in the training dataset is part of the training dataset for all automated comment LLMs.

Mossly · 2026-02-16T21:54:58 1771278898

It's quite amusing to ask LLMs what the pelican example is and watch them hallucinate a plausible sounding answer.

---

Qwen 3.5: "A user asks an LLM a question about a fictional or obscure fact involving a pelican, often phrased confidently to test if the model will invent an answer rather than admitting ignorance." <- How meta

Opus 4.6: "Will a pelican fit inside a Honda Civic?"

GPT 5.2: "Write a limerick (or haiku) about a pelican."

Gemini 3 Pro: "A man and a pelican are flying in a plane. The plane crashes. Who survives?"

Minimax M2.5: "A pelican is 11 inches tall and has a wingspan of 6 feet. What is the area of the pelican in square inches?"

GLM 5: "A pelican has four legs. How many legs does a pelican have?"

Kimi K2.5: "A photograph of a pelican standing on the..."

---

I agree with Qwen, this seems like a very cool benchmark for hallucinations.

ertgbnm · 2026-02-16T14:59:00 1771253940

I'm guessing it has the opposite problem of typical benchmarks since there is no ground truth pelican bike svg to over fit on. Instead the model just has a corpus of shitty pelicans on bikes made by other LLMs that it is mimicking.

So we might have an outer alignment failure.

WarmWash · 2026-02-16T15:43:43 1771256623

Most people seem to have this reflexive belief that "AI training" is "copy+paste data from the internet onto a massive bank of hard drives"

So if there is a single good "pelican on a bike" image on the internet or even just created by the lab and thrown on The Model Hard Drive, the model will make a perfect pelican bike svg.

The reality of course, is that the high water mark has risen as the models improve, and that has naturally lifted the boat of "SVG Generation" along with it.

Wowfunhappy · 2026-02-16T20:28:15 1771273695

How would that work? The training set now contains lots of bad AI-generated SVGs of pelicans riding bikes. If anything, the data is being poisoned.

thomasahle · 2026-02-16T23:09:22 1771283362

We scaled on "virtually all RL tasks and environments we could conceive." - apparently, they didn't conceive of pelican SVG RL.

I've long thought multi-modal LLMs should be strong enough to do RL for TikZ and SVG generation. Maybe Google is doing it.

moffers · 2026-02-16T13:57:30 1771250250

I like the little spot colors it put on the ground

embedding-shape · 2026-02-16T13:26:36 1771248396

How many times do you run the generation and how do you chose which example to ultimately post and share with the public?

simonw · 2026-02-16T14:59:40 1771253980

Once. It's a dice roll for the models.

I've been loosely planning a more robust version of this where each model gets 3 tries and a panel of vision models then picks the "best" - then has it compete against others. I built a rough version of that last June: https://simonwillison.net/2025/Jun/6/six-months-in-llms/#ai-...

canadiantim · 2026-02-16T13:34:59 1771248899

AbstractGeo · 2026-02-16T18:32:18 1771266738

What quantization were you running there, or, was it the official API version?

simonw · 2026-02-16T19:20:58 1771269658

I tested it via OpenRouter https://openrouter.ai/chat?models=qwen/qwen3.5-plus-02-15

m12k · 2026-02-16T17:00:21 1771261221

Axis aligned spokes is certainly a choice

bertili · 2026-02-16T14:21:04 1771251664

Better than frontier pelicans as of 2025