I’ve used Sonnet 4.5 and Codex 5 and 5.1, but not in their native environment [1].
Setting aside the fact that your examples are mostly “replicate this existing thing in language X” [2], again, I’m not saying that the models haven’t gotten better at crapping out code, or that they’re not useful tools. I use them every day. They're great tools, when someone actually intelligent is using them. I also freely concede that they're better tools than a year ago.
The devil is (as always) in the details: how many prompts did it take? what exactly did you have to prompt for? how closely did you look at the code? how closely did you test the end result? Remember that I can, with some amount of prompting, generate perfectly acceptable code for a complex, real-world app, using only GPT 4. But even the newest models generate absolute bullshit on a fairly regular basis. So telling me that you did something complex with an unspecified amount of additional prompting is fine, but not particularly responsive to the original claim.
[1] Copilot, with a liberal sprinkling of ChatGPT in the web UI. Please don’t engage in “you’re holding it wrong” or "you didn't use the right model" with me - I use enough frontier models on a regular basis to have a good sense of their common failings and happy paths. Also, I am trying to do something other than experiment with models, so if I have to switch environments every day, I’m not doing it. If I have to pay for multiple $200 memberships, I’m not doing it. If they require an exact setup to make them “work”, I am unlikely to do it. Finally, if your entire argument here hinges on a point release of a specific model in the last six weeks…yeah. Not gonna take that seriously, because it's the same exact argument, every six weeks. </caveats>
[2] Nothing really wrong with this -- most programming is an iterative exercise of replicating pre-existing things with minor tweaks -- but we're pretty far into the bailey now, I think. The original argument was that you can one-shot a complex application. Now we're in "I can replicate a large pre-existing thing with repeated hand-holding". Fine, and completely within my own envelope for model performance, but not really the original claim.
I know you said don't engage in "you're holding it wrong"... but have you tried these models running in a coding agent tool loop with automatic approvals turned on?
Copilot style autocomplete or chatting with a model directly is an entirely different experience from letting the model spend half an hour writing code, running that code and iterating on the result uninterrupted.
Here's an example where I sent a prompt at 2:38pm and it churned away for 7 minutes (executing 17 bash commands), then I gave it another prompt and it churned for half an hour and shipped 7 commits with 160 passing tests: https://static.simonwillison.net/static/2025/claude-code-mic...
> I know you said don't engage in "you're holding it wrong"... but have you tried these models running in a coding agent tool loop with automatic approvals turned on?
edit: I wrote a different response here, then I realized we might be talking about different things.
Are you asking if I let the agents use tools without my prior approval? I do that for a certain subset of tools (e.g. run tests, do requests, run queries, certain shell commands, even use the browser if possible), but I do not let the agents do branch merges, deploys, etc. I find that the best models are just barely good enough to produce a bad first draft of a multi-file feature (e.g. adding an entirely new controller+view to a web app), and I would never ever consider YOLOing their output to production unless I didn't care at all. I try to get to tests passing clean before even looking at the code.
Also, I am happy to let Copilot burn tokens in this manner and will regularly do it for refactors or initial drafts of new features, I'm honestly not sure if the juice is worth the squeeze -- I still typically have to spend substantial time reworking whatever they create, and the revision time required scales with the amount of time they spend spinning. If I had to pay per token, I'd be much more circumspect about this approach.
Yes, that's what I meant. I wasn't sure if you meant classic tab-based autocomplete or Copilot tool-based agent Copilot.
Letting it burn tokens on running tests and refactors (but not letting it merge branches or deploy) is the thing that feels like a huge leap forward to me. We are talking about the same set of capabilities.
For me it is something I can describe in a single casual prompt.
For example I wrote a fully working version of https://tools.nicklothian.com/llm_comparator.html in a single prompt. I refined it and added features with more prompts, but it worked from the start.
Good question. No strict line, and it's always going to be subjective and a little bit silly to categorize, but when I'm debating this argument I'm thinking: a product that does not exist today (obviously many parts of even a novel product will be completely derivative, and that's fine), with multiple views, controllers, and models, and a non-trivial amount of domain-specific business logic. Likely 50k+ lines of code, but obviously that's very hand-wavy and not how I'd differentiate.
Think: SaaS application that solves some domain specific problem in corporate accounting, versus "in-browser speadsheet", or "first-person shooter video game with AI, multi-player support, editable levels, networking and high-resolution 3D graphics" vs "flappy bird clone".
When you're working on a product of this size, you're probably solving problems like the ones cited by simonw multiple times a week, if not daily.
But re-reading your statement you seem to be claiming that there are no 50k SAAS apps that are build even using multi-shot techniques (ie, building a feature at a time).
- It's 45K of python code
- It isn't a duplicate of another program (indeed, the reason it isn't finished is because it is stuck between ISO Prolog and SWI Prolog and I need to think about how to resolve this, but I don't know enough Prolog!)
- Not a *single* line of code is hand written.
Ironically this doesn't really prove that the current frontier models are better because large amounts of code were written with non-frontier models (You can sort of get an idea of what models were used with the labels on https://github.com/nlothian/Vibe-Prolog/pulls?q=is%3Apr+is%3...)
But - importantly - this project is what convinced me that the frontier models are much better than the previous generation. There were numerous times I tried the same thing in a non-Frontier model which couldn't do it, and then I'd try it in Claude, Codex or Gemini and it would succeed.
Setting aside the fact that your examples are mostly “replicate this existing thing in language X” [2], again, I’m not saying that the models haven’t gotten better at crapping out code, or that they’re not useful tools. I use them every day. They're great tools, when someone actually intelligent is using them. I also freely concede that they're better tools than a year ago.
The devil is (as always) in the details: how many prompts did it take? what exactly did you have to prompt for? how closely did you look at the code? how closely did you test the end result? Remember that I can, with some amount of prompting, generate perfectly acceptable code for a complex, real-world app, using only GPT 4. But even the newest models generate absolute bullshit on a fairly regular basis. So telling me that you did something complex with an unspecified amount of additional prompting is fine, but not particularly responsive to the original claim.
[1] Copilot, with a liberal sprinkling of ChatGPT in the web UI. Please don’t engage in “you’re holding it wrong” or "you didn't use the right model" with me - I use enough frontier models on a regular basis to have a good sense of their common failings and happy paths. Also, I am trying to do something other than experiment with models, so if I have to switch environments every day, I’m not doing it. If I have to pay for multiple $200 memberships, I’m not doing it. If they require an exact setup to make them “work”, I am unlikely to do it. Finally, if your entire argument here hinges on a point release of a specific model in the last six weeks…yeah. Not gonna take that seriously, because it's the same exact argument, every six weeks. </caveats>
[2] Nothing really wrong with this -- most programming is an iterative exercise of replicating pre-existing things with minor tweaks -- but we're pretty far into the bailey now, I think. The original argument was that you can one-shot a complex application. Now we're in "I can replicate a large pre-existing thing with repeated hand-holding". Fine, and completely within my own envelope for model performance, but not really the original claim.