You also have to understand that the foundation of money is debt in the sense that if we paid back all the debt, money wouldn't continue to exist. The sum total of debt exceeds the total quantity of money.
In this case, the behavior is so weird and easy to trigger that I'm sure someone has filed a radar by now. So somebody has at least written a post-hoc justification?
> I don't buy the 10x efficiency thing: they are just lagging behind the performance of current SOTA models. They perform much worse than the current models while also costing much less - exactly what I would expect.
Everyone who's used Opus knows it's better than the others in a way that isn't captured by the benchmarks. I would describe it as taste.
Lots of models get really close on benchmarks, but benchmarks only tell us how good they are at solving a defined problem. Opus is far better at solving ill-defined ones.
One of the main edges Anthropic has is that "personality tuning" gap. "Nice to use" is a differentiator when raw performance isn't.
OpenAI can sometimes get an edge over Anthropic in hard narrow STEM tasks. I trust benchmarks over vibes there - and the benchmarks show the teams trading blows release after release. Tracking Claude Code vs OpenAI Codex on SWE-bench Verified feels like watching the back alley knife fight of the AI frontier.
But the vibe of "how easy is that model to interact with" and "how easy it is to get it to do what you want it to" does matter a lot when you are the one doing the interacting. And Opus makes for a damn good daily driver.
Dunno, I was using Cursor today and for some reason it decided to switch to GPT 5.3 at some point and I didn't even notice. I was sure that Opus was much better, but who knows?
At this point it's frankly not a fair comparison since DeepSeek 3.2 is now many months old and we're waiting for a newer model which has been rumoured as "any day now" since February. (We'll see).
GLM5, the largest Qwen 3.5 model, and Kimi K2.5 are fairer comparisons, though they are, yes, a bit behind. They're more than capable for routine operations though.
Anyways, I'm back to using Opus & Claude Code after a month on Codex/GPT5.3 and 5.4 and it's frankly a rather obvious downgrade. Anthropic is behind OpenAI at this point on coding models, and there's nothing to say they couldn't fall behind the Chinese models as well.
The moat is very shallow. After the events of the last two weeks there's likely a significant % of international capital very interested in breaching it. I know I would like to see this... Anthropic basically said F U to any non-Americans, and OpenAI is ... yeah.
I have a project where we've had Opus, Sonnet, Deepseek, Kimi, and Qwen create and execute about 350 plans in aggregate so far. Measured by how often the agent fails to complete the task on the first run, the quality gap is large enough to be worth several times Anthropic's subscription prices - though probably not its API prices, at least not until we've improved the harness further. At present the problem is that the cheaper models need too much human intervention, and that intervention drives up their effective cost.
My dashboard goes from all green to 50/50 green/red for our agents whenever I switch from Claude to one of the cheaper agents... This is after investing a substantial amount of effort in "dumbing down" the prompts - e.g. adding a lot of extra wording to convince the dumber models to actually follow instructions - that is not necessary for Sonnet or Opus.
I buy the benchmarks. The problem is that a 10% difference in the benchmarks makes the difference between barely usable and something that can consistently deliver working code unilaterally and require few review interventions. Basically, the starting point for "usable" on these benchmarks is already very far up the scale for a lot of tasks.
I do strongly believe the moat is narrow - with 4.6 I switched from defaulting to Opus to defaulting to Sonnet for most tasks. I can fully see myself moving substantial workloads to a future iteration of Kimi, Qwen or Deepseek in 6-12 months once they actually start approaching Sonnet 4.5 level. But for my use at least, currently, they're at best competing with Anthropic's 3.x models in terms of real-world ability.
That said, even now, I think if we were stuck with current models for 12 months, we might well also be able to build our way around this and get to a point where Deepseek and Kimi would be cheaper than Sonnet.
Eventually we'll converge on good enough harnesses to get away with cheaper models for most uses, and the remaining appeal for the frontier models will be complex planning and actual hard work.
Good point on the green/red dashboard. The opportunity cost angle is worth adding though. A failed run isn't just the wasted tokens and retry cost - it's also the task that didn't get done and the engineering required to diagnose why. On anything time-sensitive, that compounds fast.
Exactly. At the moment it's close enough to be a wash for some cases, or tilts seriously one direction or the other for others. I expect improved harnesses will mean that more and more we can just re-run a couple of times and fall back to "escalating" to Sonnet or even Opus, but whenever it involves engineering time, that's a big deal.
I still won't use what? I use Opus now, and I will use Opus then too, but as I clearly stated:
My default model has now dropped to Sonnet, because Sonnet can now do most of my tasks, and we already use Kimi, Deepseek, and Qwen.
They're just not cost-effective enough to be my main driver yet. They are however cheap enough that for things where the Claude TOS does not let me use my subscription, they still add substantial value. Just not nearly as much as I'd like.
The bulk of my tasks won't get harder as time passes, and so will move down the value chain as the cheaper models get better.
For the small proportion of my tasks that benefits from a smarter model, I will use the smartest model I can afford.
Thankfully it's not as bad as that. The 50% that goes red means we re-execute those steps, potentially several times, to see if they succeed, before we even bother manually looking at it. But the overall principle holds: first you multiply the cost by re-running, then eventually you either need to kick it up to a more expensive model and/or a human.
But of course this is also only viable for non-latency sensitive work, for starters.
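The re-run/escalate pattern described above can be sketched roughly like this - a minimal toy, where the model ladder, retry counts, and the `run_plan` stub are all hypothetical placeholders rather than anyone's actual harness:

```python
# Sketch of a retry-then-escalate harness: re-run a failed plan a few
# times on a cheap model, then move up a ladder of pricier models,
# and finally hand off to a human if everything fails.

def run_plan(plan: str, model: str) -> bool:
    """Stub: a real harness would dispatch the plan to the model's agent
    API here and check whether the task completed. For illustration we
    pretend only the cheap model fails."""
    return model != "cheap-model"

def execute_with_escalation(plan: str,
                            ladder=("cheap-model", "sonnet", "opus"),
                            retries_per_model=2):
    """Return the name of the model that finally succeeded, or None if
    the whole ladder is exhausted (i.e. escalate to a human)."""
    for model in ladder:
        for _attempt in range(retries_per_model):
            if run_plan(plan, model):
                return model
    return None
```

The cost multiplier falls out directly: each red step pays for up to `retries_per_model` runs per rung of the ladder before a human ever looks at it, which is why this only makes sense for non-latency-sensitive work.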
The parts that are not hearsay from anonymous sources - which basically means any paranoid story the FBI has yet to document, or accounts from blackmailers and grifters whose stories are full of holes and inconsistencies - are about elites partying with people aged 17 and up who were already otherwise active in related "work". Still shady, but hardly what's being reported in the sensationalist coverage, which ranges from abductions and rings to acid baths for murdered victims.
BLASPHEMY