More

coldtea · 2026-03-11T10:17:55 1773224275

>FreeBSD underneath (so, practically, Linux).

BLASPHEMY

coldtea · 2026-03-11T10:16:52 1773224212

>Why does macOS tell you to use the GUI so much?

Because it's whole point is that it's a graphical OS.

If you used just cli unix userland, might as well use Linux.

coldtea · 2026-03-10T23:38:48 1773185928

One interesting actionable takeaway fact is that if you do such a project spending so much time on it, your life outside of it likely sucks.

coldtea · 2026-03-10T23:37:08 1773185828

>as long as we can continue to grow more than that, forever.

"Being chased while on foot by a pack of hungry cheetahs is not that bad, as long as we can keep our distance"

kelseyfrog · 2026-03-11T00:30:32 1773189032

You also have to understand that the foundation of money is debt in the sense that if we paid back all the debt, money wouldn't continue to exist. The sum total of debt exceeds the total quantity of money.

coldtea · 2026-03-10T23:36:02 1773185762

>You may as well keep increasing the number of second fine, because in no earthly circumstance will I ever be able to pay it back.

When that's the case, you'd be surprised what happens when the breaking point comes.

Or do you think countries haven't gone bankrupt before?

coldtea · 2026-03-10T11:34:34 1773142474

>for the life of my I can't understand why y'all care so much about this.

Because we fucking have to see it every day. And the sloppiness compounds and is indicative of further rot.

Of course the different radii also means different code paths were used, which points to a mess of APIs and frameworks underneath too.

And that's before we add the usability issues (like hard to read labels due to the glass effect and such, or bizare dragging boundaries, etc).

>Doesn't that seem a bit... particular?

Good software is about being particular.

If we wanted any random crap, we'd use any random crap.

coldtea · 2026-03-10T11:33:12 1773142392

Some do, yes.

coldtea · 2026-03-10T11:32:09 1773142329

>I guess I don't necessarily hate it, it's more of a neutral thing, but who is deciding these strange things??

Probably nobody, just some artifact of the overlay APIs used default behavior that they didn't bother to streamline.

StilesCrisis · 2026-03-10T11:45:25 1773143125

In this case, the behavior is so weird and easy to trigger that I'm sure someone has filed a radar by now. So somebody has at least written a post-hoc justification?

coldtea · 2026-03-10T11:16:53 1773141413

> I don't buy the 10x efficiency thing: they are just lagging behind the performance of current SOTA models. They perform much worse than the current models while also costing much less - exactly what I would expect.

Define "much worse".

  +--------------------------------------+-------------+-----------+------------------+
  | Benchmark                            | Claude Opus | DeepSeek  | DeepSeek vs Opus |
  +--------------------------------------+-------------+-----------+------------------+
  | SWE-Bench Verified (coding)          | 80.9%       | 73.1%     | ~90%                 |
  | MMLU (knowledge)                     | ~91         | ~88.5     | ~97%               |
  | GPQA (hard science reasoning)        | ~79–80      | ~75–76    | ~95%             |
  | MATH-500 (math reasoning)            | ~78         | ~90       | ~115%            |
  +--------------------------------------+-------------+-----------+------------------+

Filligree · 2026-03-10T11:29:06 1773142146

Everyone who's used Opus knows it's better than the others in a way that isn't captured by the benchmarks. I would describe it as taste.

Lots of models get really close on benchmarks, but benchmarks only tell us how good they are at solving a defined problem. Opus is far better at solving ill-defined ones.

ACCount37 · 2026-03-10T13:44:03 1773150243

One of the main edges Anthropic has is that "personality tuning" gap. "Nice to use" is a differentiator when raw performance isn't.

OpenAI can sometimes get an edge over Anthropic in hard narrow STEM tasks. I trust benchmarks over vibes there - and the benchmarks show the teams trading blows release after release. Tracking Claude Code vs OpenAI Codex on SWE-bench Verified feels like watching the back alley knife fight of the AI frontier.

But the vibe of "how easy is that model to interact with" and "how easy it is to get it to do what you want it to" does matter a lot when you are the one doing the interacting. And Opus makes for a damn good daily driver.

torginus · 2026-03-10T15:55:17 1773158117

Dunno, I was using Cursor today and for some reason it decided to swith to GPT 5.3 at some point and I didn't even notice. I was sure that Opus is much better, but who knows?

cmrdporcupine · 2026-03-10T14:27:42 1773152862

At this point it's frankly not a fair comparison since DeepSeek 3.2 is now many months old and we're waiting for a newer model which has been rumoured as "any day now" since February. (We'll see).

GLM5, the largest Qwen 3.5 model, and Kimi K2.5 are more fair comparisons, though they are, yes, a bit behind. They're more than capable for routine operations though.

Anyways, I'm back to using Opus & Claude Code after a month on Codex/GPT5.3 and 5.4 and it's frankly a rather obvious downgrade. Anthropic is behind OpenAI at this point on coding models, and there's nothing to say they couldn't fall behind the Chinese models as well.

The moat is very shallow. After the events of the last two weeks there's likely a significant % of international capital very interested in breaching it. I know I would like to see this... Anthropic basically said F U to any non-Americans, and OpenAI is ... yeah.

coldtea · 2026-03-10T11:36:52 1773142612

>Everyone who's used Opus knows it's better than the others in a way that isn't captured by the benchmarks. I would describe it as taste.

Ah, the "trust me bro" advantage. Couldn't it just be brand identity and familiarity?

vidarh · 2026-03-10T12:18:04 1773145084

I have a project where we've had Opus, Sonnet, Deepseek, Kimi, Qwen create and execute an aggregate total of about 350 plans so far, and the quality difference as measured in plans where the agent failed to complete the tasks on the first run is high enough that it comes out several times higher than Anthropics subscription prices, but probably cheaper than the API prices once we have improved the harness further - at present the challenge is that too much human intervention for the cheaper models drives up the cost.

My dashboard goes from all green to 50/50 green/red for our agents whenever I switch from Claude to one of the cheaper agents... This is after investing a substantial amount of effort in "dumbing down" the prompts - e.g. adding a lot of extra wording to convince the dumber models to actually follow instructions - that is not necessary for Sonnet or Opus.

I buy the benchmarks. The problem is that a 10% difference in the benchmarks makes the difference between barely usable and something that can consistently deliver working code unilaterally and require few review interventions. Basically, the starting point for "usable" on these benchmarks is already very far up the scale for a lot of tasks.

I do strongly believe the moat is narrow - With 4.6 I switched from defaulting to Opus to defaulting to Sonnet for most tasks. I can fully see myself moving substantial workloads to a future iteration of Kimi, Qwen or Deepseek in 6-12 months once they actually start approaching Sonnet 4.5 level. But for my use at least, currently, they're at best competing with Athropics 3.x models in terms of real-world ability.

That said, even now, I think if we were stuck with current models for 12 months, we might well also be able to build our way around this and get to a point where Deepseek and Kimi would be cheaper than Sonnet.

Eventually we'll converge on good enough harnesses to get away with cheaper models for most uses, and the remaining appeal for the frontier models will be complex planning and actual hard work.

oren1531 · 2026-03-10T13:23:45 1773149025

Good point on the green/red dashboard. The opportunity cost angle is worth adding though. A failed run isn't just the wasted tokens and retry cost - it's also the task that didn't get done and the engineering required to diagnose why. On anything time-sensitive, that compounds fast.

vidarh · 2026-03-10T14:40:38 1773153638

Exactly. At the moment it's close enough to be a wash for some cases, or tilts seriously one direction or other for others. I expect improved harnesses means more and more we'll just be able to re-run a couple of times, and fall back to "escalating" to Sonnet or even Opus, but whenever it involves egineering time, that's a big deal.

Bombthecat · 2026-03-10T13:08:36 1773148116

In 12 months, opus will be better than now and you still won't use it lol

vidarh · 2026-03-10T14:38:18 1773153498

I still won't use what? I use Opus now, and I will use Opus then too, but as I clearly stated:

My default model has now dropped to Sonnet, because Sonnet can now do most of my tasks, and we already use Kimi, Deepseek, and Qwen.

They're just not cost-effective enough to be my main driver yet. They are however cheap enough that for things where the Claude TOS does not let me use my subscription, they still add substantial value. Just not nearly as much as I'd like.

The bulk of my tasks won't get harder as time passes, and so will move down the value chain as the cheaper models get better.

For the small proportion of my tasks that benefits from a smarter model, I will use the smartest model I can afford.

2026-03-10T16:19:28 1773159568

[dead]

vidarh · 2026-03-10T16:54:27 1773161667

Thankfully it's not as bad as that. The 50% that goes red means we re-execute those steps, potentially several times, to see if they succeed, before we even bother manually looking at it. But the overall principle holds: First yoou multiply the cost by re-running, then eventually you either need to kick it up to a more expensive model and/or a human.

But of course this is also only viable for non-latency sensitive work, for starters.

cesarvarela · 2026-03-10T13:48:30 1773150510

The harness makes a difference too.

yorwba · 2026-03-10T12:01:48 1773144108

Where are you getting those benchmark figures from? Math-500 should be closer to 98% for both models: https://artificialanalysis.ai/evaluations/math-500?models=de...

coldtea · 2026-03-09T22:44:33 1773096273

The parts that are not hearsay from anonymous sources, which basically means any paranoid story that the FBI still has to document, or blackmailers and grifters with plenty of holes and inconsistencies to their stories, are about about elites partying with 17 and above year olds who where otherwise active already in related "work". Still shady, but hardly what's being reported in the sensationalist coverage, which ranges from abductions and rings to acid baths for murdered victims.

https://www.mtracey.net/p/we-need-to-talk-about-virginia

https://www.mtracey.net/p/epstein-survivors-refusing-questio...

https://www.mtracey.net/p/i-found-the-real-epstein-coverup

sickofparadox · 2026-03-09T22:46:50 1773096410

Don't forget the people claiming that Epstein ate people because of his weird emails about beef jerky.

coldtea · 2026-03-09T22:50:07 1773096607

Yeah or stupid jokes like this taken to mean actual torture:

https://www.reddit.com/r/Epstein/comments/1r1nuv7/ice_cream_...