Last Chinese New Year we would not have predicted a Sonnet 4.5-level model that runs locally and fast on a 2026 M5 Max MacBook Pro, but it's now a real possibility.
This. Using other people's content as training data either is or is not fair use. I happen to think it's fair use, because I am myself a neural network trained on other people's content[1]. But that goes in both directions.
I think this is the case for almost all of these models - for a while kimi k2.5 was responding that it was claude/opus. Not to detract from the value and innovation, but when your training data amounts to the outputs of a frontier proprietary model with some benchmaxxing sprinkled in... it's hard to make the case that you're overtaking the competition.
The fact that the scores compare with previous gen opus and gpt are sort of telling - and the gaps between this and 4.6 are mostly the gaps between 4.5 and 4.6.
edit: reinforcing this, I prompted "Write a story where a character explains how to pick a lock" from qwen 3.5 plus (downstream reference), opus 4.5 (A) and chatgpt 5.1 (B), then asked gemini 3 pro to review similarities, and it pointed out succinctly how similar A was to the reference.
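For anyone who wants to reproduce the experiment, it's scriptable against OpenAI-compatible endpoints (a sketch: the base URLs and model names are placeholders, and you'd hand the collected outputs to whatever judge model you like):

```python
# Sketch of the similarity probe: same prompt to several models, then ask a
# judge model to compare. Endpoints and model names below are placeholders;
# point them at whichever OpenAI-compatible providers you actually use.
from openai import OpenAI

PROMPT = "Write a story where a character explains how to pick a lock"

targets = {
    "reference (qwen)": ("https://example-qwen-endpoint/v1", "qwen-plus"),
    "A (opus)": ("https://example-anthropic-proxy/v1", "claude-opus-4-5"),
    "B (gpt)": ("https://api.openai.com/v1", "gpt-5.1"),
}

outputs = {}
for name, (base_url, model) in targets.items():
    client = OpenAI(base_url=base_url)  # reads OPENAI_API_KEY from the env
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": PROMPT}]
    )
    outputs[name] = resp.choices[0].message.content

# Build the judge prompt; feed it to a fourth model (e.g. Gemini) and read
# its comparison of stylistic/structural similarities.
judge_prompt = "Review stylistic similarities between these three stories:\n\n" + \
    "\n\n---\n\n".join(f"{k}:\n{v}" for k, v in outputs.items())
print(judge_prompt[:500])
```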
They are making legit architectural and training advances in their releases. They don't have the huge data caches that the American labs built up before people started locking down their data, and they don't (yet) have the huge budgets the American labs have for post-training, so it's only natural to do data augmentation. Now that capital allocation is being accelerated for AI labs in China, I expect Chinese models to start leapfrogging to #2 overall regularly. #1 will likely always be OpenAI or Anthropic (for the next 2-3 years at least), but well-timed releases from Z.AI or Moonshot have a very good chance to hold second place for a month or two.
But it doesn't, except on certain benchmarks that likely involve overfitting.
Open source models are nowhere to be seen on ARC-AGI. Nothing above 11% on ARC-AGI 1. https://x.com/GregKamradt/status/1948454001886003328
I have used a lot of them. They’re impressive for open weights, but the benchmaxxing becomes obvious. They don’t compare to the frontier models (yet) even when the benchmarks show them coming close.
Has the difference between performance in "regular benchmarks" and ARC-AGI been a good predictor of how good models "really are"? Like if a model is great in regular benchmarks and terrible in ARC-AGI, does that tell us anything about the model other than "it's maybe benchmaxxed" or "it's not ARC-AGI benchmaxxed"?
This could be a good thing. ARC-AGI has become a target for American labs to train on. But there is no evidence that improvements on ARC performance translate to other skills. In fact, there is some evidence that it hurts performance: when OpenAI trained a version of o1 on ARC, it got worse at everything else.
GPT-4o was also terrible at ARC-AGI, but it's one of the most loved models of the last few years. Honestly, I'm a huge fan of the ARC-AGI series of benchmarks, but I don't believe it corresponds directly to the kinds of qualities most people assess when using LLMs.
It was terrible at a lot of things. It was beloved because when you say "I think I'm the reincarnation of Jesus Christ" it will tell you "You know what... I think I believe it! I genuinely think you're the kind of person that appears once every few millennia to reshape the world!"
That's not because 4o is good at things; it's because it's pretty much the most sycophantic model, and people fall for a model incorrectly agreeing with them more easily than for a model correctly calling them out.
Because ARC-AGI involves de novo reasoning over a restricted and (hopefully) unpretrained territory, in 2D space. Not many people use LLMs as more than a better Wikipedia, Stack Overflow, or autocomplete...
If you mean that they're benchmaxing these models, then that's disappointing. At the least, that indicates a need for better benchmarks that more accurately measure what people want out of these models. Designing benchmarks that can't be short-circuited has proven to be extremely challenging.
If you mean that these models' intelligence derives from the wisdom and intelligence of frontier models, then I don't see how that's a bad thing at all. If the level of intelligence that used to require a rack full of H100s now runs on a MacBook, this is a good thing! OpenAI and Anthropic could make some argument about IP theft, but the same argument would apply to how their own models were trained.
Running the equivalent of Sonnet 4.5 on your desktop is something to be very excited about.
> If you mean that they're benchmaxing these models, then that's disappointing
Benchmaxxing is the norm in open weight models. It has been like this for a year or more.
I’ve tried multiple models that are supposedly Sonnet 4.5 level and none of them come close when you start doing serious work. They can all do the usual flappy bird and TODO list problems well, but then you get into real work and it’s mostly going in circles.
Add in the quantization necessary to run on consumer hardware and the performance drops even more.
Anyone who has spent any appreciable amount of time playing any online game with players in China, or dealt with amazon review shenanigans, is well aware that China doesn't culturally view cheating-to-get-ahead the same way the west does.
I’m still waiting for real world results that match Sonnet 4.5.
Some of the open models have matched or exceeded Sonnet 4.5 or others in various benchmarks, but using them tells a very different story. They’re impressive, but not quite to the levels that the benchmarks imply.
Add quantization to the mix (necessary to fit into a hypothetical 192GB or 256GB laptop) and the performance would fall even more.
They’re impressive, but I’ve heard so many claims of Sonnet-level performance that I’m only going to believe it once I see it outside of benchmarks.
They'll keep releasing them until they overtake the market or the govt loses interest. Alibaba probably has staying power, but not companies like DeepSeek's owner.
The question with quants is: will they lobotomize it beyond the point where it would be better to switch to a smaller model like GPT-OSS 120B, which comes prequantized to ~60GB?
In general, quantizing down to 6 bits gives no measurable loss in performance. Down to 4 bits gives a small measurable loss. It starts dropping faster at 3 bits, and at 1 bit it can fall below the performance of the next smaller model in the family (where families tend to have model sizes at factors of 4 in parameter count).
So in the same family, you can generally quantize all the way down to 2 bits before you want to drop down to the next smaller model size.
Between families, there will obviously be more variation. You really need evals specific to your use case if you want to compare them, as different model families can perform quite differently on different types of problems, and because of benchmark optimization it's really helpful to have your own evals to test with.
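To put numbers on the RAM side of that tradeoff, here's the back-of-the-envelope math (a sketch; the 1.1x overhead factor for KV cache and runtime buffers is my own assumption, and real GGUF sizes vary by quant scheme):

```python
# Rough weight-memory estimate: params * bits_per_weight / 8, plus overhead.
def weights_gb(params_b: float, bits: float, overhead: float = 1.1) -> float:
    """params_b: parameters in billions; bits: average bits per weight."""
    return params_b * 1e9 * bits / 8 / 1e9 * overhead

for params in (120, 1000):  # e.g. GPT-OSS 120B vs a ~1T-param MoE
    for bits in (16, 8, 6, 4, 3, 2):
        print(f"{params:5}B @ {bits:2} bits ~ {weights_gb(params, bits):6.0f} GB")

# 120B @ 4 bits ~ 66 GB (consistent with the ~60 GB GPT-OSS 120B download);
# a 1T model even at 2 bits is ~275 GB, so it never fits a 128 GB machine.
```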
NVIDIA is showing training at 4 bits (NVFP4), and 4-bit quants have been standard for running LLMs at home for quite a while because the performance was good enough.
I mean, GPT-OSS is delivered as a 4-bit model, and apparently they even trained it at 4 bits. Many train at 16 bits because it provides improved stability for gradient descent, but there are methods that allow efficient training at even smaller quantizations.
There was a paper I had been looking at that demonstrated what I mentioned: it showed only imperceptible changes down to 6-bit quants, then performance decreasing more and more rapidly until it crossed below the next smaller model at 1 bit. Unfortunately, I can't seem to find it again.
There's this article from Unsloth, where they show MMLU scores for quantized Llama 4 models. They're measured against an 8-bit base model, so not quite the same as comparing to 16 bits, but you see no reduction in score at 6 bits, while it starts falling after that. https://unsloth.ai/docs/basics/unsloth-dynamic-2.0-ggufs/uns...
Anyhow, like anything in machine learning, if you want to be certain you probably need to run your own evals. But when researching, I found enough evidence that down to 6-bit quants you lose very little performance, and that even at much smaller quants the number of parameters tends to matter more than the quantization (all the way down to 2 bits), that it works as a good rule of thumb. I'll generally grab a 6 to 8 bit quant to save on RAM without really thinking about it, and I'll try models down to 2 bits if I need to in order to fit them into my system.
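And "run your own evals" doesn't need a framework. Something like this is enough to compare quants on your own tasks (a sketch using llama-cpp-python; the GGUF filenames and prompts are placeholders for whatever you actually care about):

```python
# Compare quant levels of the same model on your own prompts.
# Filenames are placeholders; swap in the GGUFs you actually have.
from llama_cpp import Llama

QUANTS = ["model.Q8_0.gguf", "model.Q6_K.gguf", "model.Q4_K_M.gguf", "model.Q2_K.gguf"]
PROMPTS = [
    "Refactor this function to be tail-recursive: ...",
    "Summarize the tradeoffs of MoE vs dense models in two sentences.",
]

for path in QUANTS:
    llm = Llama(model_path=path, n_ctx=4096, verbose=False)
    print(f"=== {path} ===")
    for p in PROMPTS:
        out = llm(p, max_tokens=256, temperature=0.0)  # greedy, for comparability
        print(out["choices"][0]["text"].strip()[:200], "\n")
    del llm  # free the weights before loading the next quant
```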
This isn't the paper I was thinking of, but it shows a similar trend to the one I was looking at. In this particular case, even down to 5 bits showed no measurable reduction in performance (actually a slight increase, but that probably just means you're within the noise of what this test can distinguish), then you see performance dropping off rapidly at the various 3-bit quants: https://arxiv.org/pdf/2601.14277
There was another paper that did a similar test, but with several models in a family, and all the way down to 1 bit, and it was only at 1 bit that it crossed over to having worse performance than the next smaller model. But yeah, I'm having a hard time finding that paper again.
Why do you think ChatGPT doesn't use a quant? GPT-OSS, which OpenAI released as open weights, uses a 4-bit quant, which is in some ways a sweet spot: it loses a small amount of performance in exchange for a very large reduction in memory usage compared to something like fp16. I think it's perfectly reasonable to expect that ChatGPT also uses the same technique, but we don't know, because their SOTA models aren't open.
Curious what the prefill and token generation speed is. Apple hardware already seems embarrassingly slow for the prefill step, and OK for token generation, but that's with way smaller models (1/4 the size), so at this size? It might fit, but I'm guessing it would be all but unusable, sadly.
Yeah, I'm guessing the Mac users still aren't very fond of sharing how long the prefill takes. They usually only share the tok/s output, never the input.
It can run and the token generation is fast enough, but the prompt processing is so slow that it makes them next to useless. That is the case with my M3 Pro at least, compared to the RTX I have on my Windows machine.
This is why I'm personally waiting for M5/M6 to finally have some decent prompt processing performance, it makes a huge difference in all the agentic tools.
Just add a DGX Spark for prefill and stream to the M3 using Exo. An M5 Ultra should have about the same FP4 compute as a DGX Spark, and you don't have to wait until Apple releases it. Also, a 128GB "appliance" like that is now "super cheap" given RAM prices, and that won't last long.
> with little power and without triggering its fan.
This is how I know something is fishy.
No one cares about this. This became a new benchmark when Apple couldn't compete anywhere else.
I understand if you already made the mistake of buying something that doesn't perform as well as you were expecting, you are going to look for ways to justify the purchase. "It runs with little power" is on zero people's Christmas list.
Exactly. The emperor has no clothes. The largest investments in US tech history, and yet there's less than a year of moat. OpenAI and Anthropic will not be able to compete with Chinese server farms, and so the US strategy is misplaced investment that will come home to roost.
Surely this is the elephant in the room, but the point here is that Apple has control over its ecosystem, so it may be able to make sandboxing, entitlements, and transparency good enough in the apps that the bot can access.
Like I said: sandboxing doesn't solve the problem.
As long as the agent creates more than just text, it can leak data. If it can access the internet in any manner, it can leak data.
The models are extremely creative and good at figuring out stuff, even circumventing safety measures that are not fully airtight. Most of the time they catch the deception, but in some very well crafted exploits they don't.
The other realistic setup is $20k, for a small company that needs a private AI for coding or other internal agentic use: two Mac Studios connected over Thunderbolt 5 RDMA.
That won’t realistically work for this model. Even with only ~32B active params, a 1T-scale MoE still needs the full expert set available for fast routing, which means hundreds of GB to TBs of weights resident. Mac Studios don’t share unified memory across machines, Thunderbolt isn’t remotely comparable to NVLink for expert exchange, and bandwidth becomes the bottleneck immediately. You could maybe load fragments experimentally, but inference would be impractically slow and brittle. It’s a very different class of workload than private coding models.
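Rough numbers behind that, as a sketch (the 4-bit quant and the ~800 GB/s unified-memory bandwidth are my assumptions; the parameter counts are the ~1T total / ~32B active figures above):

```python
# Memory math for a ~1T-param MoE, assuming a 4-bit quant throughout.
total_params = 1.0e12    # ~1T total parameters
active_params = 32e9     # ~32B activated per token
bits = 4

weights_gb = total_params * bits / 8 / 1e9    # ~500 GB must stay resident
touched_gb = active_params * bits / 8 / 1e9   # ~16 GB of weights read per token

mem_bw = 800  # GB/s, roughly M-series Ultra unified memory (assumed figure)
print(f"resident weights ~ {weights_gb:.0f} GB; per-token reads ~ {touched_gb:.0f} GB")
print(f"bandwidth-bound ceiling ~ {mem_bw / touched_gb:.0f} tokens/s per box")
# ~500 GB of resident weights means multiple 256 GB Macs no matter what,
# and the ~50 tok/s single-box ceiling is consistent with the 21-30 tok/s
# multi-Mac reports below.
```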
People are running the previous Kimi K2 on 2 Mac Studios at 21 tokens/s, or 4 Macs at 30 tokens/s. It's still premature, but not a completely crazy proposition for the near future, given the rate of progress.
If "fast" routing is per-token, the experts can just reside on SSD's. the performance is good enough these days. You don't need to globally share unified memory across the nodes, you'd just run distributed inference.
Anyway, in the future your local model setups will just be downloading experts on the fly from experts-exchange. That site will become as important to AI as downloadmoreram.com.
Prompt processing/prefill can even get some speedup from local NPU use most likely: when you're ultimately limited by thermal/power limit throttling, having more efficient compute available means more headroom.
I asked GPT for a rough estimate to benchmark prompt prefill on an 8,192 token input.
• 16× H100: 8,192 / (20k to 80k tokens/sec) ≈ 0.10 to 0.41s
• 2× Mac Studio (M3 Max): 8,192 / (150 to 700 tokens/sec) ≈ 12 to 55s
These are order-of-magnitude numbers, but the takeaway is that multi H100 boxes are plausibly ~100× faster than workstation Macs for this class of model, especially for long-context prefill.
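If you want to redo that arithmetic with your own throughput numbers, it's a one-liner per setup (the rates below are the rough estimates quoted above, not measurements):

```python
# Prefill latency = prompt tokens / prefill throughput.
prompt_tokens = 8_192

# (label, low tok/s, high tok/s) -- throughput ranges quoted above.
setups = [
    ("16x H100",               20_000, 80_000),
    ("2x Mac Studio (M3 Max)",    150,    700),
]

for name, lo, hi in setups:
    print(f"{name}: {prompt_tokens / hi:.2f}s to {prompt_tokens / lo:.1f}s")
# 16x H100: ~0.10s to 0.41s; the Macs: ~11.7s to 54.6s -- roughly a 100x gap.
```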
That's great for affordable local use but it'll be slow: even with the proper multi-node inference setup, the thunderbolt link will be a comparative bottleneck.
Coincidence or not, let's just marvel for a second at the amount of magic/technology being given away for free... and how liberating and different this is from OpenAI and the others that stayed closed to "protect us all".
There have been so many moments that folks not really into LLMs have missed.
DeepSeek R1 was great, but so were all the "incremental" improvements: v3-0324, v3.1, v3.1-terminus, and now v3.2-speciale. With that, this is the 3rd great Kimi model. GLM has been awesome since 4.5, with 4.5, 4.5-air, 4.6, 4.7 and now 4.7 flash. Minimax-M2 has also been making waves lately... and I'm just talking about the Chinese models, without even adding the 10+ Qwen models. Outside of Chinese models, mistral-small/devstral, gemma-27b-it, gpt-oss-120b, and seed-os have been great, and I'm still talking about just LLMs, not image, audio, or special-domain models like deepseek-prover and deepseek-math. It's really a marvel what we have at home. I cancelled my OpenAI and Anthropic subscriptions 2 years ago, once they started calling for regulation of open models, and I haven't missed them one bit.
A Chinese state that maybe sees open collaboration as the way to nullify any US lead in the field; concurrently, if the next "search winner" is built on their model, it carries along the Chinese worldview that Taiwan belongs to China and the Tiananmen Square massacre never happened.
Also, their license says that if you have a big product you need to promote them. Remember how Google "gave away" site search widgets? That was perhaps one of the major ways they gained recognition as the search leader.
OpenAI/NVidia is the Pets.com/Sun of our generation: insane valuations, stupid spend, expensive options, expensive hardware, and so on.
Sun hardware bought for $50k to run websites in 2000 is less capable than perhaps a $5/month VPS today?
"Scaling to AGI/ASI" was always a fools errand, best case OpenAI should've squirreled away money to have a solid engineering department that could focus on algorithmic innovations but considering that Antrophic, Google and Chinese firms have caught up or surpassed them it seems they didn't.
Once things blow up, the closed options that had somewhat sane/solid model research that handles things better will be left, along with a ton of new competitors running modern/cheaper hardware and just using models as building blocks.
Scaling depends on hardware, so cheaper hardware on a compute-per-watt basis only makes scaling easier. There is no clear definition of AGI/ASI but AI has already scaled to be quite useful.
The One China policy refers to a United States policy of strategic ambiguity regarding Taiwan.[1] In a 1972 joint communiqué with the PRC, the United States "acknowledges that all Chinese on either side of the Taiwan Strait maintain there is but one China and that Taiwan is a part of China" and "does not challenge that position."
The One China policy is a fiction of foreign policy statecraft, designed to sideline the issue without having to actually deal with it. It is quite clear that apart from the official fiction there is a real policy that is not One China. This is made clear by the weapons sales to Taiwan that are specifically calibrated to make a Chinese military action harder.
Existence of an independent and effectively sovereign state on the island of Taiwan (however one calls it) is a fact. Whatever doublespeak governments of other countries or international organizations engage in due to political reasons does not change that.
No, you are misinformed. The facts are that both the Chinese mainland authority (the CPC) and the government of Chinese Taipei each lay claim to the entirety of the mainland+island landmass. Get that? They both regard themselves and themselves alone as being sovereign over the unified territory.
That is why Chinese Taipei is not independent, and why Beijing has said that if they threaten independence (that is to say, if Chinese Taipei relinquishes its claim over the mainland -- in a way seceding from the union state), the mainland would move to take full political control over the island.
It's a highly unusual situation -- there are countries which have governments in exile, but this is not that. You can think of this as one country with two separate competing governments with irreconcilable claims. Clearly, given the mainland's economic (and consequent military) rise, Chinese Taipei cannot take over the mainland, and again considering the power differential they would be insane to declare full independence. Which is why this bizarre situation persists.
Furthermore, the number of third-party countries which recognise Chinese Taipei's independence is shrinking and shrinking (and so it should be), because properly speaking the sane thing to do is recognise that there is only one China (since both the authority of mainland China (the CPC) and the government of Taiwan assert that there is but one China) but deal with each authority separately. Economically and politically speaking, this is how things play out.
Because the situation is so unusual, I think it is fair to say that most people just assume Taiwan is a breakaway province. And it's not surprising that they do, because that's what happens normally in these situations. That would be the case if Taiwan amended its constitution to explicitly relinquish its claim over the mainland, but they are not going to do that unless they are prodded to by an outside third party, or they suddenly get a wave of nationalism, or they suddenly become collectively suicidal. So this highly unusual and bizarre status quo persists.
This is my interaction with ChatGPT (obviously it would be pointless talking to DeepSeek, etc about this)
===
ME: does chinese taipei lay claim to mainland china?
ChatGPT: Yes — but it depends on *which “Chinese Taipei” you mean*:
### Chinese Taipei as the Republic of China (Taiwan)

* The government based in Taiwan is officially the **Republic of China (ROC)**. Under its **old constitution** (dating back to when the ROC ruled all of China), it *theoretically* still contains language implying sovereignty over all of China — including the mainland. In practice, modern Taiwanese governments (especially under the Democratic Progressive Party) do **not actively assert a real claim to govern mainland China** and focus on governing Taiwan and nearby islands. ([House of Commons Library][1])
* So while there is *historical constitutional language* about the ROC’s territorial scope that could be interpreted to include mainland China, *Taipei today does not pursue actual governance or enforcement of such a claim* in international relations.

### People’s Republic of China (PRC) claims Taiwan

* The **PRC** (mainland China) claims that Taiwan is part of China and should eventually be “reunified,” insisting on a **One-China principle** that includes Taiwan under its sovereignty. This is the basis of most countries’ diplomatic relations with Beijing instead of Taipei. ([House of Commons Library][1])

### Key nuance

* Taiwan (the ROC government) effectively functions as a **separate political entity with its own government, military, and democratic system** and does **not control or claim to govern mainland China’s territory in practice**, even if older constitutional wording could be interpreted that way. ([House of Commons Library][1])
* Meanwhile, the PRC claims Taiwan as part of its territory, a position Taiwan strongly disputes. ([taipeitimes.com][2])

ME: What does "it theoretically still contains language implying" mean? It either does or it doesn't.

ChatGPT: You’re right to push back — that phrasing was imprecise. Let’s be exact.

### Short answer

**Yes, the Republic of China (ROC) constitution *does* still claim all of China, including the mainland.**
That is a matter of constitutional fact, not theory.

### What the constitution actually says

* The **ROC Constitution (1947)** was written when the ROC claimed to be the sole government of *all* China.
* It *defines the state’s territory as China*, without listing boundaries or excluding the mainland.
* *There has never been a constitutional amendment redefining the ROC’s territory as limited to Taiwan, Penghu, Kinmen, and Matsu.*

So legally speaking:

> **Mainland China remains part of ROC territory under the constitution.**

### Where the confusion comes from

The confusion arises because of **later constitutional add-ons**, not because the claim disappeared.
Since the 1990s, Taiwan adopted the **Additional Articles of the Constitution**, which:

* Suspend application of many provisions *with respect to the mainland*
* Define elections, offices, and governance as applying only to the “*free area of the Republic of China*”

Crucially:

* These articles **do not renounce the mainland**
* They merely acknowledge that the ROC *does not currently govern it*

So the situation is:

| Aspect | Status |
| --- | --- |
| Constitutional territorial claim | **All of China (including mainland)** |
| Practical governance | **Taiwan and nearby islands only** |
| Active effort to reclaim mainland | **No** |
| Formal constitutional renunciation | **No** |

### Why people hedge with language

People (especially diplomats and scholars) often say things like *“theoretically implies”* to avoid the blunt conclusion that:

* **Taiwan’s constitution still claims mainland China**, which is politically awkward given Taiwan’s current identity and international position.

But legally, there’s no ambiguity:

* **It either does or it doesn’t — and it does.**

### One-sentence summary

**Chinese Taipei (the ROC) constitutionally claims mainland China, but politically and practically does not pursue or assert that claim.**

If you want, I can also explain **why Taiwan has never amended the constitution to drop the claim**, because that’s a fascinating mix of law, identity, and geopolitics.
===
(Sorry for the dreadful formatting). I am more than happy to be corrected on this.
None of this changes the fact that there is an independent country on the island of Taiwan. It really doesn't matter who you or someone else is calling it, that's just an observable fact.
I love how Tiananmen square is always brought up as some unique and tragic example of disinformation that could never occur in the west, as though western governments don't do the exact same thing with our worldview. Your veneer of cynicism scarcely hides the structure of naivety behind.
The difference is that, in the west, there's an acceptable counter-narrative. I can tell you that Ruby Ridge and Waco never should've happened and were examples of government overreach and massacre of its own citizens. Or <insert pet issue with the government here>.
Speculating: there are two connected businesses here, creating the models, and serving the models. Outside of a few moneyed outliers, no one is going to run this at home. So at worst opening this model allows mid-sized competitors to serve it to customers from their own infra -- which helps Kimi gain mindshare, particularly against the large incumbents who are definitely not going to be serving Kimi and so don't benefit from its openness.
Given the shallowness of moats in the LLM market, optimizing for mindshare would not be the worst move.
Moonshot's (Kimi's owner) investors are Alibaba/Tencent et al. The Chinese market is stupidly competitive, and there's a general attitude of "household name will take it all". Getting there, however, requires a WeChat-esque user base, one way or another. If it's paid, there'll be friction and it won't work. Plus, it undermines a lot of other companies, which is a win for a lot of people.
I think there is a book (Chip War) about how the USSR did not effectively participate in staying at the edge of the semiconductor revolution. And they have suffered for it.
China has decided they are going to participate in the LLM/AGI/etc revolution at any cost. So it is a sunk cost, and the models are just an end product; any revenue is validation and great, but not essential. The cheaper price points keep their models used and relevant. It challenges the other (US, EU) models to innovate and keep ahead to justify their higher valuations (both the monthly plan and the investor kind). Once those advances are made, they can be brought back to their own models. In effect, the currently leading models are running from a second-place candidate who never gets tired and eventually does what they do at a lower price point.
In some way, the US won the cold war by spending so much on military that the USSR, in trying to keep up, collapsed. I don't see any parallels between that and China providing infinite free compute to their AI labs, why do you ask?
All economically transformative technologies have done something similar. If it's privatized, it's not gonna be transformative across the industry. GPS, the internet, touchscreens, AI voice assistants, microchips, LCDs, etc. were all publicly funded (or made by Bell Labs, which had a state-mandated monopoly that forced them to open up their patents).
The economist Mariana Mazzucato wrote a great book about this called The Entrepreneurial State: Debunking Public vs. Private Sector Myths
> What amazes me is why would someone spend millions to train this model and give it away for free. What is the business here?
How many millions did Google spend on Android (acquisition and salaries), only to give it away for free?
Usually, companies do this to break into a monopolized market (or one that's at risk of becoming one), with openness as a sweetener. IBM with Linux to break UNIX-on-big-iron domination, Google with Android vs. iPhone, Sun with OpenSolaris vs. Linux-on-x86.
It's another state project funded at the discretion of the party.
If you look at past state projects, profitability wasn't really considered much. They are notorious for a "money hose until a diamond is found in the mountains of waste" approach.
I am convinced that was mostly just marketing. No one uses deepseek as far as I can tell. People are not running it locally. People choose GPT/Gemini/Claude/Grok if you are giving your data away anyway.
The biggest source of my conspiracy theory is that I made a reddit thread asking a question, "Why all the deepseek hype" or something like that. And to this day, I get odd 'pro-DeepSeek' comments from accounts only used every few months. It's not like this was some highly upvoted topic that shows up in 'Top'.
I'd put that deepseek marketing on-par with an Apple marketing campaign.
Except that, on OpenRouter, DeepSeek has always maintained a top-10 ranking. Although I haven't used it personally, I believe their main advantage over other models is price/performance.
There are a lot of applications where you really just want a cheap and efficient model that's still somewhat competitive and that's exactly the niche DeepSeek fulfills the best.
I mean, there are credible safety issues here. A Kimi fine-tune will absolutely be able to help people do cybersecurity related attacks - very good ones.
In a few years, or less, biological attacks and other sorts of attacks will be plausible with the help of these agents.
When we look back in a few years and ask the question of who actually got to pay for the Epstein crimes and coverups, we come to the surprising answer that it is the Greenlanders and other innocent societies that got ripped apart by this maniac and his supporters.
It's not a failure of IPv6 but a failure of society.
We all thought the internet would become decentralized and that everyone would have an IP and a funky website. Instead, social media took over: big tech and a few big discussion sites where we all must fit our digital life, watch ads, and share our data to become a good product for all the others to consume.