Hacker News | WithinReason's comments

My guess would be in the ballpark of 10,000 times less efficient

If you look at a separate trend for the smaller Sonnet models, you can see a rapid trend

3.7 to 4.5 looks pretty flat here.

> If you know anything about NNs and about average code quality, that LLMs never will be able to generate high quality code.

I recommend looking into a subject called "reinforcement learning", the way AI acquired superhuman skills in chess, go, etc.


Obviously I'm familiar with RL; I've written multiple training pipelines in my day. In order to gain that "superhuman skill" using RL, you need to define fitness functions and provide environments that give you the feedback signal used for training. Go and chess have clear rules and an environment that provides a clear signal of success. I'm waiting to see this for coding. I'm not saying it's impossible, just orders of magnitude harder.

> In any graphics application trigonometric functions are frequently used.

Counterpoint from the man himself, "avoiding trigonometry":

https://iquilezles.org/articles/noacos/



> a fundamentally different compute profile on commodity CPU

In what way? On modern processors, a fused multiply-add (FMA) instruction generally has the same execution throughput as a plain addition instruction.


You drop the memory-throughput requirements because of the packed bit representation, so the FMA can become the bottleneck, and you bypass the problem of needing to widen the bits to whatever floating-point format the FMA instruction needs.

Typically, for 1-bit matmul you can get away with XORs and popcounts, which should have a better throughput profile than FMA once you take the SIMD width of the inputs/outputs into account.


Yes, but this is not 1-bit matmul, it's 1.58 bits with expensive unpacking.

The title and the repo say 1-bit when they mean 1.58-bit ternary values. It doesn't change any of my arguments (still XORs and popcounts).

How do you do a ternary matmul with popcnt on 1.58-bit packed data?

Assume 2 bits per value (the first bit is the sign, the second bit is the magnitude):

    actv = A[_:1] & B[_:1]
    sign = A[_:0] ^ B[_:0]
    dot  = pop_count(actv & ~sign) - pop_count(actv & sign)

It can probably be made more efficient by taking a column-first format.

Since we are in CPU land, we mostly deal with dot products sized to fit the cache. I don't assume we have a tiled matmul instruction, which would be unlikely to support this weird 1-bit format anyway.


Haven't looked closely, but on modern x86 CPUs it might be possible to do much better with the gf2p8affineqb instruction (GFNI), which lets us do 8x8 bit-matrix multiplications efficiently. Not sure how you'd handle the 2-bit part, of course.

This is 11 bit ops and a subtract, which I assume is ~11 clocks, while you can just do:

    l1 = dot(A[:11000000], B[:11000000])
    l2 = dot(A[:00110000], B[:00110000])
    l3 = dot(A[:00001100], B[:00001100])
    l4 = dot(A[:00000011], B[:00000011])

    result = l1 + l2 * 4 + l3 * 16 + l4 * 64

which is 8 bit ops and 4 8-bit dots, likely 8 clocks with less serial dependence


The win is in how many weights you process per instruction and how much data you load.

So it's not that individual ops are faster; it's that the packed representation lets each instruction do more useful work, and you're moving far less data from memory to do it.
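To put rough numbers on the data-volume point, assuming an fp16 baseline against the 2-bit packing discussed upthread (the matrix size is illustrative):

```python
# Bytes streamed from memory per 4096x4096 weight matrix.
params = 4096 * 4096
fp16_bytes = params * 2            # 2 bytes per weight
packed_bytes = params * 2 // 8     # 2 bits per weight, 4 weights per byte
print(fp16_bytes // packed_bytes)  # prints 8: 8x less data per matrix
```

On a memory-bound CPU matmul, that 8x reduction in streamed bytes is where most of the win comes from, independent of per-instruction latency.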


Is the BitNet encoding more information-dense per byte, perhaps? CPUs have slow memory buses, so it would eke more use out of the bandwidth?

Here is a paper that made a similar observation recently:

https://www.alphaxiv.org/abs/2512.19941


Thanks for the link!

I think that these models have to learn to use their parameters efficiently, and the best way to do that is to 'evolve' (yes, a bad word for it) structures over pretraining time. Unfortunately, they don't have a way to access these structures 'from the inside'. I hope this new approach lets us boost performance in a more experimentally rigorous way.


I think the recurrence is a consequence of using a residual connection, seems like that makes the representation stay consistent across layers

Very cool, thanks for sharing! Recovering 96% using just two blocks on IMN-1k, wow!

Exactly, you can use bitcoin, even cash. You can even add credits with PayPal or a credit card, in which case Proton (I assume) won't remember your payment data. But if you attach credit card info permanently to your account then it can be retrieved.

They might be referring to using a quantised version, which gives them high performance while the accuracy drop matters less to them.

Competing with your own customers is not a good idea, especially before the bubble pops.

Just buy a keyboard case for it, no need for permanent attachment. Or carry a tiny bluetooth keyboard in your pocket:

https://www.amazon.co.uk/dp/B0FWC8G2Q8/


Ah, Doohoeek, a time-honored, trusted brand.

I'd rather buy from Doohickey.
