so it's: output = layers(layers(layers(layers(input)))) instead of the classical...

oofbey · 2026-01-04T03:38:44 1767497924

Yeah if layers() is a shortcut for layer4(layer3(layer2(layer1(input)))). But sometimes it’s only

output = layers(input)

Or

output = layers(layers(input))

Depends on how difficult the token is.

remexre · 2026-01-04T16:53:25 1767545605

Or more like,

    x = tokenize(input)
    i = 0
    do {
      finish, x = layers(x)
    } while(!finish && ++i < t_max);
    output = lm_head(x)

oofbey · 2026-01-05T00:43:08 1767573788

That’s closer still. But even closer would be:

    x = tokenize(input)
    i = 0
    finish = 0
    do {
      p, x = layers(x)
      finish += p
    } while(finish < 0.95 && ++i < t_max);
    output = lm_head(x)

Except the stop probabilities don’t accumulate linearly like that; it’s more like a weighted coin model.
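For what it's worth, here is a rough Python sketch of how I read "weighted coin". I'm assuming each pass flips a coin that says "stop" with probability p, so the chance of having stopped by step n is 1 - (1 - p_1)(1 - p_2)...(1 - p_n) rather than the plain sum. The names `run_adaptive` and `toy_layers` are made up for illustration, and `toy_layers` is only a stand-in for the real block of layers, not anyone's actual model.

```python
def run_adaptive(x, layers, t_max, threshold=0.95):
    """Re-apply `layers` until the cumulative stop probability reaches `threshold`.

    Assumption: `layers(x)` returns (p, new_x), where p is the probability
    that this pass is the last one needed (one weighted coin flip).
    """
    keep_going = 1.0  # probability that no flip so far has said "stop"
    steps = 0
    while steps < t_max:
        p, x = layers(x)
        steps += 1
        keep_going *= 1.0 - p
        if 1.0 - keep_going >= threshold:
            break
    return x, steps, 1.0 - keep_going


def toy_layers(x):
    # Hypothetical stand-in: a constant 50% stop probability and a dummy update.
    return 0.5, x + 1


x, steps, stopped = run_adaptive(0, toy_layers, t_max=10)
print(x, steps, stopped)
```

With a constant p of 0.5, the linear sum would already cross 0.95 after two passes, while the coin-flip version needs five (0.5, 0.75, 0.875, 0.9375, 0.96875).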