Hacker News

I'm just speculating here since I don't know what or where the code is but since inference is still autoregressive;

given [a b c] sample [d]

distribution of [d] could be over [reasoning token] | [vocab token]

then at next step you have

[a b c d] and each has an embedding vector associated

so when you go to sample [e] it's a function of [a b c d]