I'm just speculating here, since I don't know what or where the code is, but inference is still autoregressive:
given [a b c] sample [d]
distribution of [d] could be over [reasoning token] | [vocab token]
then at next step you have
[a b c d] and each has an embedding vector associated
so when you go to sample [e] it's a function of [a b c d]
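
To make that concrete, here's a rough toy sketch of the loop I mean. It's not the actual code (which I haven't seen), just an illustration: the sizes, the `logits_fn` stand-in for the model, and the idea that reasoning tokens simply occupy extra ids after the normal vocab are all my assumptions.

```python
import math
import random

# All sizes below are made up for illustration.
VOCAB_SIZE = 8       # ordinary vocab tokens: ids 0..7
NUM_REASONING = 4    # hypothetical reasoning tokens: ids 8..11
TOTAL = VOCAB_SIZE + NUM_REASONING
DIM = 16

_init_rng = random.Random(0)
# One embedding vector per token id, vocab and reasoning tokens alike.
EMBED = [[_init_rng.gauss(0, 1) for _ in range(DIM)] for _ in range(TOTAL)]


def logits_fn(context):
    """Toy stand-in for the model: mean-pool the context embeddings,
    then score every token id by dot product with its embedding."""
    pooled = [sum(EMBED[t][i] for t in context) / len(context) for i in range(DIM)]
    return [sum(p * e for p, e in zip(pooled, EMBED[t])) for t in range(TOTAL)]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def sample_next(context, rng):
    """Sample one token from a single distribution over
    [vocab tokens] + [reasoning tokens]."""
    probs = softmax(logits_fn(context))
    return rng.choices(range(TOTAL), weights=probs, k=1)[0]


def generate(prompt, steps, seed=0):
    """Autoregressive loop: each new token is conditioned on everything so far."""
    rng = random.Random(seed)
    seq = list(prompt)
    for _ in range(steps):
        seq.append(sample_next(seq, rng))
    return seq


def is_reasoning(tok):
    return tok >= VOCAB_SIZE


a, b, c = 0, 1, 2
seq = generate([a, b, c], steps=2)  # [a b c] -> [a b c d] -> [a b c d e]
print(seq, [is_reasoning(t) for t in seq])
```

The point is just that [d] goes back into the context like any other token, whether it came from the reasoning part or the vocab part of the distribution, so [e] ends up being a function of [a b c d].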