Ok, so my understanding: you can have the network generate a token that is then fed back as input to future token generation, alongside each output token it generates
These are called reasoning tokens
Initial results with gpt2 are promising
You can generalize this to let the network decide when to generate reasoning tokens (I'm unclear on how). There were also multiple lines in the loss graph for runs with reasoning tokens that I don't quite understand (what's reasoning 1 vs. reasoning 3? Is it the ratio of reasoning tokens? Something else?)
Reasoning 1 vs. 3 is the number of reasoning tokens inserted between each "text" token. One reasoning token is exactly what you see in the picture explanation in the article.
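Concretely, a toy sketch of that interleaving (the token string `<reasoning>` is my placeholder, not necessarily what the article uses):

```python
def interleave(text_tokens, k, reasoning_token="<reasoning>"):
    """Put k reasoning tokens between consecutive text tokens."""
    out = []
    for i, tok in enumerate(text_tokens):
        out.append(tok)
        if i < len(text_tokens) - 1:
            out.extend([reasoning_token] * k)
    return out

print(interleave(["the", "cat", "sat"], k=1))  # "reasoning 1"
print(interleave(["the", "cat", "sat"], k=3))  # "reasoning 3"
```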
The generalization comes from making the network predict a <start reasoning> token and end the reasoning sequence only when it predicts an <end reasoning> token. The training dataset for the upcoming experiment contains examples like: """ Q: What is 3+2? A: 3 + 2 is equal to <start reasoning> <reasoning> ... <reasoning> <end reasoning> 5 """
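For what it's worth, here's a minimal sketch of how those markers could be wired up, assuming the Hugging Face transformers GPT-2 stack (the exact token strings are my guesses, not confirmed from the article):

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Hypothetical marker strings; the article may name them differently.
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<start reasoning>", "<reasoning>", "<end reasoning>"]}
)
model.resize_token_embeddings(len(tokenizer))  # add embedding rows for the new tokens

example = ("Q: What is 3+2? A: 3 + 2 is equal to "
           "<start reasoning> <reasoning> <reasoning> <end reasoning> 5")
input_ids = tokenizer(example, return_tensors="pt").input_ids  # ready for LM training
```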
Wasting two tokens on start/end reasoning seems expensive to me (a priori)
I am curious what that would yield, though; in some ways that would be the most fun to analyze (when does it think a lot??)
I would also be curious to see at what point you hit diminishing returns from reasoning tokens (e.g., at a 1:10 reasoning-to-text ratio? More?)
I'm just speculating here, since I don't know what or where the code is, but since inference is still autoregressive:
given [a b c], sample [d]
the distribution of [d] could be over [reasoning token] | [vocab token]
then at the next step you have
[a b c d], and each token has an embedding vector associated with it
so when you go to sample [e], it's a function of [a b c d]
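In code, that speculation might look something like this (purely my guess at the mechanics; `reasoning_ids` and the model interface are assumptions, not the author's code):

```python
import torch

def sample_with_reasoning(model, input_ids, reasoning_ids, max_new_tokens=50):
    """Plain autoregressive sampling where reasoning tokens share the softmax
    with vocab tokens: they stay in the context (so they condition later
    steps) but are dropped from the visible output."""
    visible = []
    for _ in range(max_new_tokens):
        logits = model(input_ids).logits[:, -1, :]          # next-token distribution
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)   # sample [d]
        input_ids = torch.cat([input_ids, next_id], dim=1)  # [a b c] -> [a b c d]
        if next_id.item() not in reasoning_ids:             # hide reasoning tokens
            visible.append(next_id.item())
    return visible
```

Either way, once [d] is appended to the context, sampling [e] conditions on it for free, whether [d] was a vocab token or a reasoning token.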