Ok, so my understanding: you can have the network generate a token that is then fed back as input to future token generation, alongside each output token it generates
These are called reasoning tokens
Initial results with gpt2 are promising
You can generalize this to let the network decide when to generate reasoning tokens (I'm unclear on how). There were also multiple lines in the loss graph for runs with reasoning tokens that I don't quite understand (what's reasoning 1 vs. reasoning 3? Is it the ratio of reasoning tokens? Something else?)
Reasoning 1 vs. 3 is the number of reasoning tokens inserted between each "text" token. One reasoning token is exactly what you see in the picture explanation in the article.
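Concretely, a toy sketch of that interleaving (the token string `<reasoning>` is my placeholder, not necessarily what the article uses):

```python
def interleave(text_tokens, k, reasoning_token="<reasoning>"):
    """Put k reasoning tokens between consecutive text tokens."""
    out = []
    for i, tok in enumerate(text_tokens):
        out.append(tok)
        if i < len(text_tokens) - 1:
            out.extend([reasoning_token] * k)
    return out

print(interleave(["the", "cat", "sat"], k=1))  # "reasoning 1"
print(interleave(["the", "cat", "sat"], k=3))  # "reasoning 3"
```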
The generalization comes from making the network predict a <start reasoning> token and end the reasoning sequence only when it predicts an <end reasoning> token. The training dataset for the upcoming experiment contains examples like: """ Q: What is 3+2? A: 3 + 2 is equal to <start reasoning> <reasoning> ... <reasoning> <end reasoning> 5 """
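For what it's worth, here's a minimal sketch of how those markers could be wired up, assuming the Hugging Face transformers GPT-2 stack (the exact token strings are my guesses, not confirmed from the article):

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Hypothetical marker strings; the article may name them differently.
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<start reasoning>", "<reasoning>", "<end reasoning>"]}
)
model.resize_token_embeddings(len(tokenizer))  # add embedding rows for the new tokens

example = ("Q: What is 3+2? A: 3 + 2 is equal to "
           "<start reasoning> <reasoning> <reasoning> <end reasoning> 5")
input_ids = tokenizer(example, return_tensors="pt").input_ids  # ready for LM training
```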
Wasting two tokens on start/end reasoning seems expensive to me (a priori)
I am curious what that would yield, though; in some ways that would be the most fun to analyze (when does it think a lot??)
I would also be curious to see at what point you hit diminishing returns from reasoning tokens (e.g., at a 1:10 reasoning-to-text ratio? More?)
I'm just speculating here, since I don't know what or where the code is, but since inference is still autoregressive:
given [a b c], sample [d]
the distribution of [d] could be over [reasoning token] | [vocab token]
then at the next step you have
[a b c d], and each token has an embedding vector associated with it
so when you go to sample [e], it's a function of [a b c d]
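In code, that speculation might look something like this (purely my guess at the mechanics; `reasoning_ids` and the model interface are assumptions, not the author's code):

```python
import torch

def sample_with_reasoning(model, input_ids, reasoning_ids, max_new_tokens=50):
    """Plain autoregressive sampling where reasoning tokens share the softmax
    with vocab tokens: they stay in the context (so they condition later
    steps) but are dropped from the visible output."""
    visible = []
    for _ in range(max_new_tokens):
        logits = model(input_ids).logits[:, -1, :]          # next-token distribution
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)   # sample [d]
        input_ids = torch.cat([input_ids, next_id], dim=1)  # [a b c] -> [a b c d]
        if next_id.item() not in reasoning_ids:             # hide reasoning tokens
            visible.append(next_id.item())
    return visible
```

Either way, once [d] is appended to the context, sampling [e] conditions on it for free, whether [d] was a vocab token or a reasoning token.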