Hacker News

Yes, but if the constraints permit only a single valid token at some positions, you could skip the forward pass entirely for those positions and just return that token.
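To make that concrete, here is a rough sketch of what such a decoding loop might look like. The names `allowed_tokens` and `forward_pass` are hypothetical callbacks I made up for illustration, not any particular library's API, and greedy selection stands in for whatever sampler you actually use.

```python
from typing import Callable, List, Sequence, Set, Tuple


def constrained_decode(
    prompt: Sequence[int],
    allowed_tokens: Callable[[List[int]], Set[int]],   # hypothetical: grammar/constraint oracle
    forward_pass: Callable[[List[int]], List[float]],  # hypothetical: returns logits over the vocab
    max_new_tokens: int,
) -> Tuple[List[int], int]:
    """Greedy constrained decoding that skips the model when the token is forced.

    Returns the full token sequence and the number of forward passes actually run.
    """
    tokens = list(prompt)
    model_calls = 0
    for _ in range(max_new_tokens):
        allowed = allowed_tokens(tokens)
        if not allowed:
            break  # the constraints admit no continuation
        if len(allowed) == 1:
            # Only one token is valid, so the logits cannot change the outcome.
            next_token = next(iter(allowed))
        else:
            logits = forward_pass(tokens)
            model_calls += 1
            # Greedy pick among the allowed tokens (an assumption; sampling works too).
            next_token = max(allowed, key=lambda t: logits[t])
        tokens.append(next_token)
    return tokens, model_calls
```

One caveat: in this sketch `forward_pass` takes the whole prefix, so skipped tokens cost nothing. With a KV cache, the forced tokens still have to be fed through the model before the next real prediction, though they can be batched into that one call rather than processed one step at a time.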

The other idea was a bit more theoretical: if you know only a handful of tokens are valid, then calculating the logits of the other tokens in the forward pass is wasteful, since they can't affect the sampling process. However, it's probably not worth the cost to optimize, as it only affects the last layer and might be mostly amortized by SIMD parallel processing anyway.
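For illustration, the second idea amounts to computing only the rows of the output projection that belong to valid tokens. This is a toy sketch in plain Python, assuming the last layer is a simple matrix `unembed` with one row per vocabulary token and no bias; the function names are mine, not from any framework.

```python
import math
from typing import Dict, Iterable, List


def restricted_logits(
    hidden: List[float],
    unembed: List[List[float]],  # assumed shape: vocab_size x hidden_size
    allowed: Iterable[int],
) -> Dict[int, float]:
    """Compute logits only for the allowed token ids, skipping every other row."""
    return {
        t: sum(h * w for h, w in zip(hidden, unembed[t]))
        for t in allowed
    }


def restricted_probs(logits: Dict[int, float]) -> Dict[int, float]:
    """Softmax over the allowed tokens only (subtract the max for numerical stability)."""
    top = max(logits.values())
    exps = {t: math.exp(v - top) for t, v in logits.items()}
    total = sum(exps.values())
    return {t: e / total for t, e in exps.items()}
```

The resulting distribution is the same one you would get by computing all logits, masking the invalid ones to negative infinity, and applying softmax. Whether gathering a few rows actually beats one dense matrix multiply on real hardware is exactly the open question.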