Multi-token prediction is essentially speculative decoding. The Google pages describing their MTP implementation say as much.
Google has now provided a small companion model for each of the existing Gemma 4 models, e.g. "gemma-4-26B-A4B-it-assistant" for "gemma-4-26B-A4B-it".
The difference vs. Qwen is that each small model here is not a general-purpose smaller model but one optimized specifically for this task: predicting the output of the bigger model it is paired with.
This specialization lets the Google "gemma-4-*-assistant" models be much smaller, and thus much faster, than general-purpose small models.
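To make the mechanism concrete, here is a minimal Python sketch of the draft-and-verify loop from the original speculative decoding paper. The two toy "models" and the five-word vocabulary are invented for illustration; they stand in for the paired small/big models, nothing here is from Gemma itself.

```python
# Minimal sketch of the draft-and-verify loop (Leviathan et al., 2022).
# Both "models" are toy stand-ins: any function mapping a context to a
# probability distribution over the vocabulary would do.
import random

VOCAB = ["the", "cat", "sat", "on", "mat"]

def draft_model(context):
    # Cheap model: near-uniform guess (stand-in for the small paired model).
    return [1.0 / len(VOCAB)] * len(VOCAB)

def target_model(context):
    # Expensive model: strongly prefers one token given the context length.
    probs = [0.05] * len(VOCAB)
    probs[len(context) % len(VOCAB)] = 1.0 - 0.05 * (len(VOCAB) - 1)
    return probs

def sample(probs):
    return random.choices(range(len(VOCAB)), weights=probs)[0]

def speculative_step(context, k=4):
    # 1. The draft model cheaply proposes k tokens autoregressively.
    drafted, draft_probs = [], []
    ctx = list(context)
    for _ in range(k):
        q = draft_model(ctx)
        t = sample(q)
        drafted.append(t)
        draft_probs.append(q)
        ctx.append(t)
    # 2. The target model scores all k positions in ONE forward pass
    #    (that batching is the whole speedup; emulated here with a loop).
    accepted = []
    ctx = list(context)
    for t, q in zip(drafted, draft_probs):
        p = target_model(ctx)
        # 3. Accept drafted token t with probability min(1, p[t]/q[t]).
        if random.random() < min(1.0, p[t] / q[t]):
            accepted.append(t)
            ctx.append(t)
        else:
            # 4. On rejection, resample from the residual max(0, p - q)
            #    and stop; later drafted tokens are discarded.
            residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
            total = sum(residual)
            fix = sample([r / total for r in residual]) if total > 0 else sample(p)
            accepted.append(fix)
            break
    return accepted

print([VOCAB[t] for t in speculative_step([0])])
```

The accept/resample rule is what makes the output distribution provably identical to sampling the big model alone; a specialized draft like the "assistant" models just pushes the acceptance rate up, so more of the k drafted tokens survive each round.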
Multi-token prediction is a refined form of speculative decoding: instead of running a separate draft model, the main model is trained to predict several future tokens at once, and those extra predictions serve as the draft.
Researchers at Google came up with speculative decoding in 2022: https://research.google/blog/looking-back-at-speculative-dec... (Fast Inference from Transformers via Speculative Decoding - Yaniv Leviathan, Matan Kalman, Yossi Matias)
Researchers at Meta came up with MTP, a smarter way of doing speculative decoding, in 2024: https://arxiv.org/abs/2404.19737 (Better & Faster Large Language Models via Multi-token Prediction - Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, Gabriel Synnaeve)
DeepSeek V3 shipped MTP in a product first, in 2024: https://arxiv.org/abs/2412.19437 (DeepSeek-V3 Technical Report, 100+ authors)
So these models could be used with llama.cpp today via the -md switch?
Interesting, must try tomorrow.
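Assuming llama.cpp's speculative-decoding options work here as they do for other draft models, the invocation would look something like `llama-server -m gemma-4-26B-A4B-it.gguf -md gemma-4-26B-A4B-it-assistant.gguf` (the GGUF file names are guesses): `-m` loads the big target model and `-md`/`--model-draft` loads the small draft whose proposed tokens the target verifies.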