On TheTom’s llama-cpp fork, TurboQuant makes inference roughly 5–10x slower than vanilla (M1 Max, qwen3.6-35b-a3b). Seems like productionization is still a ways away, but I’m excited to see how far we can get that overhead down.