One other thing you might want to check out for running locally (I haven't independently verified it yet; it's on the TODO list).
https://docs.vllm.ai/en/latest/api/vllm/model_executor/layer...
vLLM apparently already has an implementation of turboquant available, which is claimed to losslessly reduce the KV-cache memory footprint by 6x and improve inference speed by 8x.
From what I understand, the steps are:
1. Launch vLLM.
2. Run a vLLM configure command along the lines of "use kv-turboquant for model xyz".
3. That's it.
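For context, the KV-cache quantization knob vLLM actually documents today is the `--kv-cache-dtype` flag on `vllm serve`. Whether turboquant is (or will be) exposed through that same flag is purely my guess - the flag value below is hypothetical, and the model name is just an example:

```shell
# Documented today: vLLM's KV-cache quantization flag with fp8.
# (Model name is just an example; use whatever you run locally.)
vllm serve meta-llama/Llama-3.1-8B-Instruct --kv-cache-dtype fp8

# Hypothetical: IF turboquant is wired up the same way, it might look
# something like this. Flag value unverified - check the vLLM docs first.
# vllm serve meta-llama/Llama-3.1-8B-Instruct --kv-cache-dtype turboquant
```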
I've got two kids under 8, a full-time job, and a developer-tools project that takes up about 105% of my mental interest... so it's been a bit of a challenge finding the time to swap from Ollama to vLLM to find out if that is true.
So buyer beware :D - and if anyone tries it, please let me know whether it's worth the time!