Have you tried llama.cpp with unsloth and models suited to it? GLM flash? It seemed to allow more models to be tried soon after they are released. Haven’t tried for long term deployment though, that’s the next step.
Highly anecdotal: I have tried various self-hosted models using both vllm and llama.cpp. I am in a situation where I have access to a large amount of memory (~320 GB).
While experimenting with quantization I found that there is a non-trivial tradeoff between quality and memory footprint. Overall my experience follows the reported pattern of "2-bit is meh, 4-bit is half decent, and 6-bit is required for programming". Still, although MiniMax-m2.7 is usable with the 6-bit quantizations that unsloth provides, it felt like such a breath of fresh air when I used the reference full-size model.
I find it difficult to say why. I had mostly the same setup as before (parsing had to be slightly adjusted in Zed). Aside from not experiencing the thinking loops (where minimax would get stuck generating the same sentences over and over) there is little evidence of any real improvement (although the average thinking time felt shorter).
I would recommend against very low quantizations of GLM 5.0/5.1/5.2 or Kimi 2.5/2.6. Smaller models were more reliable, and therefore more useful.
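For the memory side of that tradeoff, here is a rough back-of-the-envelope sketch. It only counts the weights (parameters times bits per weight, divided by 8), and the 200B parameter count and the 20% overhead for KV cache and runtime buffers are assumptions picked for illustration, not measurements from any of the models above. Real quant formats also mix bit widths per tensor, so treat the numbers as a lower bound.

```python
def weight_memory_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    total_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def fits(n_params_billion: float, bits_per_weight: float,
         budget_gb: float, overhead_fraction: float = 0.2) -> bool:
    """True if weights plus an assumed overhead fit in the memory budget.

    overhead_fraction is a guess covering KV cache and runtime buffers;
    the real figure depends on context length and the inference engine.
    """
    needed = weight_memory_gb(n_params_billion, bits_per_weight) * (1 + overhead_fraction)
    return needed <= budget_gb


if __name__ == "__main__":
    hypothetical_params_b = 200  # assumed model size, in billions of parameters
    for bits in (2, 4, 6, 8, 16):
        size = weight_memory_gb(hypothetical_params_b, bits)
        print(f"{bits:>2}-bit: ~{size:.0f} GB weights, "
              f"fits in 320 GB: {fits(hypothetical_params_b, bits, 320)}, "
              f"fits in 96 GB: {fits(hypothetical_params_b, bits, 96)}")
```

This says nothing about quality, which is the part that actually hurt at low bit widths in my experience. It only shows how quickly the footprint grows once you move from 4-bit to 8-bit or full precision.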
I only have access to 96GB VRAM locally, but I'd agree with the general approach of avoiding lower quantizations. Often anything below Q8 seems to suffer greatly on quality, and it seemingly is never worth going below it. Better to go for a smaller model in that case.
The one exception is DwarfStar + DS4-Flash with IQ2_XXS quantization, which somehow seems to not suffer as much as I'd have thought. I'd still opt for a smaller model at Q8 or higher.
I have tried llama.cpp. vllm is nicer (ray, handles queueing, doesn't have the cache invalidation bug for qwen/gemma models), and unsloth has toxic employees in their discord.
I've run 2 qwen/gemma @8bit with full context window side-by-side. Right now I have 4 models on my spark (qwen36moe, embedding, reranker, qwen3-1.7B) to support my markdown kb tool.
The setup is not as capable, but it's still good and gets better with new models/algos. To me, it's more about the freedom to tinker, freedom from token bill anxiety, and a potential right to compute should the government/oligarchy decide it gets to choose who can access which models.
> unsloth has toxic employees in their discord
Would you mind elaborating on this?
Sure,
I shared a project in their #research channel where I used their qwen36moe quant to refresh my PhD research. The channel had a topic that ended with something like "and all things research..."
One of their people accused me of self-promotion, and I reiterated that I shared it in that channel because it was their quant doing something (I thought) interesting as a research model. The number of people interested in the topic can be counted on your hands (in binary).
They remained accusatory, made it personal, and then started deleting messages. I suppose I escalated a bit (from their perspective), saying how this was not a good first encounter, they could have asked me to move it instead of just deleting it. Then they deleted every message, including all of their own, and put me in timeout. Erased from history, unable to participate, and so I left.
A coworker of mine (ML guy) is also sus about their quants. Not that they're nefarious, more that their benchmark results don't necessarily mean the quants are better, and are possibly skewed / benchmaxxed.