Would you use a different quant with a 128 GB machine? Could you link the specific download you used on huggingface? I find a lot of the options there to be confusing.

I usually use unsloth quants, in this case https://huggingface.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-... - the Q4_K_M variant.
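
If it helps, a minimal Python sketch for pulling a single GGUF file with huggingface_hub — note the repo id and filename below are my guesses from the (truncated) link above, so double-check the exact names under "Files" on the model page:

    # sketch: download one GGUF file; repo id and filename are assumptions
    from huggingface_hub import hf_hub_download

    path = hf_hub_download(
        repo_id="unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF",          # assumed repo id
        filename="Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf",          # assumed filename
    )
    print(path)  # pass this path to llama.cpp via -m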

On 128GB I would definitely run a larger model, probably one with ~10B active parameters. It all depends on how many tokens per second feel comfortable to you.
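
As a rough back-of-envelope for why active parameters matter (all numbers here are my assumptions: ~400 GB/s memory bandwidth for a Max-class chip, decode bounded by reading the active weights once per token, overhead and KV cache reads ignored):

    # back-of-envelope decode speed estimate - assumptions, not a benchmark
    bandwidth_gb_s = 400     # assumed Max-class unified memory bandwidth
    active_params_b = 10     # active parameters in billions (MoE)
    bytes_per_param = 0.6    # roughly Q4_K_M (~4.8 bits/weight / 8)

    gb_read_per_token = active_params_b * bytes_per_param
    print(f"~{bandwidth_gb_s / gb_read_per_token:.0f} tok/s upper bound")
    # real numbers will be lower; the benchmark thread below is the better guide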

To get an idea of the speed difference, there is a benchmark page for llama.cpp on Apple silicon here: https://github.com/ggml-org/llama.cpp/discussions/4167

About quant selection: https://gist.github.com/Artefact2/b5f810600771265fc1e3944228...

And my workaround for 'shortening' prompt processing time: I load the files I want to work on (usually 1-3) into context with the instruction "read the code and wait". While the LLM is doing the prompt processing, I write up the instructions for what I want done. Usually the LLM is long finished with PP before I'm finished writing, and thanks to KV caching the answer then comes back almost instantly.
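
For what it's worth, if you drive llama.cpp through its built-in llama-server, the same trick looks roughly like this — a sketch, assuming the server is running on the default port 8080; the cache_prompt flag asks it to keep the shared prefix in the KV cache:

    # sketch of the warm-up-then-ask trick against a local llama-server
    import requests

    code = open("main.py").read()  # the file(s) you want worked on

    def ask(prompt):
        r = requests.post("http://localhost:8080/completion", json={
            "prompt": prompt,
            "n_predict": 512,
            "cache_prompt": True,   # reuse the KV cache for the shared prefix
        })
        return r.json()["content"]

    # first call pays the prompt-processing cost for the code
    ask(code + "\n\nRead the code and wait.")
    # second call shares the long prefix, so it starts answering almost at once
    print(ask(code + "\n\nRefactor the parser into its own module."))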

Not the OP, but yes, you can definitely go with a bigger quant like Q6 if it makes a difference, or with a bigger-parameter model like gpt-oss-120B. A dense ~70B would probably be great for a 128GB machine, though I don't think Qwen offers one at that size. You can search Hugging Face for the model you're interested in, often adding "gguf" to the query, to get a ready-to-go file (e.g. https://huggingface.co/ggml-org/gpt-oss-120b-GGUF/tree/main). Otherwise it's not a big deal to quantize it yourself using llama.cpp.
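
The DIY route is basically two steps — here's a sketch wrapping llama.cpp's own tools from Python; the script and binary names are from memory and can shift between releases, so check the llama.cpp README for the current ones:

    # sketch of DIY quantization with llama.cpp's conversion + quantize tools
    import subprocess

    # 1) convert the original HF checkpoint to a full-precision GGUF
    subprocess.run(["python", "convert_hf_to_gguf.py", "path/to/hf-model",
                    "--outfile", "model-f16.gguf", "--outtype", "f16"], check=True)

    # 2) re-quantize it down to the size you actually want to run
    subprocess.run(["./llama-quantize", "model-f16.gguf",
                    "model-Q6_K.gguf", "Q6_K"], check=True)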