I have an old Supermicro X10DRU-i server with two Tesla V100s (48 GB VRAM) and 128 GB RAM, and have been running Qwen3.6-27B with a lot of success. I would say its performance on my use case (modifying and extending a ~70kloc C++ code base) has been excellent. I have no benchmarks, but it seems comparable to Claude Sonnet 4.6 in capability. I run it with llama.cpp:

llama-server -m Qwen3.6-27B-Q8_0.gguf -c 131072 --tensor-split 0.4,0.6 --batch-size 256 --cont-batching --flash-attn on -ngl 999 --threads 16 --jinja

I regularly get ~22 tok/s when context utilization is below ~65k, but it does slow down to ~13 tok/s when the context is nearly full (lots of swapping to RAM). I have been using the qwen-code harness, though, since it is far more token-efficient than claude-code, which injects massive prompts that chew up the context window. I plan on trying it with pi next.
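The slowdown near a full context is consistent with the KV cache outgrowing what's left of VRAM after the weights. A rough back-of-envelope sketch, where the layer/head counts are placeholder assumptions (I haven't checked Qwen3.6-27B's actual GGUF metadata; substitute the real values to get real numbers):

```python
# KV-cache sizing sketch: why a 131072-token context can spill to RAM.
# All model dimensions below are ASSUMED for illustration, not the
# actual Qwen3.6-27B architecture.
N_LAYERS = 64        # assumed transformer layer count
N_KV_HEADS = 8       # assumed number of KV heads (GQA)
HEAD_DIM = 128       # assumed per-head dimension
KV_DTYPE_BYTES = 2   # f16 K/V entries (llama.cpp's default cache type)
N_CTX = 131072       # context length from the -c flag above

# Per token: a K and a V entry for every layer x kv-head x head-dim slot.
kv_bytes_per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * KV_DTYPE_BYTES
total_gib = kv_bytes_per_token * N_CTX / 2**30
print(f"{kv_bytes_per_token / 1024:.0f} KiB per token, "
      f"{total_gib:.1f} GiB at full context")
```

With those assumed dimensions it comes out to 256 KiB per token and 32 GiB for a full 131072-token cache, which on top of a Q8_0 27B's weights would blow well past 48 GB of VRAM; whatever the real dimensions are, the same arithmetic explains the dropoff once the cache stops fitting.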

I'm keeping my ~$20/mo Claude subscription for the planning prompts, then handing off to Qwen for implementation. It's been working well so far.