I've been running qwen3-5-9b-q4-k-m and qwen3-6-27b-q6-k simultaneously on an Intel Arc Pro B70 with a lot of success.

https://github.com/cptskippy/battlemage-llm-gateway

Opencode has been a huge productivity accelerator. I have two Hermes agents that I'm training to support my workflow with pretty good success. One is a personal assistant who manages my backlog and keeps me on task, follows up with me on items, and will put together research briefs. The other I use as a general-purpose coder and researcher, and it's about 50:50 on the tasks I've given it. In fairness though, the task it failed at left me scratching my head to figure out as well.

Interesting setup, thx for sharing.

How many tokens/sec do you get with 27b? Are you using MTP?

Does Intel make decent GPUs now? I must be out of the loop...

They released a few good value GPUs for LLM inference about a year ago: more memory than AMD and NVIDIA consumer GPUs, not too expensive, but also not great tokens/watt.

I am not sure whether you can find those in stock anywhere.

I'm using an Intel Arc Pro B70, which has 32 GB of VRAM. It's estimated to get ~35-45 t/s, which works out to roughly $21-27 per t/s. An RTX 5090 is ~61 t/s at ~$33 per t/s.

So in terms of raw power Nvidia is still easily king, but in price-to-capacity Intel is best in class.
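
For anyone who wants to sanity-check the per-t/s figures, here is a rough sketch of the arithmetic. The prices are assumptions: they are back-computed from the throughput and cost-per-t/s numbers above (35 × 27 and 45 × 21 both give 945; 61 × 33 gives 2013), not quoted retail prices, so treat them as placeholders and plug in whatever you actually pay.

```python
# Back-of-the-envelope check of the cost-per-throughput figures.
# Prices are ASSUMPTIONS back-computed from the numbers in the comment,
# not quoted retail prices.

def dollars_per_tps(price_usd: float, tokens_per_sec: float) -> float:
    """Cost per unit of throughput in $ per (token/sec); lower is better."""
    return price_usd / tokens_per_sec

B70_PRICE = 945.0       # assumption: implied by 35 t/s * $27 (and 45 t/s * $21)
RTX5090_PRICE = 2013.0  # assumption: implied by 61 t/s * $33

b70_worst = dollars_per_tps(B70_PRICE, 35)  # low end of the throughput estimate
b70_best = dollars_per_tps(B70_PRICE, 45)   # high end of the throughput estimate
rtx5090 = dollars_per_tps(RTX5090_PRICE, 61)

print(f"B70: ${b70_best:.0f}-{b70_worst:.0f} per t/s")
print(f"RTX 5090: ${rtx5090:.0f} per t/s")
```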

Intel's Battlemage GPUs also natively support SR-IOV and GPU partitioning which allows you to isolate workloads. This is useful in homelab environments if you have workloads that benefit from GPU acceleration. I was able to split the B70 into 4 virtual GPUs and hand them out to Frigate NVR, Plex, and other workloads.

What's the value of running the smaller model too? Why not just the big model for everything? I note both are dense, as well.

Tokens per second. The difference between 8B and something like 16B is not as big as you might think in practical usage, and 8B is a lot faster and more interactive than 16B, but there are certain things where it is useful to farm the work out to the large model.

Agree. For local coding help, latency often matters more than raw benchmark quality. A slightly weaker model that answers immediately changes how often you reach for it.

Exactly this.

Creating conversation titles and parsing HTML/JSON don't benefit from 27B models.

The B70 can run both models comfortably side-by-side so it makes better use of time and resources.
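
The split is easy to picture as a tiny router. This is only a minimal sketch of the idea, not how the gateway repo linked above does it: the task labels and the `pick_model` helper are made up for illustration, and only the two model names come from this thread.

```python
# Hypothetical router: cheap, latency-sensitive tasks go to the small model,
# everything else goes to the large one.

SMALL_MODEL = "qwen3-5-9b-q4-k-m"
LARGE_MODEL = "qwen3-6-27b-q6-k"

# Assumption: these task labels are invented for this example.
LIGHT_TASKS = {"conversation_title", "parse_html", "parse_json"}

def pick_model(task: str) -> str:
    """Return the model name to use for a given task label."""
    return SMALL_MODEL if task in LIGHT_TASKS else LARGE_MODEL

for task in ("conversation_title", "parse_json", "refactor_module"):
    print(f"{task} -> {pick_model(task)}")
```

Unknown task labels fall through to the large model here, which seems like the safer default: a slow answer beats a wrong one.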