I'm still fairly new to local LLMs. I spent some time yesterday setting up and testing a few Qwen3 35B-A3B quants (mlx 4-bit and 8-bit, gguf Q4_K_M and Q4_K_XL, I think).
Was impressed at how well they ran on my 64GB M4.
It looks like this new model is slightly "smarter" (based on the tables in TFA) but requires more VRAM. Is that it? The "dense" part being the big deal?
Since 27B < 35B, should we expect quantized models soon that bring the VRAM requirement down?
That's not it. 35B-A3B is a "Mixture of Experts" (MoE) model: roughly, only ~3B parameters are active per token, so the compute cost scales with that ~3B rather than with the full 35B (though you still need high-bandwidth access to all 35B weights).
This model is a "dense" model, so it will be much slower on Macs. Concretely, on an M4 Pro at Q6 gguf, I got ~9 tok/s. 35B-A3B (at Q4 with mlx, so not a fair comparison) was ~70 tok/s.
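Those two numbers roughly line up with a back-of-envelope bandwidth calculation. During decode, every generated token has to stream the active weights from memory at least once, so memory bandwidth sets a ceiling on tok/s. A quick sketch, with all figures being assumptions (~273 GB/s for an M4 Pro, ~0.5 bytes/param at Q4, ~0.75 at Q6), not measurements:

```python
# Back-of-envelope decode-speed ceiling: each token must stream the
# active weights from memory once, so
#   tok/s ceiling ~= memory_bandwidth / (active_params * bytes_per_param)
# All numbers below are rough assumptions, not measurements.

MEM_BANDWIDTH = 273e9  # assumed M4 Pro unified-memory bandwidth, bytes/s

def decode_ceiling(active_params, bytes_per_param, bandwidth=MEM_BANDWIDTH):
    """Upper bound on tokens/sec if decoding is purely bandwidth-bound."""
    return bandwidth / (active_params * bytes_per_param)

# MoE: 35B total, but only ~3B params active per token, at ~Q4 (~0.5 B/param)
moe_ceiling = decode_ceiling(active_params=3e9, bytes_per_param=0.5)

# Dense 27B at ~Q6 (~0.75 B/param): all 27B weights touched every token
dense_ceiling = decode_ceiling(active_params=27e9, bytes_per_param=0.75)

print(f"MoE   ceiling: ~{moe_ceiling:.0f} tok/s")
print(f"dense ceiling: ~{dense_ceiling:.0f} tok/s")
```

The observed ~70 vs ~9 tok/s both sit below these ceilings (real decoding is never perfectly bandwidth-bound), but the order-of-magnitude gap between the two models falls straight out of the arithmetic.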
In general, dedicated GPUs tend to do better with dense models like this, though that becomes harder to judge when the GPU doesn't have enough VRAM to keep the model fully resident. For this model, I'd expect >=24GB of VRAM to be fine, e.g. an NVIDIA {3,4,5}090-type card.
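The rough weight-memory math behind that >=24GB guess (bytes-per-param figures are approximations, and KV cache plus runtime overhead come on top):

```python
# Approximate VRAM needed just to hold a 27B dense model's weights at
# common quantization levels. Bytes/param values are rough assumptions.
PARAMS = 27e9
BYTES_PER_PARAM = {"fp16": 2.0, "q8": 1.0, "q6": 0.75, "q4": 0.5}

weights_gb = {q: PARAMS * bpp / 1e9 for q, bpp in BYTES_PER_PARAM.items()}
for quant, gb in weights_gb.items():
    print(f"{quant}: ~{gb:.1f} GB of weights")
```

At Q4 that's ~13.5GB of weights, leaving headroom for KV cache and context on a 24GB card; at Q8 the weights alone already overflow it, which is why quantized releases matter so much here.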