You can also run the Qwen 3.6 27B dense model on a DGX Spark with comparable performance [1][2] for about $4000 (the Asus Ascent GX10 is $3999 at various retailers).
In theory you can also get 48GB of VRAM with, say, two 3090s, but it will take up a lot of space and generate a lot of heat compared to the MacBook Pro and GB10.
Alternatively you could run it on Strix Halo for $1,000 less, and while it may be slightly slower you won't have to deal with NVIDIA's shit on Linux or worry about having to use their custom kernels or Ubuntu.
> 48GB of VRAM with, say, two 3090s
So like... $2000+ just for the used GPUs? Plus I assume it's considerably more effort to get it working.
> Plus I assume it's considerably more effort to get it working.
Nah, not really. It is a little annoying in terms of space and power, though. Not every case and motherboard can support cards that big.
The tweet you link shows "Qwen 3.6 35b NVFP4 - 256k ctx, 110 tok/s", but I'm getting only half that, around 50 tok/sec, on a DGX Spark with Qwen3.6-35B-A3B-NVFP4 (via vLLM) plus speculative decode w/EAGLE3. I'd be ecstatic to see 110 tok/sec and I wish they had some more sourcing for the exact config, because it's double what I'm getting.
Edit: after actually reading the tweets (had to use xcancel) and visiting the source git repo, I found that switching to MTP for speculative decode makes things a hell of a lot faster, and the abliterated model plus dflash makes it even faster! I'm now seeing 70-90 tok/sec for most stuff. I like!
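For anyone comparing their own numbers against claims like these, here is a minimal sketch of the arithmetic. It assumes a hypothetical `generate` callable that runs one request against whatever server you use and returns the completion token count; it is not part of vLLM or any other library, and it times the whole request, so prompt processing is included in the result.

```python
import time
from typing import Callable


def tokens_per_second(generate: Callable[[], int]) -> float:
    """Time one generation call and return tokens per second.

    `generate` is a hypothetical callable (an assumption for this sketch):
    it should run a single request and return the number of completion
    tokens produced. Wall-clock time covers the entire call.
    """
    start = time.perf_counter()
    n_tokens = generate()
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed


def speedup(new_tps: float, old_tps: float) -> float:
    """Ratio of two throughput figures, e.g. a claimed vs. a measured one."""
    return new_tps / old_tps


if __name__ == "__main__":
    # Figures quoted in the thread: 110 tok/s claimed, about 50 tok/s measured.
    print(f"claimed vs. measured: {speedup(110, 50):.1f}x")
```

Averaging over several runs with the same prompt and output length makes the comparison fairer, since a single request can be noisy.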
I think Atlas might also be slightly faster than vLLM:
https://flowtivity.ai/blog/120-tok-s-1m-context-private-ai-d...