Would you mind sharing what hardware/card(s) you're using? And is https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B... one of the ones you've tested?

Yes, I run it locally on three different AMD Strix Halo machines (a Framework Desktop and two GMKTec machines; two with 128 GB and one with 96 GB) and on a Mac Studio M2 Ultra with 128 GB of unified memory.

I’ve used several runtimes, including vLLM. It works great and it’s fast. I got the best results on Ubuntu after trying a few different distributions and both the Vulkan and ROCm drivers.

Support for this landed in llama.cpp recently if anyone is interested in running it locally.