I don’t do ‘evals’, but I do process billions of tokens every month, and I’ve found these small Nvidia models to be the best by far for their size currently.
As someone else mentioned, the GPT-OSS models are also quite good (though I haven’t found how to make them great yet, though I think they might age well like the Llama 3 models did and get better with time!).
But for a defined task, I’ve found task compliance, understanding, and tool call success rates to be some of the highest on these Nvidia models.
For example, I have a continuous job that evaluates if the data for a startup company on aVenture.vc could have overlapping/conflated two similar but unrelated companies for news articles, research details, investment rounds, etc… which is a token hungry ETL task! And I recently retested this workflow on the top 15 or so models today with <125b parameters, and the Nvidia models were among the best performing for this type of work, particularly around non-hallucination if given adequate grounding.
Also, re: cost - I run local inference on several machines that run continuously, in addition to routing through OpenRouter and the frontier providers, and was pleasantly surprised to find that if I’m a paying customer of OpenRouter otherwise, the free variant there from Nvidia is quite generous for limits, too.
You may want to use the new "derestricted" variants of gpt-oss. While the ostensible goal of these variants is to de-censor them, it ends up removing the models' obsession with policy and wasting thinking tokens that could be used towards actually reasoning through a problem.
Great advice. Have you observed any other differences? I’ve been wondering if there are any specialized variants yet of GPT-OSS models yet that outperform on specific tasks (similar to the countless Llama 3 variants we’ve seen).
>the GPT-OSS models are also quite good
I recently pitted gpt-oss 120b against Qwen3-Next 80b on a lot of internal benchmarks (for production use), and for me, gpt-oss was slightly slower (vLLM, both fit in VRAM), much worse at multilingual tasks (33 languages evaluated), and had worse instruction following (e.g., Qwen3-Next was able to reuse the same prompts I used for Gemma3 perfectly, while gpt-oss struggled and RAG benchmarks suddenly went from 90% to 60% without additional prompt engineering).
And that's with Qwen3-Next being a random unofficial 4-bit quant (compared to gpt-oss having native support) + I had to disable multi-token prediction in Qwen3-Next because vLLM crashed with it.
Has someone here tried both gpt-oss 120b and Qwen3-Next 80b? Maybe I was doing something wrong because I've seen a lot of people praise gpt-oss.
gpt-oss is STEM-maxxed, so I imagine most of the praise comes from people using it for agentic coding.
> We trained the models on a mostly English, text-only dataset, with a focus on STEM, coding, and general knowledge.
https://openai.com/index/introducing-gpt-oss/
Completely agree. I was working on something with TensorRT LLM and threw Nemotron in there more on a whim. It completely mopped the floor with other models for my task (text style transfer), following joint moderation with another LLM & humans. Really impressed.
Would you mind sharing what hardware/card(s) you're using? And is https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B... one of the ones you've tested?
Yes, I run it locally on 3 different AMD Strix Halo machines (Framework Desktop and 2 GMKTec machines, 128gb x 2, 96gb x 1) and a Mac Studio M2 Ultra 128gb of unified memory.
I’ve used several runtimes, including vLLM. Works great! Speedy. Best results with Ubuntu after trying a few different distributions and Vulkan and ROCm drivers.
Support for this landed in llama.cpp recently if anyone is interested in running it locally.
What do you mean about not doing evals? Just literally that you don’t run any benchmarks or do you have something against them?
He's just saying anecdotally these models are good. A reasonable response might be "have you systematically evaluated them?". He has pre-answered - no.
Not OP, but perhaps they mean not putting too much faith in common benchmarks (thanks to benchmaxxing).
Yes to both comments. I said that to:
1. disclose my method was not quantifiably measurable as the not model, because that is not important to me, speed of action/development outcomes is more important to me, and because
2. I’ve observed a large gap between benchmark toppers and my own results
But make no mistake, I like have the terminals scrolling live across multiple monitors so I can glance at them periodically and watch their response quality, so I care and notice which give better/worse results.
My biggest goal right now after accuracy is achieving more natural human-like English for technical writing.