I will answer for the 20B version on my RTX3090 for anyone who is interested (SUPER happy with the quality it outputs, as well). I've had it write a handful of HTML/CSS/JS SPAs already.
With medium and high reasoning I see between 60 and 120 tokens per second, which is outrageous compared to the Llama models I was running before (20-40 tps; I'm sure I could have tuned some parameters somewhere in there).
Do we know why it's so fast, aside from the hardware?
Because he's getting crap output. Open-source models run locally on hardware that underpowered are vastly worse than paid LLMs.
I'm no shill; I'm fairly skeptical about AI, but I've been doing a lot of research and tinkering to see what I'm missing.
I haven't bothered running anything locally because the overwhelming consensus is that it's just not good enough yet. And that's from posts and videos in the last two weeks.
I've not seen something so positive about local LLMs anywhere else.
It's simply not there yet, and it definitely isn't on a 4090.
I don't see how you can make these claims without having your own evals and running these models yourself. The gpt-oss results I'm getting for my use case, which is agentic task execution across a wide variety of tasks on my local device, are spectacular, even more so when you stack them up against every model in the 20B weight class.
That's what I've been feeling too. But it is just a feeling. I'm not running any benchmarks.
My agentic coding "app" (basically just a tool "server" around dotnet/git/fs commands with a kanban board) seems to be able to spit out quick SPAs with little additional prompting.
That is a bit harsh. I'm actually quite pleased with the code it is outputting currently.
I'm not saying it is anywhere close to a paid foundation model, but the code it is outputting (albeit simple) has generally been well written and it works. I do only get a handful of those high-reasoning responses before the 50k-token context window starts dropping things, though.
I guess I meant: how is one 20B-parameter model simply faster than another 20B model? What techniques are they using?
It's a MoE (mixture of experts) architecture, which means only about 3.6 billion parameters are activated per token (out of roughly 20B total for the model). So it should run at about the speed a 3.6B model would run, assuming all of the parameters fit in VRAM.
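To make that concrete, here's a toy PyTorch sketch of top-k expert routing. It is not gpt-oss's actual implementation, and the layer sizes and expert counts are made-up numbers, but it shows why only a slice of the layer's parameters does compute per token even though every expert still has to sit in VRAM:

```python
# Toy MoE layer: the router picks top_k of n_experts per token, so only that
# fraction of the layer's parameters is exercised for any given token.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    def __init__(self, d_model=64, d_ff=256, n_experts=32, top_k=4):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)   # scores each expert per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                              # x: (n_tokens, d_model)
        weights, idx = self.router(x).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for t in range(x.size(0)):                     # naive loop, for clarity
            for slot in range(self.top_k):
                e = idx[t, slot].item()                # only top_k experts actually run
                out[t] += weights[t, slot] * self.experts[e](x[t])
        return out

layer = ToyMoELayer()
print(layer(torch.randn(8, 64)).shape)                 # torch.Size([8, 64])
```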
Generally, a 20B MoE will run faster but be less smart than a 20B dense model. In terms of "intelligence", the rule of thumb is the geometric mean of the number of active parameters and the number of total parameters.
So a 20B model with 3.6B active (like the small gpt-oss) should be roughly comparable in output quality to a sqrt(3.6 × 20) ≈ 8.5B dense model, but run at the speed of a 3.6B model.
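If you want to play with the numbers yourself, here's that rule of thumb as a one-liner (an approximation, not a benchmark; the 3.6B/20B figures are just the ones quoted above):

```python
# Back-of-the-envelope "dense-equivalent" size for a MoE model.
def effective_dense_size(active_b, total_b):
    return (active_b * total_b) ** 0.5   # geometric mean, in billions of params

print(effective_dense_size(3.6, 20))     # ~8.49 -> quality of ~8.5B dense, speed of ~3.6B
```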