It's a MoE (mixture of experts) architecture, which means only about 3.6 billion parameters are activated per token (out of 20b total parameters in the model). So it should run at roughly the same speed as a 3.6b model would, assuming all of the parameters fit in VRAM.
Generally, a 20b MoE will run faster but be less smart than a 20b dense model. The usual rule of thumb is that its "intelligence" is roughly that of a dense model whose parameter count is the geometric mean of the active and total parameter counts.
So a 20b model with 3.6b active (like the small gpt-oss) should be roughly comparable in output quality to a sqrt(3.6 * 20) ≈ 8.5b dense model, while running at the speed of a 3.6b model.
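If you want to play with the numbers yourself, here's a tiny Python sketch of that rule of thumb (the helper name is just mine, not anything official):

```python
import math

def effective_dense_size(active_b: float, total_b: float) -> float:
    """Rough 'equivalent dense model' size (in billions of params),
    per the geometric-mean rule of thumb."""
    return math.sqrt(active_b * total_b)

# gpt-oss-20b: ~3.6b active, ~20b total
print(effective_dense_size(3.6, 20))  # ~8.49 -> behaves like an ~8.5b dense model
```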