Tomorrow NVIDIA will publish Nemotron 3 Ultra, which will be the biggest open weights LLM from a US company (550B parameters).

The early testers have confirmed that it is much better than all earlier US open weights models, but it is not as good as the best Chinese open weights models.

While Nemotron 3 Ultra is not the smartest open weights LLM, it is well optimized for fast inference, so it is much faster than the other LLMs of the same size.

In any case, I think it is very good to have another option among the big open weights LLMs. Experience with the existing models has shown that even when one model is clearly better than another on average, the weaker model can still be better in some particular applications.

With open weights models, you can afford to try multiple LLMs for the more important tasks and then choose the best solution.
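To make "try several and pick the best" concrete, here is a rough sketch of how I would compare candidates on a task I care about. Everything in it is made up for illustration: `pick_best_model`, the two stand-in "models" (plain Python functions instead of real inference calls), and the exact-match scoring are assumptions, not anything tied to Nemotron or a particular inference library.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical helper: score each candidate on the same test cases
# and return the name of the best one along with all scores.
def pick_best_model(
    models: Dict[str, Callable[[str], str]],
    cases: List[Tuple[str, str]],
) -> Tuple[str, Dict[str, float]]:
    scores: Dict[str, float] = {}
    for name, generate in models.items():
        # Exact-match accuracy; a real task would need a better metric.
        correct = sum(
            1 for prompt, expected in cases
            if generate(prompt).strip() == expected
        )
        scores[name] = correct / len(cases)
    best = max(scores, key=lambda name: scores[name])
    return best, scores

# Stand-in "models": in practice each would wrap a locally hosted LLM.
def model_a(prompt: str) -> str:
    return "4" if prompt == "2+2=" else "?"

def model_b(prompt: str) -> str:
    answers = {"2+2=": "4", "capital of France?": "Paris"}
    return answers.get(prompt, "?")

cases = [("2+2=", "4"), ("capital of France?", "Paris")]
best, scores = pick_best_model({"model_a": model_a, "model_b": model_b}, cases)
print(best, scores)
```

The point is only that the comparison is per task: a model that loses on a general benchmark can still win on your own test cases.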

NVIDIA seems to be following a smart Intel-like strategy of selling chips and also creating software that helps create demand for those chips. With Intel it was things like MKL, IPP, and OpenCV; with NVIDIA it is not just CUDA and development libraries but also models like Nemotron.

The pure-AI companies like OpenAI and Anthropic are hoping to sell you API access to cloud-based AI, perhaps running on NVIDIA chips, but it seems NVIDIA's plan may be for you to run AI locally, perhaps NVIDIA's own models, on local NVIDIA chips.

> it is well optimized for fast inference

Do you have any insight into the actual technical details that make this sort of thing possible? I want to learn more about model architectures. Does it have to do with attention mechanisms, sparsity, or something else?