You can run models the size of this one locally, even on a laptop; it's just not a great experience compared with an optimised cloud service. But it is local.
According to the screenshot, this 120B model is about 65 GB on disk, and elsewhere it's said to be trained in FP4, which matches.
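As a quick sanity check on that figure (assuming essentially all 120B parameters are stored at 4 bits each, which is an assumption about how the released weights are packed):

```python
# Rough size check: 120B parameters at 4 bits each.
params = 120e9
bits_per_param = 4
size_gb = params * bits_per_param / 8 / 1e9
print(f"~{size_gb:.0f} GB")  # ~60 GB, in the same ballpark as the ~65 GB in the screenshot
```

The gap between ~60 GB and ~65 GB is plausibly tensors kept at higher precision plus file overhead, but that's a guess.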
That makes this model small enough to run locally on some laptops without reading from SSD.
The Apple M2 Max with 96 GB from January 2023, which is two generations old now, has enough GPU-accessible unified memory to handle it, albeit slowly. Any PC with 96 GB of RAM can run it on the CPU, probably more slowly still. Even a PC with less than 64 GB of RAM can run it, but it will be much slower because it has to read from the SSD constantly.
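To put rough numbers on "slowly": generation on a memory-bound dense model is limited by how fast the weights can be streamed per token. The bandwidth figures below are my own approximations, and the sketch ignores compute, KV cache, and batching, so treat it as order-of-magnitude only:

```python
# Order-of-magnitude estimate: tokens/sec ~= memory bandwidth / bytes read per token.
# For a dense model, every weight is read once per generated token.
model_bytes = 65e9  # ~65 GB of FP4 weights

bandwidths = {
    "Apple M2 Max unified memory (~400 GB/s)": 400e9,
    "Dual-channel DDR5 desktop (~80 GB/s)": 80e9,
    "NVMe SSD streaming (~5 GB/s)": 5e9,
}

for name, bw in bandwidths.items():
    print(f"{name}: ~{bw / model_bytes:.1f} tokens/sec")
```

That's why GPU-class unified memory is usable, CPU RAM is marginal, and paging from SSD is painful.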
If it's an MoE with about 20B active parameters, it only reads roughly a fifth or sixth of the weights per token, making it perhaps 5-6x faster than a dense 120B FP4 model would be, but it still needs all the weights readily available across tokens because different experts get selected each time.
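Extending the same back-of-the-envelope sketch to the MoE case (the ~20B active-parameter figure is the assumption from above, not something I've verified for this model):

```python
# MoE: only the active experts' weights are streamed per generated token.
total_params = 120e9
active_params = 20e9          # assumed ~20B active parameters per token
bits_per_param = 4

gb_per_token = active_params * bits_per_param / 8 / 1e9
reduction = total_params / active_params
print(f"~{gb_per_token:.0f} GB read per token, ~{reduction:.0f}x less than the dense case")
```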
Alternatively, you can distill and/or quantize the model yourself to make a smaller one. Both can be done locally, even on a CPU if necessary, if you don't mind how long it takes, or on a cloud machine rented just long enough to produce the smaller model, which you can then run locally.
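For the quantization half of that, here's a toy sketch of per-group 4-bit absmax quantization in NumPy. It's not the recipe any particular toolchain uses (llama.cpp, GPTQ, AWQ and friends are all more careful than this); it's only meant to show that the operation itself is simple enough to run on an ordinary CPU, just slowly for a model this size:

```python
import numpy as np

def quantize_4bit(weights: np.ndarray, group_size: int = 32):
    """Toy per-group symmetric 4-bit quantization (absmax scaling).

    Splits the weights into groups of `group_size` values and stores one FP16
    scale per group plus a signed 4-bit value (-8..7) per weight.
    """
    w = weights.reshape(-1, group_size)
    scales = np.abs(w).max(axis=1, keepdims=True) / 7.0      # map max magnitude to 7
    scales[scales == 0] = 1.0                                 # avoid division by zero
    q = np.clip(np.round(w / scales), -8, 7).astype(np.int8)  # 4-bit range, held in int8 here
    return q, scales.astype(np.float16)

def dequantize_4bit(q: np.ndarray, scales: np.ndarray, shape):
    return (q.astype(np.float32) * scales).reshape(shape)

# Round-trip a random "layer" and look at the reconstruction error.
rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(4096, 4096)).astype(np.float32)
q, s = quantize_4bit(w)
w_hat = dequantize_4bit(q, s, w.shape)
print("mean abs error:", np.abs(w - w_hat).mean())
```

Distillation is the heavier option of the two, since it involves actual training, but the same point applies: rent the compute briefly, keep the resulting smaller model, run it locally.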