Who's the target audience for a 120B open-weights model? You can only run this in the cloud; is it just PR?
I wish they'd released a nano model for local hackers instead.
You can run models the size of this one locally, even on a laptop; it's just not a great experience compared with an optimised cloud service. But it is local.
The size in bytes of this 120B model is about 65 GB according to the screenshot, and elsewhere it's said to be trained in FP4, which matches.
That makes this model small enough to run locally on some laptops without reading from SSD.
The Apple M2 Max 96GB from January 2023, which is two generations old now, has enough GPU-capable RAM to handle it, albeit slowly. Any PC with 96 GB of RAM can run it on the CPU, probably more slowly. Even a PC with less than 64 GB of RAM can run it but it will be much slower due to having to read from the SSD constantly.
If it's an MoE with ~20B active parameters, it will read about one fifth of the weights per token, making it about 5x faster than a 120B FP4 dense model would be, but it still needs all the weights readily available, since different tokens hit different experts.
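A rough back-of-envelope of what that means for speed, assuming decode is memory-bandwidth-bound, an illustrative ~400 GB/s of bandwidth (roughly laptop-class unified memory), and the ~1/5 active fraction speculated above:

    # Decode speed is roughly memory bandwidth / bytes read per token.
    total_params    = 120e9    # ~120B parameters
    bytes_per_param = 0.5      # FP4 = 4 bits = 0.5 bytes
    weight_bytes    = total_params * bytes_per_param
    print(weight_bytes / 1e9)  # ~60 GB of weights, matching the ~65 GB download

    bandwidth = 400e9          # assumed ~400 GB/s (laptop-class unified memory)

    # Dense: every token touches all the weights.
    print(bandwidth / weight_bytes)                       # ~6-7 tokens/sec

    # MoE with ~1/5 of the weights active per token: ~5x fewer bytes read.
    active_fraction = 0.2
    print(bandwidth / (weight_bytes * active_fraction))   # ~33 tokens/sec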
Alternatively, someone can distill and/or quantize the model themselves to make a smaller model. That can be done locally, even on a CPU if necessary, if you don't mind how long it takes to produce the smaller model; or on a cloud machine rented just long enough to make the smaller model, which you can then run locally.
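To make the quantize-it-yourself part concrete, here's a minimal sketch of blockwise 4-bit absmax quantization in plain numpy. It's purely illustrative of the idea; real toolchains (llama.cpp's quantizer, GPTQ, AWQ, etc.) use more sophisticated formats and calibration.

    import numpy as np

    def quantize_q4_blockwise(weights, block_size=32):
        """Quantize a 1-D float array to 4-bit values with one scale per block."""
        w = weights.reshape(-1, block_size)
        scale = np.abs(w).max(axis=1, keepdims=True) / 7.0  # symmetric int4 range: -7..7
        scale[scale == 0] = 1.0                             # avoid division by zero
        q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
        return q, scale          # values fit in 4 bits (real formats pack two per byte)

    def dequantize(q, scale):
        return (q.astype(np.float32) * scale).reshape(-1)

    # Toy check: ~0.5 bytes per weight instead of 2 (FP16) or 4 (FP32).
    w = np.random.randn(1024).astype(np.float32)
    q, s = quantize_q4_blockwise(w)
    print("mean abs error:", np.abs(w - dequantize(q, s)).mean())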
You can run it locally too. Below are a few of my local models; this one is coming in light compared to them. At Q4 it's ~60 GB. Furthermore, being a MoE, most of it can sit in system memory and only the shared experts need to go to the GPU; provided you have a decent system with decent memory bandwidth, you can get decent performance. I'm running on GPUs; folks with Apple hardware can run this with minimal effort if they have enough RAM.
They are probably hoping that someone else will distill it into smaller models, much like DeepSeek released a giant 671B model and there are now useful distillations down to around 30B.
A model this size is trivial to run on a modern workstation.
You'll have to define "modern workstation" for me, because I was under the impression that unless you've purpose-built your machine to run LLMs, a model this size is impossible.
You can run a 4-bit quantized 120B model on a 96 GB workstation card, the Blackwell Pro workstation card, which is $7,500. Considering the 5090 is bought by gamers for $3,300, it's definitely attainable, even though it's obviously expensive.
I'm running a gaming rig and could swap one in right now without having to change anything compared to my 5090: no $5,000 Threadripper or $1,000 HEDT motherboard with a ton of RAM slots, just a 1000 W PSU and a dream.
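Rough fit check for a single 96 GB card (the KV-cache figures below are placeholder architecture numbers, not gpt-oss specifics):

    # Does a 4-bit 120B model fit in 96 GB with room for context?
    vram_gb    = 96
    weights_gb = 120e9 * 0.5 / 1e9      # 4-bit weights: ~60 GB
    headroom   = vram_gb - weights_gb   # ~36 GB left for KV cache + activations

    # KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * bytes.
    # Assumed/placeholder values, not the model's actual architecture:
    layers, kv_heads, head_dim, kv_bytes = 36, 8, 128, 2    # FP16 cache
    kv_per_token_mb = 2 * layers * kv_heads * head_dim * kv_bytes / 1e6
    print(f"{headroom:.0f} GB headroom, {kv_per_token_mb:.2f} MB KV cache per token")
    print("context tokens that fit:", int(headroom * 1e3 / kv_per_token_mb))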
> 4 bit quantized 120B model on a 96GB workstation card, the Blackwell Pro workstation
Would be interesting to know how it performs in terms of quality and token/sec.
When people say "modern workstation" in the context of LLMs, they usually mean consumer (prosumer?) grade hardware on a single machine, as opposed to racks of GPUs that you can't even buy as a mere mortal (minimum order sizes).
It doesn't mean you can grab your work laptop from 5 years ago and run it there.
Get a Mac Studio with however much memory you need, and ideally an Ultra chip (for max memory bandwidth), and there's your workstation. I regularly run quantized 100B+ models on my M1 Ultra with 128 GB RAM.
For people who run stuff in the cloud?
They have a 20B for the GPU-poors, too.
I will be running the 120B on my 2x4090-48GB, though.