You'll have to define modern workstation for me, because I was under the impression that unless you've purpose-built your machine to run LLMs, this size model is impossible.
You can run a 4-bit quantized 120B model on a 96GB workstation card, the Blackwell Pro workstation card, which is $7500. Considering the 5090 is bought by gamers for $3300, it’s definitely attainable, even though it’s obviously expensive.
I’m running a gaming rig and could swap one in right now without having to change anything compared to my 5090, so no $5000 Threadripper or a $1000 HEDT motherboard with a ton of RAM slots, just a 1000 watt PSU and a dream.
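For anyone wondering why 4-bit makes this fit, here is a rough back-of-envelope sketch of the weight footprint. The 10 GB overhead for KV cache and runtime buffers is an assumption picked for illustration, not a measured number, and the 32 GB figure for the 5090 is its VRAM spec, which isn't stated above.

```python
def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def fits(n_params: float, bits_per_weight: float, vram_gb: float,
         overhead_gb: float = 10.0) -> bool:
    """True if weights plus an assumed overhead fit in the given VRAM.

    overhead_gb is a hypothetical allowance for KV cache and buffers;
    the real number depends on context length and the runtime.
    """
    return weight_footprint_gb(n_params, bits_per_weight) + overhead_gb <= vram_gb


weights_4bit = weight_footprint_gb(120e9, 4)    # 120B params at 4 bits each
weights_fp16 = weight_footprint_gb(120e9, 16)   # same model, unquantized fp16

print(f"4-bit weights: {weights_4bit:.0f} GB")
print(f"fp16 weights:  {weights_fp16:.0f} GB")
print("fits on a 96 GB card at 4-bit:", fits(120e9, 4, 96))
print("fits on a 32 GB card at 4-bit:", fits(120e9, 4, 32))
```

Real quantization formats store scales alongside the weights, so the true footprint is somewhat above the plain 4 bits per weight used here.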
> 4 bit quantized 120B model on a 96GB workstation card, the Blackwell Pro workstation
Would be interesting to know how it performs in terms of quality and token/sec.
When people say "modern workstation" in the context of LLMs, they usually mean consumer (prosumer?) grade hardware in a single machine, as opposed to racks of GPUs that you can't even buy as a mere mortal (minimum order size).
It doesn't mean you can grab your work laptop from 5 years ago and run it there.
Get a Mac Studio with however much memory you need, and ideally an Ultra chip (for max memory bandwidth), and there's your workstation. I regularly run quantized 100B+ models on my M1 Ultra with 128GB RAM.
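On why memory bandwidth matters for tokens/sec: a crude sketch, assuming a dense model where every weight is read once per generated token, so decode speed is capped at roughly bandwidth divided by weight size. The 800 GB/s figure is Apple's published bandwidth for the M1 Ultra, and the 60 GB weight size is an assumed 120B model at 4 bits; neither number comes from this thread. Real throughput will be lower, and MoE models that only touch a subset of weights per token behave differently.

```python
def decode_tokens_per_sec_upper_bound(bandwidth_gb_s: float,
                                      weights_gb: float) -> float:
    """Crude ceiling on decode speed for a dense model.

    Assumes each generated token requires streaming all weights from
    memory once, and ignores KV cache reads and compute time.
    """
    return bandwidth_gb_s / weights_gb


# Assumed inputs: 800 GB/s memory bandwidth, 60 GB of 4-bit weights.
ceiling = decode_tokens_per_sec_upper_bound(800.0, 60.0)
print(f"upper bound: about {ceiling:.1f} tokens/sec")
```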