Out of sheer curiosity: What’s required for the average Joe to use this, even at a glacial pace, in terms of hardware? Or is it even possible without using smart person magic to append enchanted numbers and make it smaller for us masses?
We made DeepSeek R1 run on a local device via offloading and 1.58bit quantization :) https://unsloth.ai/blog/deepseekr1-dynamic
I'm working on the new one!
Your 1.58-bit dynamic quant model is a religious experience, even at one or two tokens per second (which is what I get on my 128 GB Raptor Lake + 4090). It's like owning your own genie... just ridiculously smart. Thanks for the work you've put into it!
Likewise - for me, it feels like how I imagine getting a microcomputer in the '70s felt. (Including the hit to the wallet… an Apple II cost the 2024 equivalent of ~$5k, too.)
:) The good ol days!
Oh thank you! :) Glad they were useful!
> 1.58bit quantization
Of course we can run any model if we quantize it enough, but I think the OP was talking about the unquantized version.
Oh you can still run them unquantized! See https://docs.unsloth.ai/basics/llama-4-how-to-run-and-fine-t... where we show you can offload all MoE layers to system RAM and leave the non-MoE layers on the GPU - the speed is still pretty good!
You can do it via `-ot ".ffn_.*_exps.=CPU"`
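For reference, here's a minimal llama.cpp sketch of that flag in context (the filename, context size, and prompt are placeholders, not the exact setup from the guide):

  # -ngl 99 offloads every layer to the GPU; the -ot regex then pins the MoE expert
  # tensors (the bulk of the weights) back to system RAM, as described above.
  ./llama-cli -m DeepSeek-R1-UD-IQ1_S.gguf -ngl 99 \
    -ot ".ffn_.*_exps.=CPU" -c 8192 -p "Summarize MoE offloading in one sentence."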
Thanks, I'll try it! I guess "mixing" GPU+CPU would hurt the perf tho.
I use this a lot! Thanks for your work and looking forward to the next one
Thank you!! New versions should be much better!
You can run the 4-bit quantized version of it on an M3 Ultra 512GB. That's quite expensive though. Another alternative is a fast CPU with 500GB of DDR5 RAM. That, of course, is also not cheap and is slower than the M3 Ultra. Or you buy multiple Nvidia cards to reach ~500GB of VRAM. That is probably the most expensive option, but also the fastest.
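As a rough sanity check on why 4-bit fits in ~500GB (weights only, ignoring KV cache and runtime overhead):

  # 671B parameters at 4 bits (0.5 bytes) each:
  echo $(( 671 * 5 / 10 )) GB of weights   # ~335 GB, which fits under 512 GB with room to spare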
If you only use the excess memory for AI, it's cheaper to rent. A single H100 costs less than $2 per hour (incl. power).
Vast.ai has a bunch of 1x H100 SXM available, right now the cheapest at $1.554/hr.
Not affiliated, just a (mostly) happy user, although don't trust the bandwidth numbers, lots of variance (not surprising though, it is a user-to-user marketplace).
Every time someone asks me what hardware to buy to run these at home I show them how many thousands of hours at vast.ai you could get for the same cost.
I don't even know how these Vast servers make money because there is no way you can ever pay off your hardware from the pennies you're getting.
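To put rough numbers on the rent-vs-buy comparison (a sketch; assumes ~$10k of local hardware and the ~$1.55/hr H100 rate quoted above):

  # Hours of rented H100 time you could buy for the price of a local rig:
  echo "10000 / 1.55" | bc   # ~6450 hours, i.e. years of casual use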
Worth mentioning that a single H100 (80-96GB) is not enough to run R1. You're looking at 6-8 GPUs on the lower end, and factor in the setup and download time.
An alternative is to use serverless GPU or LLM providers which abstract some of this for you, albeit at a higher cost and slow starts when you first use your model for some time.
Yeah, to run the full precision model you need either two 8xH100 nodes connected via Infiniband or one 8xH200 node or one 8xB200 node.
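The back-of-the-envelope version of that, weights only (ignoring KV cache and activations):

  # 671B params at FP8 is ~671 GB of weights, which already exceeds a single 8x H100-80GB node:
  echo $(( 8 * 80 )) GB vs $(( 8 * 141 )) GB   # 640 GB (8x H100) vs 1128 GB (8x H200)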
Not for the GPU poor, to be sure.
It is enough to run the dynamically quantised 1.58-bit version I believe, which is fun to play around with.
About 768 GB of DDR5 RAM in a dual-socket server board with 12-channel memory, plus an extra 16 GB-or-better GPU for prompt processing. It's a few grand just to run this thing at 8-10 tokens/s.
About $8000 plus the GPU. Let's throw in a 4080 for about $1k, and you have the full setup for the price of 3 RTX5090. Or cheaper than a single A100. That's not a bad deal.
For the hobby version you would presumably buy a used server and a used GPU. DDR4 ECC Ram can be had for a little over $1/GB, so you could probably build the whole thing for around $2k
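For a sanity check on the 8-10 tokens/s figure above, here's a rough memory-bandwidth estimate (hedged: it assumes 12-channel DDR5-4800 per socket and ~37B active parameters per token, since R1 is a MoE model):

  # Per-socket bandwidth in GB/s divided by ~37 GB of active weights read per token at 8-bit:
  echo "scale=1; (4800 * 8 * 12 / 1000) / 37" | bc   # ~12 tokens/s theoretical ceiling per socket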
Been putting together a "mining rig" [1] (or rather I was before the tariffs, ha ha.) Going to try to add a 2nd GPU soon. (And I should try these quantized versions.)
Mobo was some kind of mining rig from AliExpress for less than $100. GPU is an inexpensive NVIDIA TESLA card that I 3D printed a shroud for (added fans). Power supply a cheap 2000 Watt Dell server PS off eBay....
[1] https://bsky.app/profile/engineersneedart.com/post/3lmg4kiz4...
This is the state of the art for such a setup. Really good performance!
https://github.com/kvcache-ai/ktransformers
I have a $2k used dual-socket Xeon with 768GB of DDR4 - it runs at about 1.5 tokens/sec for the 4-bit quantized version.
It's probably going to be free at OpenRouter.
There's already a 685B parameter DeepSeek V3 for free there.
https://openrouter.ai/deepseek/deepseek-chat-v3-0324:free
It is free to use, but you're feeding OR data and someone is profiting off that.
That's how a lot of application-layer startups are going to make money. There is a bunch of high-quality usage data. Either you monetize it yourself (Cursor), get acquired (Windsurf), or provide that data to others at a fee (lmsys, Mercor). This is inevitable and the market for this is just going to increase. If you want to prevent this as an org, there aren't many ways out: either use open-source models you can deploy yourself, or deal directly with model providers where you can sign specific contracts.
You're actually sending data to random GPUs connected to one of the Bittensor subnets that run LLMs.
Those can, today, collect that data and sell it. There is work being done to add TEEs (trusted execution environments), but it isn't live yet.
Not every prompt is privacy sensitive.
For example you could use it to summarize a public article.
Every prompt is valuable.
And you are getting something valuable in return. It's probably a good trade for many, especially when they are doing something like summarizing a public article.
I'm not so sure. I have agents that do categorization work. Take a title, drill through a browse tree to find the most applicable leaf category. Lots of other classification tasks that are not particularly sensitive and it's hard to imagine them being very good for training. Also transformations of anonymized numerical data, parsing, etc.
"one man's garbage is another man's treasure"
Using an AI for free is also valuable. Seems win/win.
This isn’t about reciprocal value. Even if something isn't privacy sensitive, it still holds value.
Practically, smaller, quantized versions of R1 can be run on a pretty typical MacBook Pro setup. Quantized versions are definitely less capable, but they will absolutely run.
Truthfully, it's just not worth it. You either run these things so slowly that you're wasting your time or you have to buy 4- or 5-figures of hardware that's going to sit, mostly unused.
As mentioned, you can run this on a server board with 768+ GB of memory in CPU mode. The average Joe is going to be running quantized 30B (not 600B+) models on a $300/$400/$900 8/12/16 GB GPU.
I'm not sure that's enough RAM to run it at full precision (FP8).
This guy ran a 4-bit quantized version with 768GB RAM: https://news.ycombinator.com/item?id=42897205
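On the FP8 question, a weights-only estimate suggests 768 GB is enough, but only barely (ignoring KV cache and OS overhead):

  # 671B params at 1 byte (FP8) vs 0.5 bytes (4-bit):
  echo $(( 671 )) GB at FP8, $(( 671 / 2 )) GB at 4-bit   # 768 GB covers either, FP8 only just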
You can pay Amazon to do it for you at about a penny per 10 thousand tokens.
There are a couple of guides for setting it up "manually" on EC2 instances so you're not paying the Bedrock per-token prices. Here's one [1] that uses four g6e.48xlarge instances (each with 192 vCPUs, 1536 GB RAM, and 8x L40S Tensor Core GPUs with 48 GB of memory per GPU).
A quick Google tells me that a g6e.48xlarge is something like $22k per month?
[0] https://aws.amazon.com/bedrock/deepseek/
[1] https://community.aws/content/2w2T9a1HOICvNCVKVRyVXUxuKff/de...
I'm sure it will be on OpenRouter within the next day or so. Not really practical to run a 685B param model at home.
Hardware: any computer from the last 20 or so years.
Software: client of choice to https://openrouter.ai/deepseek/deepseek-r1-0528
Sorry I'm being cheeky here, but realistically unless you want to shell out 10k for the equivalent of a Mac Studio with 512GB of RAM, you are best using other services or a small distilled model based on this one.
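For the "other services" route, here's a minimal curl sketch against OpenRouter's OpenAI-compatible endpoint (assumes you have an API key in OPENROUTER_API_KEY; the model slug is the one linked above):

  curl https://openrouter.ai/api/v1/chat/completions \
    -H "Authorization: Bearer $OPENROUTER_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{"model": "deepseek/deepseek-r1-0528", "messages": [{"role": "user", "content": "Hello, R1."}]}'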
> even at a glacial pace
If speed is truly not an issue, you can run DeepSeek on pretty much any PC with a large enough swap file, at a speed of about one token every 10 minutes assuming a plain old HDD.
Something more reasonable would be a used server CPU with as many memory channels as possible and DDR4 ram for less than $2000.
But before spending big, it might be a good idea to rent a server to get a feel for it.
I'm using GPT4All with DeepSeek-R1-Distill-Qwen-7B (which is not R1-0528) on a Ryzen 5 3600 with 32 GB of RAM.
With an average of 3.6 tokens/sec, answers usually take 150-200 seconds.