We made DeepSeek R1 run on a local device via offloading and 1.58bit quantization :) https://unsloth.ai/blog/deepseekr1-dynamic
I'm working on the new one!
Your 1.58-bit dynamic quant model is a religious experience, even at one or two tokens per second (which is what I get on my 128 GB Raptor Lake + 4090). It's like owning your own genie... just ridiculously smart. Thanks for the work you've put into it!
Likewise - for me, it feels like how I imagine getting a microcomputer in the '70s felt. (Including the hit to the wallet… an Apple II cost the 2024 equivalent of ~$5k, too.)
:) The good ol days!
Oh thank you! :) Glad they were useful!
> 1.58bit quantization
Of course we can run any model if we quantize it enough, but I think the OP was talking about the unquantized version.
Oh, you can still run them unquantized! See https://docs.unsloth.ai/basics/llama-4-how-to-run-and-fine-t... where we show how you can offload all MoE layers to system RAM and leave the non-MoE layers on the GPU - the speed is still pretty good!
You can do it via `-ot ".ffn_.*_exps.=CPU"`
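In case it helps, here's roughly what a full llama.cpp invocation looks like with that flag - a sketch only, where the model filename, `-ngl`, and context size are placeholders you'd adjust for your own quant and hardware:

```
# Sketch: offload MoE expert tensors to system RAM, keep the rest on the GPU.
# -ngl 99 pushes all layers it can onto the GPU,
# -ot ".ffn_.*_exps.=CPU" overrides the MoE expert tensors to live in CPU RAM.
./llama-cli \
  -m DeepSeek-R1-UD-IQ1_S.gguf \
  -ngl 99 \
  -ot ".ffn_.*_exps.=CPU" \
  -c 8192 \
  --prompt "Hello"
```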
Thanks, I'll try it! I guess mixing GPU + CPU would hurt the perf, though.
I use this a lot! Thanks for your work and looking forward to the next one
Thank you!! New versions should be much better!