> Trying to run them on a unified memory Mac
> but still not quite in the realm of Sonnet or DeepSeek 4 Flash
these are not mutually exclusive anymore. DS4 has set the bar for me these days. https://github.com/antirez/ds4
> Trying to run them on a unified memory Mac
> but still not quite in the realm of Sonnet or DeepSeek 4 Flash
these are not mutually exclusive anymore. DS4 has set the bar for me these days. https://github.com/antirez/ds4
someone just put this on my radar yesterday, im about to try this today. how's your experience with it?
me thinks there's a lot of optimization strats we're currently leaving on the table just because the amount of things to explore and test are so expansive. but this one is super interesting targeting metal primarily and zeroing in on one model. instead of a one size fits all llama.cpp im very interested to see if theres a future where super tailor-made variants per model pans out to harnesses that can rapidly switch ultimately providing something akin to sonnet/early opus territory (that's my personal bench mark of good-enough i shall now cancel the hell out of this claude sub)
I'm on the verge of cancelling my anthropic $20 plan since it's come out. On an M5 Max 128GB, hooked up to the pi.dev harness, I get in the neighborhood of 400-450tps prefill and 30-35tps generation. It is imminently usable and at times feels more stable than my previous CC setup. Occasionally there are things it struggles with that I will bounce back over to CC for, but it is highly usable. The future is bright for local models! As a tinkerer, it makes me really happy to have a local setup I can be just as productive in, and not have the token overlords ready to shut me down at any time.
That's DS4 Flash right? How does it feel in intelligence and speed compared to DS4 Flash hosted by Deepseek themselves or another API provider? I've been using API DS4 Flash for a lot of personal projects and have been quite impressed. I've spent $1 on building ~10 toy projects and gotten them all to work within the bounds of what I wanted without having to do much besides guide the model away from dumb loops.
I'm using the DS4 flash IQ2 2-bit quant, per Salvadore's recommendations for my hardware in the repo. I haven't messed with the cloud hosted variant. The only other paid API I have messed with is a $20 Anthropic sub, primarily with whatever the latest version of Sonnet is. For the most part, this local configuration feels on par with that.
With this configuration (set up over the last month) I have been working on Python data processing tools, an internal Svelte 5/SvelteKit data intensive BI app, and some smaller Rust projects. It's been doing really well there.