I was looking into self-hosting DeepSeek V4 Pro, since frankly cache reads are an absolute scam and they're 90% of my cost. But then I looked at the ROI, and it will never pay off fast enough: the hardware would become obsolete first, even if you were running 10 token-generation streams 24/7.
The napkin math came out to renting being around 27 times cheaper than owning (not even counting power). I think we're really screwed when it comes to owned access to AI unless Intel comes out swinging with a C-series card that has 128 GB of VRAM, so we could run these models in a 4x128 GB configuration. That seems unlikely, though, since Nvidia holds a large stake in them.
This was calculated assuming around 30 tok/s. You can of course get 2-5 tok/s much, much cheaper, but that's unusable for my workflow.
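To make the "never pays off" point concrete, here's a sketch of the break-even calculation. The hardware cost and API price below are made-up placeholders, not the actual figures behind the 27x number; plug in your own.

```python
# Napkin rent-vs-own comparison. All inputs are hypothetical
# placeholders -- substitute your own hardware and API prices.

def break_even_days(hardware_cost, tokens_per_day, api_price_per_mtok):
    """Days of 24/7 use before owning beats renting (ignoring power)."""
    daily_api_cost = tokens_per_day / 1e6 * api_price_per_mtok
    return hardware_cost / daily_api_cost

# 10 streams * 30 tok/s * 86,400 s/day ~= 26M tokens/day
tokens_per_day = 10 * 30 * 86_400

# Hypothetical: $150k of hardware vs a $1 / M tok API price.
print(break_even_days(150_000, tokens_per_day, 1.0))  # thousands of days
```

Even at full 24/7 utilization, a break-even horizon measured in thousands of days is well past the useful life of the cards.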
Ironically, about the only provider not scamming you on cache reads is DeepSeek.
Everyone else charges a ridiculous amount, but DeepSeek's API is $0.003625 / M tok.
I'm surprised no one talks about this, given how significant it is. GPT 5.5, for example, costs a ridiculous $0.50 / M tok cached. DeepSeek is literally almost 140 times cheaper, which matters a lot for tool calls.
It's a temporary promo; DeepSeek will go back to being only ~10x cheaper afterwards.
Yes, DeepSeek V4 Pro is currently on discount.
> The deepseek-v4-pro model is currently offered at a 75% discount, extended until 2026/05/31 15:59 UTC.
However, even when the discount ends it's still very cheap. It will go back to $0.0145 / M tok per cache hit. That's still ~34x cheaper than GPT 5.5.
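The ratios check out with simple arithmetic on the prices quoted above:

```python
# Prices per million cached tokens, as quoted in the thread.
gpt_cached = 0.50       # GPT 5.5, cached
ds_promo = 0.003625     # DeepSeek cache hit, promo price
ds_regular = 0.0145     # DeepSeek cache hit, after the discount ends

print(gpt_cached / ds_promo)    # ~138x cheaper during the promo
print(gpt_cached / ds_regular)  # ~34x cheaper at the regular price
```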
Doesn't matter when subscriptions get cache reads for free. It's only really worth it if it's ~340x cheaper; otherwise I'd be paying $120 a day, with 90% of the cost being cache reads, for any top-level open-source model.
The only way to serve AI profitably is with large batch sizes: run 500 requests at the same time.
If you serve a single user, you'll never make back your electricity cost, never mind the hardware.
Would you mind sharing the napkin maths?
Not OP, but basically: take your memory bandwidth in GiB/s and divide by 30 (the target tok/s). You also need at least 128 GiB to hold the model. It's expensive to get 200 GiB/s, very expensive to get 400 GiB/s, and above that you're looking at DC-grade GPUs. Multiple, in fact.
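If I'm reading that right, the underlying model is that decode is memory-bandwidth-bound: every token requires streaming the active weights through the memory bus once, so tok/s is roughly bandwidth divided by active weight size. A sketch (the "divide by 30" step just inverts this for a 30 tok/s target; all numbers are illustrative):

```python
# Napkin model: decode speed is bounded by memory bandwidth,
# since each generated token streams the active weights once.

def max_tok_per_s(bandwidth_gib_s, active_weights_gib):
    """Rough upper bound on decode tok/s for a bandwidth-bound model."""
    return bandwidth_gib_s / active_weights_gib

# "Divide by 30": at a 30 tok/s target, 200 GiB/s of bandwidth
# buys you roughly 6.7 GiB of active weights per token.
print(200 / 30)

# Equivalently: ~6.7 GiB of active weights at 200 GiB/s gives ~30 tok/s.
print(max_tok_per_s(200, 200 / 30))
```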