The point is not to be as good as the multi-trillion parameter model you can host across 72 GPUs (or whatever).
I'm running a 248B model on a paltry amount of hardware and getting plenty of good use out of it.
Sure, the most demanding tasks will demand the best models (and always will). There are still less demanding tasks for other models.
I think some people are fooling themselves that coding, of all tasks, is always going to require the biggest models ever. Again, maybe some coding tasks will, but the majority of business CRUD apps probably don't. Same goes for virtually any other type of task. The biggest models are really only useful for the most complex tasks.
If you wouldn't mind, could you explain a bit what the 248B model is good for, and where it breaks down and you need something better? I hear this take often, but it is always a fleeting remark so I have no idea what the 'useful' looks like - at all.
To answer this and my sibling, it's DeepSeek V4 Flash at native FP4 quantization, on two Nvidia DGX Sparks. Which is a bit of kit but still paltry relative to the data centre. ~40 TPS generation, ~2000 TPS prompt processing, which makes it feel approximately as fast as typical APIs.
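To put those throughput numbers in wall-clock terms, here is a back-of-envelope sketch. It assumes latency is simply prefill time plus generation time at the ~2000 and ~40 TPS quoted above; the request sizes are hypothetical, picked only for illustration, and real runs will vary with batching, caching, and context length.

```python
def estimated_latency_seconds(prompt_tokens, output_tokens,
                              prefill_tps=2000.0, decode_tps=40.0):
    """Rough estimate: time to process the prompt plus time to generate the reply.

    The default rates are the approximate figures quoted in the comment above.
    """
    prefill_time = prompt_tokens / prefill_tps
    decode_time = output_tokens / decode_tps
    return prefill_time + decode_time


# Hypothetical request sizes (prompt tokens, output tokens), for illustration only.
for prompt, output in [(2_000, 200), (20_000, 500), (100_000, 1_000)]:
    seconds = estimated_latency_seconds(prompt, output)
    print(f"{prompt:>7} prompt + {output:>5} output tokens: ~{seconds:.1f} s")
```

Under these assumptions, generation speed dominates for short prompts, while prompt processing dominates once the context gets very long.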
I primarily use it with my own harness for coding. I'm not going to say it will compete with Opus in the most challenging domains, because it won't, but I will say that there's a reasonable likelihood that Opus is used for tasks that a model like Flash could comfortably handle at 1/100th the cost.
So far I've only seen it struggle at tasks that I myself would struggle with. Tasks that I can describe the shape of the solution for, it has a high success rate at implementing.
Useful is going to be different for everyone. I'm not working on the hardest problems, so I don't need the best models.
In my experience they require much more hand-holding and more specific directions, with fewer possibilities to interpret a command in several ways. You do the planning and keep an eye on what they're producing, and they do the legwork. It's not that their knowledge of Java or PHP or what have you is lacking, it's the long-horizon planning that you have to do yourself. Technically they're good. You just have to do more thinking and more reviewing yourself. YMMV.
Depending on quantization I figure they need at least a p4 and likely a p5 EC2 (or similar instance in another provider) for a model with that many parameters. Maybe they are hosting on bare metal but I imagine not. Those instance types (assuming not using spot) are quite expensive to run.
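For a rough sense of how much quantization changes the footprint, here is a weights-only sketch. It ignores KV cache, activations, and runtime overhead, so real requirements will be higher. The 248B figure is the one mentioned upthread; the function name and the list of bit widths are just illustrative choices.

```python
def weight_memory_gb(params_billions, bits_per_param):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_param / 8
    return total_bytes / 1e9


# Weights-only footprint of a 248B-parameter model at a few common precisions.
for label, bits in [("FP16", 16), ("FP8", 8), ("FP4", 4)]:
    print(f"{label}: ~{weight_memory_gb(248, bits):.0f} GB")
```

By this estimate the weights drop from roughly 496 GB at 16-bit to roughly 124 GB at 4-bit, which is why the hardware needed depends so heavily on the quantization.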