Hacker News

hei-lima 10 hours ago [ - ]

We need another "Deepseek moment" or else it will become impossible for the regular dude to use AI. It will become something that only big companies can afford.

SwellJoe 9 hours ago [ - ]

We're having DeepSeek moments every couple of weeks.

Qwen 3.6 hit hard in the self-hosting space. It's incredibly capable for its size, really shaking up what's possible in 64GB or even 32GB of VRAM.

The Prism Bonsai ternary model crams a tremendous amount of capability into 1.75GB.

And, DeepSeek V4 is crazy good for the price. They're charging flash model prices for their top-tier Pro model, which is competitive with the frontier of a few months ago.

The winners in the AI war will be the companies that figure out how to run models efficiently, not the ones that eke out a couple percent better performance on a benchmark while spending ten times as much on inference. The capability has to be there, but I think we're seeing that capability alone isn't a strong moat: there's enough competent competition to ensure there are always at least a few options, even at the very frontier of capability.

Zambyte 9 hours ago [ - ]

> It's incredibly capable for its size, really shaking up what's possible in 64GB or even 32GB of VRAM.

You can lower that to at least 24GB. I've been running Qwen 3.5 and 3.6 with codex on a 7900 XTX and the long horizon tasks it can handle successfully have been blowing my mind. I would seriously choose running my current local setup over (the SOTA models + ecosystem) of a year ago just based on how productive I can be.

hei-lima 5 hours ago [ - ]

Gonna try it.

trollbridge 8 hours ago [ - ]

We have Qwen 3.6-35b (6) on a 5090 (32GB) and it's blowing me away. Works fine for most (not all) code generation tasks. One developer here has been extremely stubborn about adopting AI; he's finally adopted it, albeit only when it's coming from a local model like this.

DeepSeek V4 Pro likewise is insanely good for the price. I simply point it at large codebases, go get a cup of coffee or browse Hacker News, and then it's done useful work. This was simply not possible with other models without hitting budget problems.

akulbe 7 hours ago [ - ]

Any chance you'd be willing to talk further about your setup? I have 2 x 3090s in a local machine, and I'm still left with questions about how best to use stuff locally.

sheeshkebab 5 hours ago [ - ]

You can only run heavily quantized models on all 3/4/5 rtx gpus (with 32gb or less vram) - and you probably want moe versions like Qwen 35b for this to run at speed somewhat comparable to Claude. It’s still not there to be honest but getting there. Personally I mess around with llama.cpp on m5 max with 128gb - it’s a decent setup to try various medium sized things, and runs llms surprisingly well without quantization, at least the moe models.

akulbe an hour ago [ - ]

How is that machine for local inference? It's a serious consideration for me, but getting to hear more from folks that already have it would be helpful.

SwellJoe 5 hours ago [ - ]

Two 3090s is 48GB, so it's possible to run the 6-bit quantization comfortably, which is fine. It doesn't start to get notably dumber until lower than that. It won't be as fast as a hosted model, but dual 3090s will be comfortably fast for interactive use with the MoE version and not terrible to use with the dense model. I run the dense model at 8 bits on my dual Radeon V620 desktop machine, which I think would be slower than two 3090s, or at least not notably faster.
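A rough back-of-the-envelope sketch of the sizing above, assuming weight memory is simply parameters × bits / 8 and taking the 35B parameter count from the Qwen 35B model mentioned elsewhere in this thread. The function name and the 48 GB budget are illustrative only. Real usage also needs room for KV cache, activations, and runtime overhead, which this ignores, and hybrid quantizations do not use one uniform bit width.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal GB: params * bits / 8 bytes."""
    return params_billion * bits_per_weight / 8


VRAM_GB = 48  # two 24 GB RTX 3090s, as in the comment above

for bits in (8, 6, 4):
    size = weight_gb(35, bits)  # 35B parameters is an assumption for illustration
    headroom = VRAM_GB - size
    print(f"{bits}-bit: ~{size:.1f} GB weights, ~{headroom:.1f} GB left for KV cache and overhead")
```

Under these assumptions a 6-bit 35B model needs roughly 26 GB for weights, which is why it fits in 48 GB with room to spare but is tight on a single 24 GB card.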

hedgehog 5 hours ago [ - ]

Have you done comparisons with 4 bit and seen a noticeable difference for coding tasks?

SwellJoe 3 hours ago [ - ]

No, I've just seen benchmarks showing most models start degrading around 4-5 bits. That's not to say they become useless, just that down to about 6 bits (with careful hybrid quantizations like unsloth where some of the layers aren't quantized or are quantized at higher bit depths) the quality isn't measurably degraded, but below that there are measurable differences in performance.

People report good results from DeepSeek V4 Flash at 2 bits (the DwarfStar 4 folks are doing it, and I've tried it on my Strix Halo, but it's too slow to be usable, so I haven't bothered to figure out if it's actually smart enough to use for anything).

Anyway, it's obvious models have to degrade in terms of knowledge, at any quantization, even though it may not show up clearly on benchmarks until lower. If you halve the size of the data available, it necessarily loses information about the world.

hedgehog 28 minutes ago [ - ]

The data I've seen is stuff like the KL Divergence comparisons that Unsloth does which show something but not clearly whether there's an observable or significant difference in task performance.
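For readers unfamiliar with the metric, here is a minimal sketch of what a KL divergence comparison measures: how far a quantized model's next-token distribution drifts from the full-precision one. The probabilities below are made up for illustration, and this is not Unsloth's actual evaluation code. Real comparisons average this over many token positions and a full vocabulary.

```python
import math


def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(P || Q) in nats for two discrete distributions over the same tokens.

    Assumes q is nonzero wherever p is nonzero.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical next-token probabilities over a 4-token vocabulary.
full_precision = [0.70, 0.20, 0.05, 0.05]
quantized = [0.60, 0.25, 0.10, 0.05]

print(f"KL(full || quantized) = {kl_divergence(full_precision, quantized):.4f} nats")
```

A value of zero means the two distributions are identical. As the comment notes, a small nonzero value shows the outputs differ but does not by itself say whether task performance changes.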

akulbe an hour ago [ - ]

One of the things I'm wondering about is what I'm missing for $LLM to create files on the local FS like Claude and Codex do. What I see instead is stuff just printing to stdout, rather than files on the filesystem.

What am I missing?

SwellJoe 18 minutes ago [ - ]

You're missing an agent. The model uses tool calls to interact with the filesystem, commands on the system, optionally search (you need a search MCP server, like Brave or Exa, and an API key), etc.

I usually use the Zed Agent built into Zed editor for self-hosted models, but you could use Pi, OpenCode, Hermes, Claude Code, etc. There are many, many agents.

hedgehog 32 minutes ago [ - ]

The model just predicts text, Claude Code etc parse the output and do the actual file creation (or run shell commands that do it). If you have Claude Code installed look in ~/.claude/projects/... and you can see the transcripts of your actual sessions, or install Mini-SWE-Agent and play with that to get a feel for what's going on.
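The split described above (the model emits text, the agent parses it and acts) can be sketched in a few lines. This is a toy illustration, not how Claude Code or any specific agent is implemented: `fake_model`, the `write_file` tool name, and the JSON shape are all hypothetical stand-ins, and a real agent would call an actual model, loop over multiple turns, and sandbox file paths.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def fake_model(prompt: str) -> str:
    """Stand-in for an LLM: it only returns text, here a JSON tool call."""
    return json.dumps({
        "tool": "write_file",
        "args": {"path": "hello.txt", "content": "hello from the agent\n"},
    })


def run_agent_step(prompt: str, workdir: Path) -> Optional[Path]:
    """Parse the model's text output and perform the requested action."""
    reply = fake_model(prompt)
    try:
        call = json.loads(reply)
    except json.JSONDecodeError:
        print(reply)  # plain text: nothing to execute, so just show it
        return None
    if call.get("tool") == "write_file":
        target = workdir / call["args"]["path"]
        target.write_text(call["args"]["content"])  # the agent, not the model, touches the disk
        return target
    return None


workdir = Path(tempfile.mkdtemp())
created = run_agent_step("Create hello.txt", workdir)
print(created, created.read_text() if created else "")
```

Without the `run_agent_step` half, the JSON would simply be printed to stdout, which matches the behavior described in the question.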

squidbeak 10 hours ago [ - ]

Deepseek had another moment a few weeks ago. V4 isn't far behind the US frontier, and so far its flash variant seems a very reliable coder and costs a pittance.

ai_fry_ur_brain 10 hours ago [ - ]

Deepseek V4 (not flash) tripled in price too by the way (from Deepseek). Get used to this pattern.

This is what you get for relying on the generosity of billionaires. Keep offshoring your thinking ability to a machine and let me know how competitive you are. Hint, you won't be. There's nothing special about being able to use an LLM.

barrell an hour ago [ - ]

Actually, deepseek v4 was at a 1/3 promotional price for the first month or so. This was pretty clearly communicated. The promotional window just ended is all.

npn 10 hours ago [ - ]

Unlike other providers, Deepseek does promise that they will lower the price when their Huawei cards arrive in a few more months.

flakiness 8 hours ago [ - ]

Give me a link. Cannot wait. One PSA is that they have a 75% discount right now so it is already cheaper than the full price.

npn 8 hours ago [ - ]

Weird, last time I checked it was right on the pricing page.

But even when it happens I doubt it would be as cheap as it is right now. Enjoy it while it lasts!

ls612 10 hours ago [ - ]

Anyone can host Deepseek V4 on rented GPUs and sell inference on it. Price will very quickly converge to the marginal cost of inference. This is as close to a pure commodity as it gets in the AI space so competitive market economics will put in work. Same is true for any open-weights model.

ai_fry_ur_brain 9 hours ago [ - ]

You don't understand the costs involved to run inference at scale.

Please go run some numbers. The hardware needed to run Deepseek v4 flash at 20 tps for a single session is nowhere close to what is required to run it at 50 tps for 5,000 concurrent sessions.

Imagine what it takes to be profitable when running at 150 tps for 30 cents per 1M tokens. You make less than 1k per month, and the hardware required to run that costs 10k a month to rent with hardly any capacity for concurrent sessions.
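The single-session figure above can be checked with simple arithmetic. This sketch takes the 150 tps and 30 cents per 1M tokens from the comment and assumes a 30-day month at 100% utilization with one session and no batching, which is the scenario the comment describes.

```python
TOKENS_PER_SECOND = 150   # single session, from the comment
PRICE_PER_M = 0.30        # USD per 1M tokens, from the comment
SECONDS_PER_MONTH = 60 * 60 * 24 * 30  # assumes a 30-day month at 100% utilization

tokens_per_month = TOKENS_PER_SECOND * SECONDS_PER_MONTH
revenue_per_month = tokens_per_month / 1_000_000 * PRICE_PER_M

print(f"tokens/month: {tokens_per_month:,}")
print(f"revenue/month: ${revenue_per_month:.2f}")
```

Under those assumptions one session earns well under $1k per month, consistent with the comment. The result scales with the number of sessions a GPU can serve concurrently, which is the variable the replies dispute.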

gpugreg 7 hours ago [ - ]

> Please go run some numbers.

- DeepSeek serves DeepSeek V4 Pro at 27 tps: https://openrouter.ai/deepseek/deepseek-v4-pro

- At 27 tps per user, a B300 GPU will give you around 800 tokens per second (serving 30 users): https://developer-blogs.nvidia.com/wp-content/uploads/2026/0...

- That's 800 * 60 * 60 generated tokens per hour, at a cost of $0.87 per 1M tokens, or $2.50 per hour.

- For input and output tokens, the math is a bit more complicated because we have to make assumptions about their ratio. Using the published values from OpenCode, we get another $2.50 for cached tokens (which are almost free for DeepSeek) and another $3.40 for input tokens (which are a lot cheaper to compute than output tokens), which gives us a total of $8.50 per hour per B300 GPU.

- B300 GPUs can be rented for as low as $3.40 per hour, which is less than $8.50, so hosting DeepSeek V4 Pro is profitable.

You could also host it at fewer tps per user to raise the efficiency and therefore the profit even higher.
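The estimate above can be reproduced as a short calculation. Every constant is taken from the comment as stated; the cached-token and input-token revenue figures are used as given rather than rederived, since the input/output ratio behind them is not spelled out here.

```python
TOKENS_PER_SECOND = 800          # aggregate output throughput per B300, from the comment
OUTPUT_PRICE_PER_M = 0.87        # USD per 1M output tokens, from the comment
CACHED_REVENUE_PER_HOUR = 2.50   # USD, from the comment
INPUT_REVENUE_PER_HOUR = 3.40    # USD, from the comment
GPU_RENT_PER_HOUR = 3.40         # USD, from the comment

output_tokens_per_hour = TOKENS_PER_SECOND * 60 * 60
output_revenue = output_tokens_per_hour / 1_000_000 * OUTPUT_PRICE_PER_M
total_revenue = output_revenue + CACHED_REVENUE_PER_HOUR + INPUT_REVENUE_PER_HOUR
margin = total_revenue - GPU_RENT_PER_HOUR

print(f"output tokens/hour: {output_tokens_per_hour:,}")
print(f"output revenue/hour: ${output_revenue:.2f}")
print(f"total revenue/hour: ${total_revenue:.2f}")
print(f"margin over GPU rent: ${margin:.2f}")
```

With the rounded inputs the total comes out slightly below the quoted $8.50, presumably because the comment rounded its intermediate figures. The conclusion that revenue exceeds the rental price holds either way, under the comment's throughput and utilization assumptions.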

ls612 7 hours ago [ - ]

Even without assuming Blackwell inference, the $3.50/hr price is likely close to the marginal cost. The Deepseek R1 model is a little more than a third of the size of V4 and cost around $1/Mtok to serve at scale based on Deepseek's blog posts last year and Hopper rental prices.

ls612 9 hours ago [ - ]

Yes it is more efficient in $/tok to run at scale than to run just for yourself. Everyone selling Deepseek V4 inference is selling an undifferentiated good. They have run the numbers on how much it costs and are competing against a dozen other outfits also selling undifferentiated open weights tokens. Whatever the dollar cost they face to rent those GPUs will be what they are able to charge in the competitive market. That is great for you and me because we can buy tokens at pretty much exactly what it costs to produce them.

dpoloncsak 10 hours ago [ - ]

Mate, why are you so mad at people upset the price tripled? It's a fair complaint that people built services using the cheaper ones with the expectation future models would be similarly priced. You can avoid 'offloading thinking' while still building on top of these models.

zaptrem 8 hours ago [ - ]

V4-Pro is about 2.4× total params and 1.3× active params of V3.2.

creationcomplex 7 hours ago [ - ]

You're typing as your handwriting and letter sending abilities deteriorate to dust. Writing down information as your memory capacity decays. Remembering instead of living at the pure leading edge of perception dulling your reactions.

Smh, it's all downhill from the first unadulterated neuron.

aurareturn 10 hours ago [ - ]

I think demand is too great and compute is not enough. Nothing to do with billionaires colluding to increase prices by 3x.

boutell 7 hours ago [ - ]

Actually, why should Google collude on pricing? They have deep pockets and could starve out the competition while keeping prices low, if they really wanted.

I think it is priced high because it's basically their smartest model as well as their fastest, so why shouldn't they?

You can still use earlier generations of Flash at a lower cost if you want "fast and cheap and just OK," which often makes sense. (Just checked)

I would predict they will lower this price when 3.5 High appears, but perhaps not all the way.

xbmcuser 9 hours ago [ - ]

What we need is a deepseek moment in hardware, i.e. China reaching parity on node size. That is the only way the latest computers, let alone the latest AI, will be available to us in the future; otherwise the profit margins will push most production to AI.

throwa356262 9 hours ago [ - ]

To be honest, China not having access to the latest hardware is exactly what has driven LLM technology forward the last 2 years.

humanfromearth9 9 hours ago [ - ]

Why?

Weryj 9 hours ago [ - ]

Because it forced them to focus on efficiency, instead of throwing more compute at the problem.

Just like in software, some of the most beautiful solutions come from constraints. Think of the optimisations that game developers implemented because of the frame budget.

Viacol 4 hours ago [ - ]

On top of that, China is also facing hardware constraints, which is pushing companies to develop better domestic chips for AI training. It'll be interesting to see how things perform once Huawei's newer hardware is fully deployed at DeepSeek.

blackoil 2 hours ago [ - ]

Open source ASML EUV. But that would wipe trillions off US stocks, so 401ks may not like that.

stared 7 hours ago [ - ]

We have a "DeepSeek moment", https://github.com/antirez/ds4 (see https://news.ycombinator.com/item?id=48142108).

Or if you prefer smaller ones, Qwen3.6-35B-A3B, https://huggingface.co/bartowski/Qwen_Qwen3.6-35B-A3B-GGUF

segmondy 10 hours ago [ - ]

You can use lots of open weight models today.

hei-lima 10 hours ago [ - ]

That's one solution to the problem. But it still needs some good computational capabilities. Either we optimize the hell out of those models, or we wait for the hardware to become good enough for them.

Gigachad 7 hours ago [ - ]

The real problem is the hardware to run them is still very expensive.

pianopatrick 9 hours ago [ - ]

Maybe we can figure out better ways to use the models that can run on cheap hardware.

GeorgeOldfield 10 hours ago [ - ]

gemini isn't even that good. just tested 3.5 on usual complex prompts to opus/chat 5.5. meh

k8sToGo 9 hours ago [ - ]

Are you really comparing flash to opus? Shouldn't you be comparing pro?

CognitiveLens 9 hours ago [ - ]

The benchmark tables in the Google announcement include Opus 4.7, and the numbers are very impressive. Caveat emptor, but it's not unreasonable to compare a new Flash to a current-gen Opus, even if some of the results confirm expectations.

bachmeier 9 hours ago [ - ]

Who would have guessed that something costing roughly a third as much wouldn't do as well at certain tasks.

kmac_ 9 hours ago [ - ]

Well, the first impression is that Gemini still goes off the instruction rails more easily than other models, but I noticed that it tends to go back to the initial goal without hand-holding, which is a real improvement. It's really interesting that these models behave so differently.