This is why it’s so critical to have open source models.
In a year or so, the open source models will become good enough (in both quality and speed) to run locally.
Arguably, OpenAI OSS 120B is already good enough, in both quality and speed, to run on Mac Studio.
Then $10k, amortized over 3 years, will be enough to run code LLMs 24/7.
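Napkin math for that claim, assuming ~$10k of hardware written off over 3 years plus some electricity, compared against a hypothetical $200/month hosted plan (every figure here is a rough assumption, not vendor pricing):

    # Back-of-envelope: self-hosted box vs. a hosted subscription.
    # All numbers are illustrative assumptions.
    hardware_cost_usd = 10_000      # e.g. a high-spec Mac Studio
    amortization_months = 36        # 3-year write-off
    power_usd_per_month = 30        # rough electricity estimate

    local_monthly = hardware_cost_usd / amortization_months + power_usd_per_month
    hosted_monthly = 200            # assumed top-tier subscription price

    print(f"local:  ~${local_monthly:.0f}/month, available 24/7")
    print(f"hosted: ~${hosted_monthly}/month, rate-limited")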
I hope that’s the future.
Open source models could be run by low-cost cloud providers, too. They could offer discounts for a long term contract and run it on dedicated hardware.
This. Your local LLM, even if shared between a pool of devs, is probably only going to be working 8 hours a day. Better to use a cloud provider, especially if you can find a way to ensure data security, if that is an issue for you.
Exactly. There is no shortage of providers hosting open source models with per-token pricing, with a variety of speeds and context sizes at different price points. Competition is strong and barriers to entry are low, ensuring that margins stay low and prices stay fair.
If you want complete control over your data and don't trust anyone's assurances that they keep it private (and why should you?), then you have to self-host. But if all you care about is a good price, then the free market already provides that for open models.
Hetzner already offers dedicated GPU servers from €180/month.
Hetzner and Scaleway already do instances with GPUs, so this kinda already exists.
In fact, does anybody want to hire a server with me? I suspect it'll work out cheaper than Claude Max etc.: a server from Hetzner starts at £220ish: https://www.hetzner.com/dedicated-rootserver/matrix-gpu/
It might be fun to work out how to share, too. A whole new breed of shell hosting.
I have a couple of non-GPU servers with them and quite a few Hetzner Cloud projects, but I never understood their GPU offering. They have just two (small) VRAM sizes and you pay per month, whereas the ones like Runpod have a large selection of whatever you need, and they are cheaper, and you can rent them for a shorter period like two weeks, with no setup time. Am I missing something?
There are use cases for smaller models that fit in 20 GB of VRAM, like in-database sentiment analysis and such.
Sure, my point is that you can get these cheaper outside of Hetzner, so I don't really understand who these are for.
Every business building on LLMs should also have a contingency plan for if they needed to go to an all open-weights model strategy. OpenAI / Anthropic / Google have nothing stopping them from 100x-ing the price or limiting access or dropping old models or outright competing with their customers. Building your whole business on top of them will prove to be as foolish as all of the media companies that built on top of Facebook and got crushed later.
Couldn't you also make this argument about cloud infrastructure from the standard hyperscaler cloud providers (AWS, GCP, ...)? For that matter, couldn't you make this argument about any dependency your business has that it purchases from other businesses competing against each other to provide it?
In general, you are right, but AI as a field is pretty volatile still. Token producers are still pivoting and are generally losing money. They will have to change their strategy sooner or later, and there is a good chance that the users will not be happy about it.
Are the cloud computing providers burning tens of millions of dollars each day and having to resort to NBA-player level salaries to recruit talent?
AWS/GCP are at least making money with their current pricing model.
When your provider is dumping at a loss, it's their way of saying that the business plan is to maximize lock-in/monopoly effects followed by the infamous "enshittification".
I mean, most large businesses are performing risk analysis for exactly this.
> OpenAI / Anthropic / Google have nothing stopping them from 100x-ing the price
There is also nothing stopping this silly world from breaking out into a dispute where chips are embargoed. Then we'll have high API prices and hardware prices (if there's any hardware at all). Even for the individual it's worth having that 2-3k AI machine around, perhaps two.
> OpenAI / Anthropic / Google have nothing stopping them from 100x-ing the price
presumably... capitalism still exists?
Many of the larger enterprises (retail, manufacture, insurance, etc) are consistently becoming cloud-only or have reduced their data center footprint massively over the last 10 years.
Do you think these enterprises will begin hosting their own models? I'm not convinced they'll join the capex race to build AI data centers. It would make more sense they just end up consuming existing services.
Then there are the smaller startups that just never had their own data center. Are those going to start self-hosting AI models, along with all of the related requirements to let, say, a few hundred employees access a local service at once: networking, HA, upgrades, etc.? Say you have multiple offices in different countries as well, and so on.
> manufacture
They're much less strict than they were on cloud, but the security practices are really quite strict. I work in this sector and yes, they'll allow cloud, but strong data isolation + segregation, access controls, networking reqs, etc. etc. etc. are very much a thing in the industry still, particularly where the production process is commercially sensitive in itself.
> Do you think these enterprises will begin hosting their own models? I'm not convinced they'll join the capex race to build AI data centers. It would make more sense they just end up consuming existing services.
they already are
Enterprises (depending on the sector, think semi manufacturing) will have no choice for two reasons:
1. Protecting their intellectual property, and
2. Unknown “safety” constraints baked in. Imagine an engineer unable to run some security tests because the LLM thinks it’s “unsafe”. Meanwhile, the VP of Sales is on the line with the customer.
I am looking forward to the AMD 395 Max+ PCs coming down in price.
Local inference speed will be acceptable in 5-10 years thanks to that generation of chips, and finally we can have good local AI apps.
They don’t have the memory bandwidth
256 GB/s of memory bandwidth is low, but it still does around 40 tokens per second with gpt-oss, which is good enough for local apps.
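For intuition, here's the rough memory-bandwidth-bound estimate behind that number, assuming gpt-oss-120b's commonly cited ~5.1B active parameters at ~4-bit weights (treat both as approximations):

    # Crude upper bound on decode speed when memory bandwidth is the bottleneck.
    # Inputs are assumptions for illustration, not measurements.
    bandwidth_gb_s = 256        # AMD 395 Max+ class memory bandwidth
    active_params_b = 5.1       # gpt-oss-120b active parameters, in billions
    bytes_per_param = 0.5       # ~4-bit (MXFP4) weights

    weights_read_per_token_gb = active_params_b * bytes_per_param   # ~2.6 GB
    ceiling_tok_s = bandwidth_gb_s / weights_read_per_token_gb      # ~100 tok/s

    # Real decoders typically reach maybe 40-60% of that ceiling once the
    # KV cache, activations, and scheduling overhead are accounted for.
    print(f"theoretical ceiling: ~{ceiling_tok_s:.0f} tok/s")
    print(f"plausible real-world: ~{0.4 * ceiling_tok_s:.0f} tok/s")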
> In a year or so, the open source models will become good enough (in both quality and speed) to run locally.
"Good enough" for what is the question. You can already run them locally, the problem is that they aren't really practical for the use-cases we see with SOTA models, which are just now becoming passable as semi-reliable autonomous agents. There is no hope of running anything like today's SOTA models locally in the next decade.
GPU compute per dollar has been on a pretty steady curve of around 10x per decade. In ML for computer vision we were also able to make models around 10x as efficient per decade. I think with these two factors combined, mapped to LLMs, we will be able to match the performance of, say, Sonnet 4 on a 2000 USD workstation well within 10 years from today.
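Making the compounding explicit (treating both 10x-per-decade figures and today's hardware cost as rough assumptions, not facts):

    # Two independent ~10x-per-decade trends compound multiplicatively.
    years = 10
    compute_per_dollar_gain = 10 ** (years / 10)   # GPU compute per dollar
    model_efficiency_gain = 10 ** (years / 10)     # compute needed per unit of quality
    total_reduction = compute_per_dollar_gain * model_efficiency_gain   # ~100x

    # Hypothetical: if serving a Sonnet-4-class model takes ~$200k of GPUs today,
    # a ~100x cost reduction lands it in workstation territory.
    today_hw_cost_usd = 200_000   # illustrative assumption, not a real figure
    print(f"combined reduction over {years} years: ~{total_reduction:.0f}x")
    print(f"implied hardware cost: ~${today_hw_cost_usd / total_reduction:,.0f}")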
Even with 10x more efficient models and GPU compute, hundreds of GB of VRAM will still be on the order of tens of thousands of USD for the foreseeable future.
How long is the foreseeable future? In 10 years I think an LLM accelerator (GPU/NPU/etc.) with 100 GB of VRAM will cost under 2000 USD.
VRAM prices have remained flat for the last decade, so no evidence of that coming.
Beyond that, running inference on the equivalent of a 2025 SOTA model with 100GB of VRAM is very unlikely. One consistent quality of transformer models has been the fact that smaller and quantized models are fundamentally unreliable, even though high quality training data and RL can boost the floor of their capabilities.
GDDR6 8Gb spot pricing (DRAMExchange) is now around 2.6 USD, down from 3.5 USD in summer 2023 and 6 USD in summer 2022? The last year has been pretty flat, though!
they might be passable, but there's zero chance they're economical atm.
IMO local models are kind of inevitable.
Hardware vendors will create efficient inference PCIe chips, and innovations in RAM architecture will make even mid-level devices capable of running local 120B parameter models efficiently.
Open source models will get good enough that there isn’t a meaningful difference between them and the closed source offerings.
Hardware is relatively cheap, it’s just that vendors haven’t had enough cycles yet on getting local inference capable devices out to the people.
I give it 5 years or so before this is the standard
What's the performance of running OpenAI OSS 120B on a Mac Studio compared to running a paid-subscription frontier LLM?
I will answer for the 20B version on my RTX 3090 for anyone who is interested (SUPER happy with the quality it outputs, as well). I've had it write a handful of HTML/CSS/JS SPAs already.
With medium and high reasoning, I will see between 60 and 120 tokens per second, which is outrageous compared to the Llama models I was running before (20-40 tps; I'm sure I could have adjusted parameters somewhere in there).
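If anyone wants to try a comparable run, here's a minimal sketch using the Ollama Python client and the gpt-oss:20b tag mentioned elsewhere in this thread (assumes Ollama is installed and the model is pulled; I'm not claiming it's the fastest backend for this model):

    # Minimal local generation sketch: stream a response from gpt-oss:20b
    # via a locally running Ollama server.
    # Setup: pip install ollama; ollama pull gpt-oss:20b
    import ollama

    stream = ollama.chat(
        model="gpt-oss:20b",
        messages=[{"role": "user", "content": "Write a single-file HTML/CSS/JS todo app."}],
        stream=True,
    )
    for chunk in stream:
        print(chunk["message"]["content"], end="", flush=True)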
Do we know why it’s so fast barring hardware?
Because he's getting crap output. Open source run locally on something that underpowered is vastly worse than paid LLMs.
I'm no shill; I'm fairly skeptical about AI, but I've been doing a lot of research and playing around to see what I'm missing.
I haven't bothered running anything locally as the overwhelming consensus is that it's just not good enough yet. And that's from posts and videos in the last two weeks.
I've not seen something so positive about local LLMs anywhere else.
It's simply just not there yet, and it definitely isn't for a 4090.
I don’t see how you can make these claims without having your own evals and running these models yourself. The gpt-oss results I’m getting for my use case, which is agentic task execution for a wide variety of tasks on my local device, are spectacular, even more so when you stack them up against every model in the 20B weight class.
That's what I've been feeling too. But it is just a feeling. I'm not running any benchmarks.
My agentic coding "app" (basically just a tool "server" around dotnet/git/fs commands with a kanban board) seems to be able to spit out quick SPAs with little additional prompting.
That is a bit harsh. I'm actually quite pleased with the code it is outputting currently.
I'm not saying it is anywhere close to a paid foundation model, but the code it is outputting (albeit simple) has been generally well written and works. I do only get a handful of those high-thought responses before the 50k token window starts to delete stuff, though.
I guess I meant how is a 20b param model simply faster than another 20b model? What techniques are they using?
It's a MoE (mixture of experts) architecture, which means that there are only 3.6 billion parameters activated per token (but a total of 20b parameters for the model). So it should run at the same speed that a 3.6b model would run, assuming that all of the parameters fit in VRAM.
Generally, 20b MoE will run faster but be less smart than a 20b dense model. In terms of "intelligence" the rule of thumb is the geometric mean between the number of active parameters and the number of total parameters.
So a 20b model with 3.6b active (like the small gpt-oss) should be roughly comparable in terms of output quality to a sqrt(3.6*20) = 8.5b parameter model, but run with the speed of a 3.6b model.
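Plugging in the numbers (the 120b figures below are the commonly quoted ones, so treat them as approximate):

    from math import sqrt

    # Heuristic dense-equivalent quality for an MoE model:
    # geometric mean of active and total parameters (a rule of thumb, not a law).
    def dense_equivalent(active_b: float, total_b: float) -> float:
        return sqrt(active_b * total_b)

    # gpt-oss-20b: ~3.6B active out of ~20B total
    print(f"gpt-oss-20b:  ~{dense_equivalent(3.6, 20):.1f}B-dense quality, ~3.6B-class speed")

    # gpt-oss-120b: ~5.1B active out of ~117B total (commonly cited figures)
    print(f"gpt-oss-120b: ~{dense_equivalent(5.1, 117):.1f}B-dense quality, ~5.1B-class speed")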
Chiming in here: an M1 Max MacBook Pro with 64 GB using gpt-oss:20b via Ollama with Visual Studio Code and GitHub Copilot is unusably slow compared to using Claude Sonnet 4, which requires (I think?) GitHub Copilot Pro.
But I'm happy to pay the subscription vs buying a Mac Studio for now.
Ollama's implementation for gpt-oss is poor.
Even if they do get better, the latest closed-source {gemini|anthropic|openai} model will always be insanely good, and it would be dumb to use a local one from 3 years back.
Also tooling: you can use Aider, which is OK. But Claude Code and Gemini CLI will always be superior and will only work correctly with their respective models.
I don’t know about your first point: at some point the three-year difference may not be worth the premium, as local models reach “good enough.”
But the second point seems even less likely to be true: why will Claude Code and Gemini CLI always be superior? Other than advantageous token prices (which the people willing to pay the aforementioned premium shouldn’t even care about), what do they inherently have over third-party tooling?
Even using Claude Code vs. something like Crush yields drastically different results. Same model, same prompt, same cost... the agent is a huge differentiator, which surprised me.
I totally agree that the agent is essential, and that right now Claude Code is semi-unanimously the best agent. But agentic tooling is written, not trained (as far as I can tell—someone correct me) so it’s not immediately obvious to me that a third-party couldn’t eventually do it better.
Maybe to answer my own question, LLM developers have one, potentially two advantages over third-party tooling developers: 1) virtually unlimited tokens, zero rate limiting with which to play around with tooling dev. 2) the opportunity to train the network on their own tooling.
The first advantage is theoretically mitigated by insane VC funding, but will probably always be a problem for OSS.
I’m probably overlooking news that the second advantage is where Anthropic is winning right now; I don’t have intuition for where this advantage will change with time.
Depends on your use case though. You don't always need the best. Even if you have a hypercar, you probably drive a regular car to work.
There's also a personal good-enough point for everyone who's hoping to cut the cord and go local. If local models get as good as the current Claude Sonnet, I would actually be totally fine using that locally and riding the local improvements from then on.
And for local stuff like home automation or general conversational tasks, local has been good enough for a while now. I don't need the hypercar of LLMs to help me with cooking a recipe for example.
> Even if they do get better. The latest closed-source {gemini|anthropic|openai} model will always be insanely good and it would be dumb to use a local one from 3 years back.
If they use a hosted model, they’ll probably pin everything initially to, at best, the second newest model from their chosen provider (the newest being insufficiently proven) and update models to something similarly behind only when the older model goes completely out of support.
I use Claude Code with other models sometimes.
For well defined tasks that Claude creates, I'll pass off execution to a locally run model (running in another Claude Code instance) and it works just fine. Not for every task, but more than you might think.
Why bother mentioning this model? From what I've seen, it only excels at benchmarks. Qwen3 is sorta where it's at right now; Qwen3-Coder is pretty much at "summer intern" level for coding tasks, and it's ahead of the rest.
Shame anyone is actually _paying_ for commercial inference; it's worse than whatever you can do locally.
Problem is that it really eats all resources when using an LLM locally. I tried it, but the whole system becomes unresponsive and slow. We need a minimum of 1 TB of memory and dedicated processors to offload to.
After trying gpt-oss:20b, I'm starting to lose faith in this argument, but I share your hope.
Also, I've never tried really huge local models and especially not RAG with local models.
It's not hard to imagine a future where I license their network for inference on my own machine, and they can focus on training.
> In a year or so, the open source models will become good enough (in both quality and speed) to run locally.
Based on what?
And where? On systems < 48GB?
It's not: capitalism isn't about efficiency; it's about lock-in. You can't lock in open source models. If fascism under Republicans continues, you can bet they'll be shut down due to child safety or whatever excuse the large corporations need to turn off the free efficiency.
This is unrealistic hopium, and deep down you probably know it.
There's no such thing as models that are "good enough". There are models that are better and models that are worse and OS models will always be worse. Businesses that use better, more expensive models will be more successful.
> Businesses that use better, more expensive models will be more successful.
Better back-of-house tech can differentiate you, but startup history is littered with failed companies using the best tech, and they were often beaten by companies using a worse-is-better approach. Anyone here who has been around long enough has seen this play out a number of times.
> startup history is littered with failed companies using the best tech, and they were often beaten by companies using a worse-is-better approach.
Indeed. In my idealistic youth I bought heavily into "if you build it, they will come," but that turned out to not at all be reality. Oftentimes the best product loses because of marketing, network effects, or some other reason that has nothing to do with the tech. I wish it weren't that way, but if wishes were fishes we'd all have a fry.
Sometimes the best tech is too early and too expensive.
Most tech hits a point of diminishing returns.
I don't think we're there yet, but it's reasonable to expect at _some point_ your typical OS model could be 98% of the way to a cutting edge commercial model, and at that point your last sentence probably doesn't hold true.
There is cost/benefit analysis yes but fundamentally there are no diminishing returns because intelligence does not have diminishing returns.
One of the key concepts in the AI zeitgeist is the possibility of superintelligence. There will always be the possibility of a more productive AI agent.
There is a sweet spot, and at $100k per dev per year some businesses may choose lower-priced options.
The business itself will also massively develop in the coming years. For example, there will be dozens of providers for integrating open source models with an in-house AI framework that smoothly works with their stack and deployment solution.
I agree. It isn't in the interest of any actor, including OpenAI, to give out their tools for free.
Most devs where I'm from would have to scrape to cough up that amount.
More niche use-case models have to be developed for cheaper and more energy-optimized hardware.
This would be a business expense. Compared to hiring a developer for a year, it would be more reasonable.
For a short-term gig, though, I don’t think they would do that.