A major current problem is that we're smashing gnats with sledgehammers via undifferentiated model use.
Not every problem needs a SOTA generalist model, and as we get systems/services that are more "bundles" of different models with specific purposes I think we will see better usage graphs.
Because none of them are good enough yet to trust completely with any task. Even the absolute best ones still fart out at surprising times, and for most stuff I already have an AI that's always on and requires no cognitive overhead to delegate to: my own brain. So to delegate, it has to be a reliable win. I'm not here to make AI look good, I'm here to make my own performance good, and only a sure thing is a candidate for reflexive delegation.
AI companies advertise peak AI performance; users select AI tools based on worst-case AI fuckups. Hence, only SOTA is ever in demand. TFA illustrates this well.
AI will be judged on its worst performance, just like people are fired for their worst showing, not their best. No one cares about AI performance in ideal (read: carefully contrived) settings. We care how badly it fucks up when we take our eyes off it for two seconds.
Yeah, but the juiciest tasks are still far from solved. The number of tasks where people are willing to accept low-accuracy answers is not that high. It's maybe true for some text-processing pipelines, but all the user-facing use cases require good performance.
Yeah, this is the thing people miss a lot. 7B and 32B models work perfectly fine for a lot of things, and run on previously high-end consumer hardware.
But we're still in the hype phase; people will come to their senses once large-model performance starts to plateau.
I expect people to come to their senses when LLM companies stop subsidizing cost and start charging customers what it actually costs them to train and run these models.
I mean, there is no reason for an inference provider of open models to subsidize you. And costs there are usually cheaper than Claude API pricing.
It's still a market though; there is always an incentive to subsidize if all the competition is keeping prices artificially low.
People don't want to guess which size of model is right for a task, and current systems are neither good nor efficient at estimating that automatically. I see only the power users tweaking more and more as performance plateaus, and the average user only changing when it's automatic.
> 7B and 32B models work perfectly fine for a lot of things
Like what? People always talk about how amazing it is that they can run models on their own devices, but rarely mention what they actually use them for. For most use cases, small local models will always perform significantly worse than even the most inexpensive cloud models like Gemini Flash.
Gemma 3n E4B has been crazy good for me - a fine-tune running on Google Cloud Run via Ollama, completely avoiding token-based pricing at the cost of throughput limitations.
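For anyone curious what the client side of that setup looks like, here's a minimal sketch. The Cloud Run URL is a placeholder and the model tag may differ for your deployment; the /api/generate endpoint and JSON shape are standard Ollama:

    import requests

    # Placeholder Cloud Run URL for an Ollama deployment; substitute your own.
    OLLAMA_URL = "https://my-ollama-service-xyz.a.run.app"

    def generate(prompt: str, model: str = "gemma3n:e4b") -> str:
        """Call Ollama's /api/generate endpoint and return the full completion."""
        resp = requests.post(
            f"{OLLAMA_URL}/api/generate",
            json={"model": model, "prompt": prompt, "stream": False},
            timeout=120,
        )
        resp.raise_for_status()
        return resp.json()["response"]

    print(generate("Summarize this ticket in one sentence: ..."))

You pay for Cloud Run compute time instead of per-token, which is where the savings over token-based pricing come from.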
What kind of applications are you using it for?
This is a place where testing and benchmarking can definitely save you money.
It's the same as compute: you can skip testing and throw money at the problem, but you're going to end up paying more.
We have some pretty basic guidelines at work and I think that's a decent starting point. They amount to a few example prompts/problem types and which OpenAI model to try using first for best bang for your buck.
I think some of it also comes down to scale. Buying a 5-pack of sledgehammers isn't a terrible value when everything comes in a "5-pack" and you only need <= 5 tools total. Or more practically: on the small end it's more economical to run general-purpose models than to tailor more specific ones. Once you start invoking them enough, there's a break-even point where it flips and spending more time on a tailored or custom model becomes cheaper.
Completely agree. It's worth spending time to experiment too. A reasonably simple chat support system I built recently uses 5 different models depending on the function it's performing. Swapping out different models for different things makes a huge difference to cost, user experience, and quality.
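As a rough illustration of that kind of routing (the model names and task labels here are invented for the sketch, not the ones from my actual system):

    # Hypothetical per-function routing table: cheap models for high-volume
    # steps, a frontier model only where quality is critical.
    MODEL_FOR_TASK = {
        "classify_intent":  "small-fast-model",   # high volume, low stakes
        "extract_fields":   "small-fast-model",
        "draft_reply":      "mid-tier-model",     # user-facing, needs fluency
        "escalation_check": "mid-tier-model",
        "final_review":     "frontier-model",     # rare, quality-critical
    }

    def pick_model(task: str) -> str:
        # Unknown tasks fall back to the mid-tier model rather than
        # defaulting to the most expensive one.
        return MODEL_FOR_TASK.get(task, "mid-tier-model")

The win comes from volume: the cheap calls dominate traffic while the expensive model only sees a small fraction of it.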
If there was an option to have Claude Opus guide Sonnet I'd use it for most interactions. Doing it manually is a hassle and breaks the flow, so I end up using Opus too often.
This shouldn't be that expensive even for large prompts, since input tokens are cheaper than output tokens (prefill is processed in parallel, unlike sequential decoding).
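Rough arithmetic with illustrative prices (roughly Opus-class, say ~$15/M input and ~$75/M output tokens; not quoted from any price sheet):

    # Illustrative pricing, dollars per million tokens.
    INPUT_PER_MTOK, OUTPUT_PER_MTOK = 15.0, 75.0

    prompt_tokens, guidance_tokens = 50_000, 1_000  # big context, short plan
    cost = (prompt_tokens * INPUT_PER_MTOK
            + guidance_tokens * OUTPUT_PER_MTOK) / 1e6
    print(f"${cost:.2f} per guidance pass")  # ~$0.82: big input, still cheap

Even a 50k-token prompt costs well under a dollar per pass when the expensive model only emits a short plan.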
You can define subagents that are forced to run on, e.g., Sonnet, and call these from your main Opus-backed agent. /agents in CC for more info...
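For reference, a subagent is just a Markdown file with YAML frontmatter under .claude/agents/. A minimal sketch (the name, description, and prompt here are invented; check the current docs for the exact frontmatter fields):

    ---
    name: test-runner
    description: Runs the test suite and reports failures. Use after code changes.
    model: sonnet
    ---

    You run tests and summarize failures concisely. Do not edit code;
    report what failed and the most likely cause.

The model field is what pins it to Sonnet, so your Opus-backed main agent can delegate the grunt work.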
That's what I do. I used to use Opus for the dumbest stuff, writing commits and such, but now that's all subagent business running on Sonnet (or even Haiku sometimes). Same for running tests, executing services, docker, etc. All Sonnet subagents. Positive side effect: my Opus allotment lasts a lot longer.
I’m just sitting here on my $20 subscription hoping one day we will get to use Opus
You can just get your own account right? Just pay out of pocket.
generalist = fungible?
In the food industry is it more profitable to sell whole cakes or just the sweetener?
The article makes a great point about Replit and legacy ERP systems. The generative in generative AI will not replace storage; storage is where the margins live.
Unless the C in CRUD can eventually replace the R and U, with the D a no-op.
> In the food industry is it more profitable to sell whole cakes or just the sweetener?
I really don't understand what you're getting at. But on that example: cakes have a higher profit margin, and sweeteners have larger scale.
Isn't this just MoE?