The comments I see recommending selective use of cheaper models don't match the reality I experience working in the industry. I have the constant threat hanging over my head of being fired if I don't churn out code quickly enough. I'm not willing to gamble with my livelihood by using a less effective model.
Saving money on tokens isn't something that's rewarded during performance reviews; particularly because it's difficult to quantify how much you saved versus hypothetically using a more expensive model.
I think quantifying tokens used is analogous to quantifying the amount of sawdust generated on a construction site.
The problem of churning out useful code quickly is not solved by using more tokens per unit time. Most non-technical leaders can grasp this one and are likely more interested in the strategic game-theoretic dynamics that are being forced by way of implied token consumption expectations (competition between developers).
If you want to hold out as long as possible and don't really care about anything other than the compensation package, you should at least play along with this new game in a half-assed manner. Try to goldilocks your token usage between any established extremes. You want to be in the statistical barycenter of every AI report that management can create.
To understand the token count thing - spending tokens is necessary and not sufficient to demonstrate that you are adopting AI.
Where we were 6mo ago is that a lot of big orgs realized they were behind, and needed some way of measuring if the tools were usable at all.
No sawdust at all on your job site, and you can tell nobody is cutting wood.
Now that tooling is more mature, you can measure things like % of diffs AI-generated, % of AI suggestions accepted vs edited, % of KB queries successful etc - all more useful than raw token count for quantifying how your org is using the tool.
So it’s a pragmatic metric that got a bit Goodharted.
No sawdust is bad. But it's also bad if you cut all your boards into sawdust. Completely. Obliterated. No useful output, only sawdust.
% of AI suggestions accepted vs. edited is also a BS metric that Anthropic et al. like to push, similar to LoC, because they're large numbers and large numbers must be good, right?
Well guess what, I have auto-accept on and then adjust after it's "done". And I do it by telling it what changes to make and those have auto-accept on as well. That's quite a high "accept" rate, by definition. But in reality it may have churned on 50% of the lines it generated and auto-accepted first.
> % of AI suggestions accepted vs. edited is also a BS metric
I disagree. It’s a valuable metric if you are building an agent / skill infra layer.
Think of it like error rate on your API. Green metric does not mean your system is healthy, but if it’s red you have an issue you definitely need to fix.
Your example scenario is detectable in the non-naive implementation anyway; the o11y layer (usually OTel these days) tracks the trajectories, links them to the diff, and attributes each hunk as coming from the session or not.
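For what it's worth, here is a minimal sketch of that kind of attribution. It does not use any real OTel API: `Hunk`, `session_id`, and `survival_rate` are hypothetical names, and matching lines by exact text per file is a simplifying assumption (a real implementation would presumably track positions and moves). The idea is to count how many session-generated lines actually survive into the merged diff, rather than how many were "accepted".

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Hunk:
    """One chunk of changed lines. All field names here are hypothetical."""
    path: str
    lines: tuple[str, ...]
    session_id: str | None = None  # None means no agent session produced it


def survival_rate(generated: list[Hunk], merged: list[Hunk]) -> float:
    """Share of session-generated lines that still appear in the merged diff."""
    merged_lines = {(h.path, line) for h in merged for line in h.lines}
    produced = [
        (h.path, line)
        for h in generated
        if h.session_id is not None
        for line in h.lines
    ]
    if not produced:
        return 0.0
    kept = sum(1 for item in produced if item in merged_lines)
    return kept / len(produced)


# Auto-accept scenario: four lines were "accepted", two were later rewritten.
generated = [Hunk("app.py", ("a = 1", "b = 2", "c = 3", "d = 4"), "session-1")]
merged = [Hunk("app.py", ("a = 1", "b = 2", "e = 5"))]
rate = survival_rate(generated, merged)  # 0.5, even though the accept rate was 100%
```

With auto-accept on, the naive accept rate for this session is 100%, while the survival rate is 50%, which is the gap the parent comment is describing.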
My feeling is it's not as bad of a metric as people think. Companies don't fully know the best way to use AI and things are changing rapidly, so you want people using a lot of tokens even on stuff that seems maybe kind of dumb on the surface, because if you find one useful thing and share it in the org that makes up for a lot of failures.
But I do think you also need to say, "To be clear, don't game the system. Any token usage that is even remotely justifiable as useful for the business is fine, and we will give you a lot of latitude. But if you're in the top 10% of token users, we are going to review your token usage, and if we find that you have a dozen agents perpetually running writing slam poetry, you're going to get fired."
NVidia will probably sue you for doing that, though.
Remember that the entire mantra of "productivity is a measure of how many shovels you break and replace" is only ever echoed by the one selling the shovels.
That sawdust analogy is fantastic!
We may be on the cusp of the AI age's new era of 'measure twice, cut once'.
Suddenly, LoC returned
With the rise of agentic coding, this has become a sign of quality for me in my own PRs and reviews: New features implemented in less than a thousand lines of productive code.
When I'm working on code that was heavily vibecoded, most of my PRs reduce LoC by a couple hundred lines while fixing bugs or implementing a new feature.
My job kind of feels like being a garbage man, luckily my current employer appreciates it. Personally I think the current style of vibecoding only kinda works, because models are getting better fast enough to keep the shitpile from overflowing completely. Betting on the harnesses + models getting good enough to clean up after themselves is a bet, and I don't like gambling, but even I admit the odds don't seem to be bad.
Slowly and then suddenly :)
""" Steve Ballmer: In IBM there's a religion in software that says you have to count K-LOCs, and a K-LOC is a thousand lines of code. How big a project is it? Oh, it's sort of a 10K-LOC project. This is a 20K-LOCer. And this is 50K-LOCs. And IBM wanted to sort of make it the religion about how we got paid. How much money we made off OS/2, how much they did. How many K-LOCs did you do? And we kept trying to convince them - hey, if we have - a developer's got a good idea and he can get something done in 4K-LOCs instead of 20K-LOCs, should we make less money? Because he's made something smaller and faster, less KLOC. K-LOCs, K-LOCs, that's the methodology. Ugh anyway, that always makes my back just crinkle up at the thought of the whole thing. """
From https://www.pbs.org/nerds/part2.html
So many times in my career I have seen a problem that could be handled with two lines of code and a table lookup being handled with 40 lines of code and a switch statement. So the guy writing the 40-line switch statement would get paid 20 times more money!
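To make the contrast concrete, here is a small made-up example (the function names and the choice of HTTP status codes are mine, purely for illustration). Both functions return the same answers; the branchy one grows by two lines for every new case, while the table version stays at two lines of logic plus data.

```python
# Branch-per-case style: every new status code adds another branch.
def reason_branchy(code: int) -> str:
    if code == 400:
        return "Bad Request"
    elif code == 404:
        return "Not Found"
    elif code == 500:
        return "Internal Server Error"
    else:
        return "Unknown"


# Table-lookup style: the cases live in data, the logic is one lookup.
REASONS = {400: "Bad Request", 404: "Not Found", 500: "Internal Server Error"}


def reason_lookup(code: int) -> str:
    return REASONS.get(code, "Unknown")
```

Paying by line count rewards the first version, which is the whole problem.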
> I have the constant threat hanging over my head of being fired if I don't churn out code quickly enough.
And the tragedy is that this isn't sustainable, and all of us deeply involved in tech know this. Eventually there is going to be a big reality check that the companies will have to pay for, because you can't force creativity and quality, not even with AI, because actual intelligence lies with us, at least for now and for the foreseeable future. However, when the rope eventually snaps, these executives will at best fall upwards, with big severance bonuses and a list of "contributions" we have to be grateful for. We are the ones who will suffer through the next big layoffs.
Unfortunately, I think this is correct. Such as it ever has been with technological change. The folks at the bottom bear the brunt of the dislocation and the folks at the top pat themselves on the back for being so forward looking and get huge payouts regardless of the actual results. Further, the folks at the top are always incentivized to go along with the herd of their peers because if it works then they were on the bandwagon, and if it doesn’t work, well then, how could they have known because “Everyone was deceived.”
> because if it works then they were on the bandwagon, and if it doesn’t work, well then, how could they have known because “Everyone was deceived.”
They call themselves "risk takers" to justify their high pay.
They are far too busy for that. They have pr people to say it for them.
Example from one of the wealthiest companies in existence, for one of its most strategic products: I was trying gemini-cli on some MCP servers just yesterday, with gemini-chat helping me configure everything. In less than 10 minutes, I stumbled upon 3 or 4 different bugs. Eventually, even gemini-chat recommended that I throw gemini-cli in the bin and move on to another agent... That's the new norm.
How much creativity do you need to fix bugs in corporate code? Almost zero. It’s maintenance, not creative work. Nothing against it, it’s needed, but let’s be real, would anybody be really sad if this work is overtaken by LLMs? I certainly won’t be, let them do it.
> How much creativity do you need to fix bugs in corporate code? Almost zero.
Have you seen the state of current corp software? I'd say a lot of creativity is still very much needed. Let's see how long this is sustainable.
> would anybody be really sad if this work is overtaken by LLMs?
I'd not be sad about the job itself, but about the dev who has a mortgage to pay and has now been replaced by a machine churning out crap code while their superiors get sore from patting themselves on the back.
IBM System/360 OS had more than 50,000 bugs which could not be fixed because fixing any single bug would introduce two new bugs. I fear that a lot of AI software systems will reach the same crapware state as IBM System/360 very very soon!
I know from personal experience that once you fix a bug introduced by Claude, Claude tries to recreate the bug every time he edits that code again!!
Anyone (including ANTHROP\C) "recommending selective use of cheaper models" is spending costly human time (which costs more over time) on correcting the machine (which costs less over time). This is a bad trade.
In cost per line of code, we have verified this is always an error unless your time is worth less than the machine (unlikely unless you consider your time to have no cost rather than considering it as your hourly rate).
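A rough sketch of the comparison being made here, with entirely made-up numbers (the token costs, hours, and hourly rate are assumptions, not measurements). Whether the cheaper model really loses depends on how much correction time it actually adds.

```python
def total_cost(token_cost: float, human_hours: float, hourly_rate: float) -> float:
    """Token spend plus the human time spent prompting, reviewing, and correcting."""
    return token_cost + human_hours * hourly_rate


def break_even_hours(extra_token_cost: float, hourly_rate: float) -> float:
    """Human hours the pricier model must save to pay for its extra tokens."""
    return extra_token_cost / hourly_rate


# Hypothetical task: the cheap model needs three hours of correction,
# the expensive one needs one hour, at an assumed $100/hour.
cheap = total_cost(token_cost=2.0, human_hours=3.0, hourly_rate=100.0)
strong = total_cost(token_cost=20.0, human_hours=1.0, hourly_rate=100.0)

# The extra $18 of tokens pays for itself after about 0.18 hours saved.
needed = break_even_hours(extra_token_cost=18.0, hourly_rate=100.0)
```

Under these assumed numbers the cheap model costs $302 in total and the expensive one $120, and the break-even is roughly eleven minutes of saved human time.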
The worst thing for our productivity has been Claude Code or Claude Cowork taking a complex problem and turning around and writing bad instructions for dumb model agents then synthesizing the dumb answers into an orchestra of badness.
The single best fix for results-per-total-cost is to ensure it reads and thinks about whole content, not snippets, and thinks with the smartest model, not agents.
Agents should toil. Agents should neither think*, nor decide what to think about which itself is thinking.
* Agents should “think” like ants or bees or beavers think. Any human-like thinking, *especially* intuition-like thinking, should be thought by the best model available.
** Nobody should be “churning out code”. In a hierarchy of coders who translate detailed specs to some computer language, developers who write software that ships on a project timeline, and engineers who accomplish business goals, engineers should “churn out” engines structured for business outcomes.
Measured by that, the machine is leverage while reducing a variety of costs. At the same time, because most training data doesn't grok this, the machine doesn't grok it either. So it needs you to shape its toil.
I disagree heartily with everything here, both in personal experience from the models, and in values about coding.
I don't care about cost, I care about getting good results fast.
Cost per line of code is not a suitable metric for anything. It's as silly as measuring engineers' performance by lines of code. More lines of code is worse than fewer lines of code. When you say "we have verified" whoever that "we" is makes a big difference, but you're posting pseudonymously, how are we to even guess at that "we"?
I get better results with some older cheaper models, faster. In particular older Claude models than Opus 4.7. Maybe the more expensive model churns out more lines, more complexity faster. That is a worse outcome for me. The complexity must be avoided at all costs. The simpler, smaller, answer is always better, and scales to bigger code bases. The more the model guesses at intent rather than checking intent, the more the model is clever rather than clear and simple, the worse the outcome, the more that the model turns into an architecture astronaut, the worse the outcome.
Yes, cost per line of code itself is an error.
Only cost for effective* outcome matters. And if your lines of code have a cost, you would want fewer lines of code to achieve the outcome, not more.
Are you sure you disagree with that?
* If your place of work starts talking "efficiency"**, run. Find somewhere the conversation is *effectiveness* — at the goal/outcome level.
** Not to mention that "efficiencies" is MBA speak for "right sizing" away effectiveness.
I’d point out that smaller and simpler also makes their router code easier to review and that fewer lines will have fewer bugs (on average) and those bugs will be more obvious. But then, I’m old school and won’t let an AI work on code without reviewing it, and I mostly write code by hand.
Too many people see wages as a sunk cost and a constant. One problem though is AI costs per task are unpredictable, and management tends to prefer predictable outcomes over optimal outcomes.
> The single best fix for results-per-total-cost is to ensure it reads and thinks about whole content, not snippets, and thinks with the smartest model, not agents.
I haven't seen "just absorb a giant ball of context and do the right thing the first time" be cracked yet, even for Opus 4.7.
At the end of the day, code is code, and we have decades of lessons about how to make code more reliable and maintainable. Composable small modules, not god methods, are still the way to go, and they reward devs who use them to get focused context for agents with faster - and often better - results.
> I haven't seen "just absorb a giant ball of context and do the right thing the first time"
Exactly.
No more than sitting down and writing code before a product concept or spec or architecture comes out right the first time, or fifth.
Absorb the concept, make a shape of outcome, then a spec, then hold its hand to architect a series of iterations, either component by component or thin vertical slice or whatever combination lets you iterate in working increments...
Your brain, machine leverage. After all, it types faster than you. But it should type what you want.
You know what it should type, right? If you don't, you're gonna have a bad time anyway.
If you have such a toxic environment, run.
If you’re sitting under a tree in the rain and it gets soaked through and you start getting wet, finding another tree won’t help you.
The whole industry is adjusting to the reality that the expected output of an engineer is much higher than it used to be. It’s not local to one company. You may find a better environment for the time being, but this is the direction everything is headed.
I don’t disagree that the expectations are higher, but token output hardly correlates to code output worthy of merging.
It doesn’t necessarily mean shipping faster either. Speeding up code production doesn’t mean it speeds up qa, compliance, and the litany of other things. Everyone seems to forget Amdahl’s law.
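A quick sketch of the Amdahl's law point, with hypothetical numbers: assume writing code is 30% of the time from idea to production, and AI makes that part 10x faster. The 30% and the 10x are assumptions for illustration only.

```python
def overall_speedup(fraction_sped_up: float, factor: float) -> float:
    """Amdahl's law: speedup of the whole when only a fraction gets faster."""
    if not 0.0 <= fraction_sped_up <= 1.0:
        raise ValueError("fraction_sped_up must be between 0 and 1")
    if factor <= 0.0:
        raise ValueError("factor must be positive")
    return 1.0 / ((1.0 - fraction_sped_up) + fraction_sped_up / factor)


# Coding is an assumed 30% of delivery time and becomes 10x faster.
delivery = overall_speedup(0.30, 10.0)  # about 1.37x overall

# Even if coding became effectively free, QA, compliance, etc. cap the gain.
ceiling = overall_speedup(0.30, 1e9)  # approaches 1 / 0.7, about 1.43x
```

So a 10x faster coding step buys roughly a 1.37x faster delivery under these assumptions, and no amount of extra tokens gets past about 1.43x unless the other 70% also speeds up.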
Code quality matters to engineers. Find a senior manager who cares. Or worse, find a customer who cares.
While they obviously want a high quality product, no outages, a responsive system etc, I don’t think they necessarily understand why you need to avoid creating god-objects, need to reason about abstractions, etc.
Code quality also exists on different axes. I've seen the case where code quality was poor in some aspects, e.g., tons of technical debt, coupling making it difficult to make changes, but overall product quality was very high. It had to be: it was a medical device.
Most environments only care about the output. In the case I'm thinking of, Software made it perfectly clear to Management, most of whom were former engineers, that the product desperately needed redesign in some ways. But as long as the cost of that redesign exceeded the cost to get the next version out, it could be postponed. This went on for years.
Code quality directly correlates to everything you describe.
Yea, but I’m not sure customers or mgmt get that
As code quality goes down, so does productivity, as it becomes more difficult to add new features and there are more bugs introduced.
Nobody cares until the code gets so twisted in knots that bugs and security issues predominate.
Exactly. As long as poor code quality doesn't make a difference in the actual usage of the product, no-one but the engineers will care.
On a task by task basis the code Claude generates is pretty good these days. The biggest issue I see is that it wants to rearchitect the code constantly and I have no faith in my tests anymore because Claude will just "fix" them
I think some tests should be considered to be part of the specification rather than the product.
That’s why they said they optimize for effective output at the cost of higher token use. They didn’t say they intend to have high token use; instead they implied it’s a second-order effect of seeking more effective output.
They don't care about quality as long as it works enough. It's a clown show all the way through.
As one that does, it’s a difficult discussion to have with the executives. My peers look like their teams are producing more than my teams are and any argument along the lines of “but their code sucks” isn’t going to hold water. The executives care, but until there’s actual impact from poor quality, it won’t matter, and that’s a lagging metric. Many still don’t care about technical debt, and that’s been well understood in the industry for a while.
It’ll take production incidents, impacted customers, and brand damage to make the executives start to prioritize quality over quantity again.
*the whole industry in countries without strong worker rights
American software engineers are paid commensurately more than equivalent roles in countries with strong worker rights. There is no free lunch.
Besides, it's probably counterproductive in the long run to think of strong worker rights as being opposed to the employer wanting higher productivity out of the worker.
Well, if we are talking the worldwide software development industry, FAANG-like salaries are a tiny exception. There are so many places without strong worker rights and without a high premium for workers.
The expectation of higher productivity measured by completely useless means, letting a highly qualified employee jump through hoops for the amusement and misconceptions of the C-level.
It’s too bad that, yet again, instead of the productivity gains leading to shorter work weeks, the benefits accrue to the companies. Just once I’d like to see productivity gains lead to more leisure time, not higher expectations.
Be careful what you are wishing for. All the leisure time you would want while having no job or money could be the future we are heading for.
Fair point. Though I don’t think time without money is really leisure time :)
Maybe once we get universal income we can start recommending this. Until then the individual isn't to blame when the only option to keep providing is to keep grinding in a toxic environment.
But I'd agree that everyone can start planning a career shift that'll span a few months to some years in order to seek better working conditions. Passively accepting all work degradation because that's life and money is needed is partly responsible for the current situation too.
Where to, that's the question. The economy is in the gutters and the replace-people-with-AI craze is making the issue even worse.
Perhaps for now. But you know, after working solid with AI for two years and adopting effective methods using detailed plans, and having a lot of success with it, here is the problem:
Coding faster leads to less understanding and higher long-term risk. Source-Code amnesia is real, and there’s a time requirement to really understand and appreciate what a system is actually doing.
I’ve been able to implement very large features using frontier models, but the code needs to always be revisited.
AI can do two things: find vulnerabilities, and prototype code. It cannot design software, and any appearance of such is an illusion at best.
We don’t need to produce faster to be successful, we need to create better, long lasting products.
> Coding faster leads to less understanding and higher long-term risk. Source-Code amnesia is real, and there’s a time requirement to really understand and appreciate what a system is actually doing.
This is why I have switched nearly all of my personal coding experiments over to Qwen3.6 27B. Opus makes it easy to gloss over too much and to delegate too much. And so I don't build sufficient memory of the code to provide long-term oversight.
But Qwen3.6 27B sits at a really interesting balance point. It understands code well enough to get 80% of the way to a good design, and it can fully implement a well-specified feature. But if my understanding of the code starts to weaken, things start going wrong much more quickly than they do with Claude.
Opus will happily take complex code beyond the point of salvation, if you allow it. I'm currently cleaning up a successful prototype code base, one that was partially vibe-coded and now needs to be put into production. And Opus generated massive amounts of tech debt. So clearly people who lean into vibe coding will need to keep upgrading their models for many years to keep up with the mess created by earlier models.
Strong agree (although I'm on Qwen3.6-35B-A3B, with 6-bit quant.). If you're a programmer, it gets the job done. When I occasionally don't want to care about the code, I switch over to DeepSeek V4 Pro.
Opus is relegated to the planning / design phase.
> It cannot design software, and any appearance of such is an illusion at best.
Have you tried Claude Opus 4.7?
Yes, I use Opus 4.7 regularly as my daily AI tool. It can do incredible things for sure, but more in the sense of pure intellect, not so much in “emotional” or “creative” intelligence.
For example, you might have a great design/architecture session and then run out of context. The next agent tries to piece things together from fragments of conversation and such. But it often starts going off on tangents, searching too broadly to understand, and missing cues and nuance, all the while burning tokens.
As other articles have put it: AI makes doing the easy things easier and the hard things harder. Because hard things require creativity.
To bring this back to the original post: companies need people, and they shouldn’t expect that they can fire half their workforce and replace it with AI. Quite the contrary. The faster companies move with AI, the more technical debt they’ll end up with; it’s a guarantee.
“If you want to travel fast, go alone. If you want to travel far, go together.”
Now as you can see from the article, it starts turning. People are getting less pricey than agents on API pricing.
Copilot switches to API pricing starting next month (let's see how long it will last for our $39, and $19 since September), Anthropic switches all corps into API based pricing. From the most popular choices I think only Codex didn't switch yet (although it is hard to tell because I don't know their enterprise pricing).
The Chinese models are going to look really attractive.
I have DS-V4-Pro agents pretty much running 24/7. The cost is inconsequential. The same cannot be said for anything from Anthropic.
> The economy is in the gutters
Consumer sentiment is in the gutters certainly. But objective measures of the economy like unemployment and real wages look good to excellent
https://fred.stlouisfed.org/series/UNRATE
https://fred.stlouisfed.org/series/LES1252881600Q
What is in the gutters is memories of 2008-2010. That was the last time folks experienced a bad economy. I remember Ed Elson saying something along the lines of "who cares about employment, what matters is inflation". Sure, if you're 27, you haven't got a clue what a bad economy looks like.
Unemployment and CPI: the most false statistics on the planet. Instead, look at the employment-population ratio for ages 25-54, and core inflation. That will FREE YOUR MIND.
It's easy when you can just lie. The data from phone surveys is increasingly divergent from the delayed payroll data.
> But objective measures of the economy like unemployment and real wages look good to excellent
Oh hell no, ever since the tail end of Biden the trend for unemployment is showing upwards when corrected for seasonal effects [1], and for real wage growth the situation has been worse for an even longer time [2] - if not for the effects of the post covid stimulus packages plus emergency wage raises following the energy cost explosion thanks to the Russian invasion of Ukraine.
The story the stonk markets tell is completely decoupled from reality, partially because the AI wash trading bubble keeps distorting the statistics, partially because no matter what the stonk markets only can grow up because pension contributions keep blowing up the market [3]. Not getting that difference was what blew up Biden's reelection and is now screwing over Trump.
[1] https://www.bls.gov/charts/employment-situation/civilian-une...
[2] https://www.atlantafed.org/research-and-data/data/wage-growt...
[3] https://news.ycombinator.com/item?id=48233492
And open positions are simply because someone decided to run from that place
This, I happily used the opus 4.6 fast mode to the tune of 5k for a project. The delivery of the project justified the 5k, if I only spent 500 but delivered the project 1 month later - I would have been in the dog house.
Your project cost $5k in tokens? How does that work? Over what time? My understanding is that most developers are given pro max plans at $200/m and are expected to max that out.
I've been getting by on the $200/year plan by smoothing usage continuously over time.
The pay per use is for the API so does it mean you're using the API in a custom setup?
My real comment is, why were they not just using their self-hosted copies of it? Do they pay back Anthropic for use of it in Azure? Broker a deal, let Anthropic charge you drastically less to use their model AND Anthropic could have made Claude Code work directly with Azure for Microsoft employees. Pennies on the dollar, and Microsoft could do it using low use GPUs to save on cost, or stack underused GPU compute (this is how serverless was born btw - it's the unused resources in a web server somewhere).
When you consider that xAI's old data center was enough to bring Anthropic back ahead, it tells me Microsoft could host their own on underutilized previous gen GPUs that are sitting there wasting server real estate.
> The comments I see recommending selective use of cheaper models don't match the reality I experience working in the industry. I have the constant threat hanging over my head of being fired if I don't churn out code quickly enough. I'm not willing to gamble with my livelihood by using a less effective model.
I don't buy it. Old models such as GPT4.1 were faster than newer reasoning models, and their output was as good. Newer models end up wasting an ungodly amount of time with chain-of-thought steps which can be a complete waste of time if you have a structured prompt such as a plan or a spec.
My experience in the real world is that users have to ration requests, and x0 models actually tend to be used far more because expensive models are left for more complex tasks.
Are you saying you found GPT 5.5 to be as good as 4.1 for coding?
This, if you’re high performing, the company won’t question your use of tokens. If they want to limit it, they have ways to set limits on spend and usage.