I think quantifying tokens used is analogous to quantifying the amount of sawdust generated on a construction site.

Using more tokens per unit time doesn't get useful code out the door any faster. Most non-technical leaders can grasp this one and are likely more interested in the game-theoretic dynamics that implied token consumption expectations force on developers (competition between them).

If you want to hold out as long as possible and don't really care about anything other than the compensation package, you should at least play along with this new game in a half-assed manner. Try to goldilocks your token usage between any established extremes. You want to be in the statistical barycenter of every AI report that management can create.

To understand the token count thing: spending tokens is necessary but not sufficient to demonstrate that you are adopting AI.

Where we were 6mo ago is that a lot of big orgs realized they were behind, and needed some way of measuring if the tools were usable at all.

No sawdust at all on your job site, and you can tell nobody is cutting wood.

Now that tooling is more mature, you can measure things like % of diffs AI-generated, % of AI suggestions accepted vs edited, % of KB queries successful, etc., all of which are more useful than raw token count for quantifying how your org is using the tool.

So it’s a pragmatic metric that got a bit Goodharted.

> % of AI suggestions accepted vs edited

this has to be the worst metric.

anytime the LLM wants me to read a diff of one file, I'm just gonna send it forward so I can read the whole diff

No sawdust is bad. But it's also bad if you cut all your boards into sawdust. Completely. Obliterated. No useful output, only sawdust.

% of AI suggestions accepted vs. edited is also a BS metric that Anthropic et al. like to push, similar to LoC, because they're large numbers and large numbers must be good, right?

Well guess what, I have auto-accept on and then adjust after it's "done". And I do it by telling it what changes to make, and those have auto-accept on as well. That's quite a high "accept" rate, by definition. But in reality, those follow-up changes may have churned 50% of the lines it generated and auto-accepted the first time.
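
To put made-up numbers on that, here's a rough Python sketch. The `Session` fields and both function names are hypothetical, not anything a vendor is known to report; the point is only that "accept rate" and "how much generated code survived" are different quantities.

```python
from dataclasses import dataclass


@dataclass
class Session:
    # All fields are illustrative assumptions, not a real telemetry schema.
    suggestions_shown: int
    suggestions_accepted: int
    lines_generated: int
    lines_later_rewritten: int


def accept_rate(s: Session) -> float:
    """Share of suggestions that were accepted (what the naive metric reports)."""
    return s.suggestions_accepted / s.suggestions_shown


def survival_rate(s: Session) -> float:
    """Share of generated lines that were not rewritten afterwards."""
    return (s.lines_generated - s.lines_later_rewritten) / s.lines_generated


# Auto-accept on: every suggestion is "accepted", but half the lines get redone.
s = Session(
    suggestions_shown=40,
    suggestions_accepted=40,
    lines_generated=1000,
    lines_later_rewritten=500,
)
print(accept_rate(s))    # 1.0
print(survival_rate(s))  # 0.5
```

With auto-accept on, the first number is 100% by construction; the second depends entirely on how much got redone afterwards.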

> % of AI suggestions accepted vs. edited is also a BS metric

I disagree. It’s a valuable metric if you are building an agent / skill infra layer.

Think of it like error rate on your API. Green metric does not mean your system is healthy, but if it’s red you have an issue you definitely need to fix.

Your example scenario is detectable in the non-naive implementation anyway; the o11y layer (usually OTel these days) tracks the trajectories, links them to the diff, and attributes each hunk as coming from the session or not.
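
For what it's worth, the attribution step can be sketched in a few lines of Python. This is a toy under loud assumptions: `attribute_hunks`, the 0.8 threshold, and exact line matching are all made up for illustration. It doesn't touch any real OTel API, and a real pipeline would presumably match on something sturdier than stripped line text.

```python
def attribute_hunks(hunks, session_lines, threshold=0.8):
    """Label each diff hunk as coming from the agent session or from a human.

    hunks: list of hunks, each a list of added lines from the final diff.
    session_lines: lines the agent wrote, as recovered from its trajectory.
    threshold: hypothetical cutoff for the share of lines that must match.
    """
    known = {line.strip() for line in session_lines}
    labels = []
    for hunk in hunks:
        added = [line.strip() for line in hunk if line.strip()]
        if not added:
            labels.append("unknown")  # nothing to compare against
            continue
        hits = sum(1 for line in added if line in known)
        labels.append("session" if hits / len(added) >= threshold else "human")
    return labels


session_lines = ["x = 1", "y = 2", "z = 3"]
hunks = [
    ["x = 1", "y = 2"],   # untouched agent output
    ["z = compute()"],    # rewritten by hand after the session
]
labels = attribute_hunks(hunks, session_lines)
print(labels)  # ['session', 'human']
```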

Not the one down-voting you btw. Disagreeing is fine by me.

I would ask you tho: What incentive do AI vendors have to even try and detect this? It's in their interest to use the most naive interpretation, i.e. what my original comment mentioned, as it shows how "good" their models are, coz nobody ever changes much if anything ;)

Never mind that they really can't unless they're going "creepy mode". If I use Claude/Codex et al. to agentically write something, then let the session just sit while I go about in my IDE changing things and then I commit and push, are you telling me that the vendors do or should track all of the changes made to the files they touched and report back to base what got overridden by me, the human?

My feeling is it's not as bad of a metric as people think. Companies don't fully know the best way to use AI and things are changing rapidly, so you want people using a lot of tokens even on stuff that seems maybe kind of dumb on the surface, because if you find one useful thing and share it in the org that makes up for a lot of failures.

But I do think you also need to say, "To be clear, don't game the system. Any token usage that is even remotely justifiable as useful for the business is fine, and we will give you a lot of latitude. But if you're in the top 10% of token users, we are going to review your token usage, and if we find that you have a dozen agents perpetually running writing slam poetry, you're going to get fired."
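
The review trigger itself is trivial to sketch. Assumptions: `usage` is a hypothetical map of user to tokens spent in some period, and `top_percent=10` mirrors the 10% figure above. What happens during the review is the human part and isn't modeled here.

```python
def users_to_review(usage: dict[str, int], top_percent: int = 10) -> list[str]:
    """Return the heaviest token users, highest first."""
    # Integer arithmetic avoids float rounding; always review at least one person.
    count = max(1, len(usage) * top_percent // 100)
    ranked = sorted(usage, key=lambda user: usage[user], reverse=True)
    return ranked[:count]


# 20 made-up developers; dev20 spends the most.
usage = {f"dev{i}": 1_000 * i for i in range(1, 21)}
flagged = users_to_review(usage)
print(flagged)  # ['dev20', 'dev19']
```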

NVIDIA will probably sue you for doing that, though.

Remember that the entire mantra of "productivity is a measure of how many shovels you break and replace" is only ever echoed by the one selling the shovels.

That sawdust analogy is fantastic!

We may be on the cusp of the AI age's new era of 'measure twice, cut once'.

Suddenly, LoC returned

With the rise of agentic coding, this has become a sign of quality for me in my own PRs and reviews: New features implemented in less than a thousand lines of productive code.

When I'm working on code that was heavily vibecoded, most of my PRs reduce LoC by a couple hundred lines while fixing bugs or implementing a new feature.

My job kind of feels like being a garbage man; luckily my current employer appreciates it. Personally I think the current style of vibecoding only kinda works because models are getting better fast enough to keep the shitpile from overflowing completely. Counting on the harnesses + models getting good enough to clean up after themselves is a bet, and I don't like gambling, but even I admit the odds don't seem to be bad.

Slowly and then suddenly :)

""" Steve Ballmer In IBM there's a religion in software that says you have to count K-LOCs, and a K-LOC is a thousand line of code. How big a project is it? Oh, it's sort of a 10K-LOC project. This is a 20K-LOCer. And this is 50K-LOCs. And IBM wanted to sort of make it the religion about how we got paid. How much money we made off OS 2, how much they did. How many K-LOCs did you do? And we kept trying to convince them - hey, if we have - a developer's got a good idea and he can get something done in 4K-LOCs instead of 20K-LOCs, should we make less money? Because he's made something smaller and faster, less KLOC. K-LOCs, K-LOCs, that's the methodology. Ugh anyway, that always makes my back just crinkle up at the thought of the whole thing. """

From https://www.pbs.org/nerds/part2.html

So many times in my career I have seen a problem that could be handled with two lines of code and a table lookup being handled with 40 lines of code and a switch statement. So the guy writing the 40-line switch statement would get paid 20 times more money!
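
A toy Python version of that contrast, with made-up region names and rates (an if/elif chain stands in for the switch statement, and the real-world version would be far longer):

```python
# Table lookup: the data lives in one place and the logic is one line.
RATES = {"US": 5.0, "EU": 7.5, "APAC": 9.0}


def rate_lookup(region: str) -> float:
    return RATES[region]


# Branch chain: same behavior, but every new region costs more lines.
def rate_branches(region: str) -> float:
    if region == "US":
        return 5.0
    elif region == "EU":
        return 7.5
    elif region == "APAC":
        return 9.0
    else:
        raise KeyError(region)


same = all(rate_lookup(r) == rate_branches(r) for r in RATES)
print(same)  # True
```

Both functions return the same answers; only one of them looks productive under a per-line metric.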