I think veteran engineers have always known that the real problems with velocity have always been more organizational than technical. The inability of the business to define a focused, productive roadmap has always been the problem in software engineering. Constantly jumping to the next shiny thing that yields almost no ROI, while never allowing systemic tech debt to be addressed, has crippled many companies I have worked at in the long term.
> The inability of the business to define a focused, productive roadmap has always been the problem in software engineering.
Agreed, and I also agree that most developers come to this realization with time and experience. When you have a clear understanding of business rationale, scope, inputs, and desired outputs, the data models, system design, and code fall out almost naturally. Or at least they become much more obvious.
For veteran engineers that might be true. But for a junior engineer pre-AI, velocity has always been technical. I know junior engineers who, after a whole year of writing C++, still don't grok std::unique_ptr, and they consistently have the least velocity on their whole teams. When I used to write performance reviews for junior engineers, their performance really was dominated by their velocity, roughly measured in lines of bug-free code written within a time period. A good junior engineer would be given a clearly defined feature and write good code quickly, whereas a weaker junior engineer would be given the same thing and either write code slowly or write buggy code quickly that then required extensive debugging and rewriting.
Being a junior engineer was a good time!
> I think veteran engineers have always known that the real problems with velocity have always been more organizational than technical.
I don't think this comment is fair or grounded. There are plenty of process bottlenecks that are created by developers. Unfortunately I have a hefty share of war stories where a tech lead's inability to draft a coherent and clear design resulted in project delays and systems riddled with accidental complexity required to patch the solution enough to work.
Developers are a part of the process, and they participate in both the good parts and the bad. If business requirements are not clear, it's the developer's job to work with product owners to arrive at that clarity.
> Unfortunately I have a hefty share of war stories where a tech lead's inability to draft a coherent and clear design resulted in project delays and systems riddled with accidental complexity required to patch the solution enough to work
This is also an organizational problem (bad hiring/personnel management). If you put an incompetent individual at the helm of a project, then resources (especially time) will be spent horrendously and you will have more problems down the line. That's true for all types of organizations and projects.
- Systemic tech debt is now addressable at scale with LLMs. Future models will be good enough to sustain this; if people don't believe that, I would challenge them to explain why. First consider whether you understand what scaling laws (like Chinchilla) are and how RL with verification works fundamentally.
- I completely agree with you that the fundamental limitation is the business being able to coherently articulate itself and its strategy.
- BUT the benefit now is you can basically prototype for free. Before, we had to be extremely careful with engineer headcount investment. Now we can try many more things under the same time constraints.
The problem with tech debt is not that it is some poorly designed code in a few repositories that can just be changed. True tech debt is the kind that requires significant architectural changes across many systems and is almost always coupled with major data migrations. You need the rest of the business to agree that you want to invest all that time and energy to fix a problem someone else created 10 years ago. You likely will also need other teams to set aside time on their own road map to address it. You also might need customers to change what they are doing because if software lets you do something, you can guarantee that someone has learned to do it - even if that 'something' was actually a bug.
LLMs don't solve any of those problems by themselves.
> BUT the benefit now is you can basically prototype for free.
But... so can your competitors. And that changes the value proposition.
How do you mean?
> systemic tech debt is now addressable at scale with LLMs.
Is there any reason to believe this? I've only seen the evidence of the contrary so far.
My experience with AI coding aids is that they, generally:
1. Don't have an opinion.
2. Are trained on code written using practices that increase technical debt.
3. Lack the bigger-picture perspective, being more focused on the concrete, superficial, and immediate.
I think I need to elaborate on the first point and explain how it's relevant to the question. I'll start with an example. We have an AI reviewer and recently migrated a bunch of the company's repositories from Bitbucket to GitLab. This also prompted a bunch of CI changes. Some Python projects that I'm involved with, but don't have much authority over, switched to complicated builds that involve pyproject.toml (often including dynamic generation of this cursed file), as well as integration with a bunch of novelty (but poor-quality) Python infrastructure tools used for building distributable Python artifacts.
In the projects where I do have authority, I removed most of the third-party integrations. None of them use pyproject.toml, setup.cfg, or any similar configuration for a third-party build tool. The project code contains bespoke code to build the artifacts.
These two approaches are clearly at odds. A living, breathing person would believe either one approach or the other to be the right one. The AI reviewer had no problem with this situation. It made some pedantic comments about style and about fantasy error cases that can't actually occur, but completely ignored the fact that, moving forward, these two approaches are bound to collide. While it appears to have an opinion about the style of quotation marks, it doesn't care at all about strategic decisions.
My guess as to why this is the case is that such situations are genuinely rarely addressed in code review. Most productive PRs, from which an AI could learn, are designed around small, well-defined features in a pre-agreed-upon context. The context is never discussed in PRs because it's impractical (it would usually require too large a change, so the developers don't even bring up the issue).
And this is where the real, large, glacier-like deposits of tech debt live: the issues developers are afraid to mention because they understand they will never be given the authority and resources to deal with them.
You are not wrong about anything you're saying, but like I said, this misses the forest for the trees. I'm talking about roughly the next ~2 years. There is a common idea that we don't understand this technology or what will happen performance-wise. We know a lot more about what's going to happen than people think, because none of this is new. We've known about neural nets since the 40s; we know how RL works on a fundamental level, and it has been an active and beautiful field of research for at least 30-40 years; and we know what happens when you combine RL with verifiable rewards and throw a lot of compute at it.
One big misconception is that these models are trained to mimic humans and are limited by the quality of the human training data. This is not true, and that is basically almost the entire reason why you see so much bullishness and premature adoption of agentic coding tools.
Coding agents use human traces as a starting point. Technically you don't have to do this at all, but that's an academic point; practically (today), you can't. The early training stages with human traces (and also verified synthetic traces from your last model) get you to a point where RL is stable and efficient, and RL pushes you the rest of the way. It's synthetic data that really powers this, via rejection sampling: you generate a bunch of traces, figure out which ones pass verification, and keep those as training examples.
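To make that loop concrete, here is a minimal sketch of verification-filtered rejection sampling (the `model.sample` interface, the function names, and the unit-test-style verifier are hypothetical stand-ins, not any particular lab's pipeline):

```python
def generate_candidates(model, prompt, n=8, temperature=1.0):
    """Sample n candidate solutions (traces) from the current model."""
    return [model.sample(prompt, temperature=temperature) for _ in range(n)]

def passes_verification(candidate, tests):
    """Verifiable reward: run the candidate against its tests, pass/fail only."""
    try:
        return all(test(candidate) for test in tests)
    except Exception:
        return False

def rejection_sample_dataset(model, tasks):
    """Keep only verified traces; these become training examples for the next round."""
    kept = []
    for prompt, tests in tasks:
        for candidate in generate_candidates(model, prompt):
            if passes_verification(candidate, tests):
                kept.append((prompt, candidate))  # (input, verified output) pair
    return kept
```

The point is that the quality ceiling comes from the verifier and the amount of compute spent sampling, not from the human traces used to bootstrap.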
So, because:
- we know how this works on a fundamental level, and have for some time
- human training data is a bootstrap, not a fundamental limitation
- you are absolutely right about your observations, yet look at where you are today versus, say, Claude Sonnet 3.x; it's an entire world away in about a year
- we have imperfect benchmarks, all with various weaknesses, yet all of them tell the same compelling story; plus you have adoption numbers and walled-garden data, which are the proof in the pudding
The onus is on people who say "this is plateauing" or "this has some fundamental limitation that we will not get past fairly quickly" to justify that claim.
> look at where you are today versus, say, Claude Sonnet 3.x; it's an entire world away in about a year
In the area I work in, I find them to be of very little value, both then and now... I see no real difference. They help with marginal tasks, e.g. they catch typos, or they help new programmers explore the existing codebase faster.
So far, I haven't used a single line of code generated by AI, even though I've seen thousands. Some of them worked to draw attention to a problem, but none solved it successfully. It was all pretty lame.
I see no reason to believe it's going to get better. Waving hands more forcefully isn't helping; there's no argument behind the promise of "it will get better". No reason to believe it will...
But, more importantly, the AI is applied on a level where really important things don't happen. It's automating boilerplate work. It doesn't make decisions about the important parts. Like, in the example above, the AI is not capable of choosing the better strategy: use pyproject.toml, or write code to build Python packages? It's not the kind of decision it's called to make and nobody sensible would trust it to make such a decision because there isn't a clear right or wrong answer; only the future will prove one or the other to be the right call.
If you honestly don't believe there is a major difference between 3.x and 4.7, I don't think there is much anyone will be able to do to convince you. I do find it disappointing when technical professionals are so uninterested in building a real understanding of a fairly complex topic.
> I see no reason to believe it's going to get better. Waving hands more forcefully isn't helping; there's no argument behind the promise of "it will get better".
That's a real bummer to read from someone who sounds like a professional, and not only a professional but someone thoughtful and smart. Thirty years of brilliant work in RL, Bayesian stats, machine learning, and measurement, then trillions of dollars of funding and some of the best talent in the world, and your assertion is "I tried it on my codebase and I didn't like it, and that trumps literally entire fields of mathematics and statistics." I mean, have you heard of the Chinchilla scaling laws? Do you know how RL works? Are you aware of the benchmarks, their strengths and weaknesses? Are you following adoption numbers, or accomplishments like new proofs for unsolved Erdős problems?
> But, more importantly, the AI is applied on a level where really important things don't happen. It's automating boilerplate work.
Your experiences are your experiences; I don't know what work you do or how it gets done, what languages you're working with, etc., but we're literally at the point where the vast majority of code at major tech companies is fully AI-written (not just AI-assisted).
> It's not the kind of decision it's called to make and nobody sensible would trust it to make such a decision because there isn't a clear right or wrong answer
What are you claiming is not fundamentally possible for an AI to do that a human can do here? People make judgement calls on ambiguous problems, taking into account vast amounts of context about the business, dev time, reliability, maintenance, etc.; why do you think AI can't do that?
What's up with the buzzword bragging?
You don't know buzzword A, B, C? Heh, he must be incompetent and know nothing.
The buzzwords mean nothing, really. The math is the same for a stupid or a smart model, because the model is trying to mimic properties of the training dataset.
You can give me the ultimate model architecture that will beat every model in existence and I can still figure out a way to make it perform worse than what's available today, but you're not even doing that, you're just drumming up some old news.
If someone "threatened" me with tech advancements I would be more worried about things like an imminent massive drop in token costs for bigger context windows or other game changers like continual learning where the model internalizes your code base into its weights rather than just keeping it in its context.
It's not buzzword bragging; they are the prerequisites for having a coherent conversation. If someone doesn't know what the Chinchilla scaling laws are, a discussion about "I think things are saturated" is not grounded in anything. It's like sitting around debating quantum mechanics when you don't know the math; it's just meaningless. If these sound like buzzwords, the implication is not "you're an idiot", it's "you are not yet informed on the key basics of the discussion", and that is something you can fix with curiosity and a couple of prompts to ChatGPT to speed up the learning curve. It's not like any of this stuff is gatekept.
> You can give me the ultimate model architecture that will beat every model in existence and I can still figure out a way to make it perform worse than what's available today, but you're not even doing that, you're just drumming up some old news.
Sorry I don’t understand what you’re saying here — what is the old news? You can break new models — yes. What’s the point you are trying to make here?
> If someone "threatened" me with tech advancements I would be more worried about things like an imminent massive drop in token costs for bigger context windows or other game changers like continual learning where the model internalizes your code base into its weights rather than just keeping it in its context.
I also don’t really know the point you’re trying to make here — like token cost drops seem like a good thing? Bigger context window too? Are we saying the same thing here?
> So far, I haven't used a single line of code generated by AI, even though I've seen thousands. Some of them worked to draw attention to a problem, but none solved it successfully. It was all pretty lame.
I find this statement highly suspect. AI coding agents nowadays can spot subtle object lifetime management issues and even dependency lifecycle incompatibilities, and here you are stating you are unable to use them to fix things? How strange.
Not to mention that coding agents excel at creating greenfield projects and migrating whole frameworks.
But if you feel you can't use them then I feel sorry for you.
>- Systemic tech debt is now addressable at scale with LLMs. Future models will be good enough to sustain this; if people don't believe that, I would challenge them to explain why.
Is this some sort of troll attempt? Like, are you fundamentally misunderstanding the problem with tech debt? This is the equivalent of throwing garbage on the floor and expecting professional cleaners to keep your house clean.
You can produce tech debt faster than you can pay it back; that's the core aspect of tech debt. If tech debt were more expensive in the short term than not taking it on, nobody would be doing it.
A labor-saving device doesn't reduce or deal with tech debt, since tech debt is a decision made independently of the competence of the developers. If you have a company with a tech-debt culture, the labor-saving device will just let you accumulate more tech debt until you reach the same level of burden per person.
>First consider whether you understand what scaling laws (like Chinchilla) are and how RL with verification works fundamentally
Honestly, this tells me that you basically understand nothing, not even chinchilla scaling laws and how RL works. Not only are you trying to brute force the problem, you're listing completely irrelevant factors to the problem at hand.
Chinchilla scaling laws are "ancient" by LLM standards. Everyone who designs a model architecture that is supposed to beat the competition is pulling out every trick in the book and then coming up with their own on top of that, and the Chinchilla scaling laws have been done to death in that regard.
Reinforcement learning is also a pretty bad example here, because there is no obvious way to encode a reward function for something as ill-defined as tech debt. You didn't even say "avoid tech debt", which would be actionable to some extent, just "systemic tech debt is now addressable at scale with LLMs". I.e., you're implying that if LLMs were to generate tech debt, you could just keep scaling and produce more of it, solving the problem once and for all, Futurama-style, with ever bigger ice cubes.
- Not a troll.
Both of these lectures misunderstand my point and how things work.
- "Tech debt" is not some special problem…? You accumulate cruft and bad design decisions… you spend tokens to fix them. Is your point that there is always a fundamental tension between spending tokens on new stuff and spending tokens on cleaning stuff up?
> Honestly, this tells me that you basically understand nothing, not even chinchilla scaling laws and how RL works. Not only are you trying to brute force the problem, you're listing completely irrelevant factors to the problem at hand.
That's a very interesting take, because I would say the same thing! RL and scaling laws are not relevant to the performance and capabilities of coding agents? That's something you don't hear every day.
- Chinchilla-like scaling laws are not ancient… people try to derive scaling laws for new paradigms all the time; it is how researchers get their company/lab to invest in scaling up a new idea. No idea what you mean here. Maybe you think I meant "the literal constants from the Chinchilla paper"? No, I mean scaling laws generally; "Chinchilla", due to the impact of that work, is just used as shorthand for them (rough form sketched below). Regardless, scaling laws generally continue to hold, and in fact improve with better architectures, data mixes, and training recipes.
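For concreteness, the Chinchilla reference here is presumably to the parametric loss fit from Hoffmann et al. (2022); this is a rough sketch from memory, so treat the exact constants as approximate:

```latex
% Chinchilla-style parametric loss fit:
%   N = model parameters, D = training tokens
\[
  L(N, D) \;=\; E \;+\; \frac{A}{N^{\alpha}} \;+\; \frac{B}{D^{\beta}}
\]
% The fitted exponents were both roughly 0.3 (alpha ~ 0.34, beta ~ 0.28),
% which is where the "scale parameters and tokens together, about 20 tokens
% per parameter at compute-optimal" rule of thumb comes from.
```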
> Reinforcement learning is also a pretty bad example here, because there is no obvious way to encode a reward function for something as ill-defined as tech debt.
Well, that's a bit of a strong claim to make… I don't agree with it at face value, but even if I did, you don't need to explicitly do RL on tech debt as a specific task… you do RL to build better programming skills generally, which then generalize to many coding tasks.
> You didn't even say "avoid tech debt", which would be actionable to some extent, just "systemic tech debt is now addressable at scale with LLMs".
Tech debt is strategic; why avoid it?
> you're implying that if LLMs were to generate tech debt, you can just keep scaling and produce more of it, solving the problem once and for all Futurama style with ever bigger ice cubes.
I'm saying you can take successively larger and more complex codebases with thorny debt problems and resolve them by spending money on tokens.
You keep scaling and, just like we do today, decide when some tech-debt austerity needs to take place. I'm saying "the guy who built our house of cards over 10 years and left" is no longer as devastating and expensive a problem as it was before.
> the real problems with velocity have always been more organizational than technical
If you go back far enough, to the time when programs were one-offs and everything was written from scratch, I doubt that. https://www.cs.utexas.edu/~EWD/transcriptions/EWD03xx/EWD340...:
“the programmer himself had a very modest view of his own work: his work derived all its significance from the existence of that wonderful machine. Because that was a unique machine, he knew only too well that his programs had only local significance and also, because it was patently obvious that this machine would have a limited lifetime, he knew that very little of his work would have a lasting value”
I think technical debt started to be somewhat of an issue somewhere in the early 1970s, maybe a few years earlier.
> [O]rganizations which design systems (in the broad sense used here) are constrained to produce designs which are copies of the communication structures of these organizations.
— Melvin E. Conway 1967
Any competent engineer should understand that engineering is just the assembly-line side of product development. Deciding when to release which features, bug fixes, etc., and the development and management of the product in general, has always been the real challenge, and a lot of the strategy involved relies on feedback loops that AI cannot speed up. Though at the same time, I do feel like leaders on the business side often scapegoat engineers' speed as an excuse instead of taking responsibility for poor decisions on their end.
I get what you're trying to say, but this is actually a bad picture to defend. Product and engineering should go hand in hand, with each side informing the other. Engineers actually giving a shit about a product will tell product folks about possibilities they haven't even considered; product people caring about engineering will not propose utterly stupid things. And I, for one, can spot when a product is well designed but poorly made, as well as when a product is perfectly crafted yet useless. The sweet spot is both. And even with the speed multiplier of AI, taking pride in the craft and being actually good at it as an engineer makes a night-and-day difference for the final result.
Yes, most places I have worked were hobbled by the organizations being completely idiotic.
Which is why engineers have historically wanted to be left alone to code. Better to be left alone than to deal with insane bureaucracy. But even better than that is working with good bureaucracy. Just, once you know it's insane, there's not really anything that you can personally do about it, so you check out and try to hold onto a semblance of sanity in the realm you have control over, which is the code.
> there's not really anything that you can personally do about it
Small companies/startups don't have insane bureaucracy, and they're hiring.
They also expect you to do 4 roles in one while low-balling you because you are not in the USA.
I wish the reality was more pleasant. It's not.
It's part of the problem, but AI can also crush this on pure lines of code and functionality alone. It can put out 100,000 lines of somewhat decent code in a day. That usually takes months or years of manual coding for a team.
More lines of code doesn't help with adding more constraints to a system without violating the existing ones.
In fact, it makes it harder.
It’s not just verbose code. I’m talking about 100,000 lines of relatively decent feature code that isn’t bloated.
There is a reason that kLOC/FP were rightly abandoned as meaningful metrics years ago. The same clown show seems to be resurging with "tokens". There is, in my opinion, no real formula or metric you can define for "good" code or "bad" code. Tickets and ceremonial activities, however, abstract that into an N-ary status value that seems easier to judge by.
And now they're almost forcing us to produce machine-made tech-debt at an industrial scale. The AI craze isn't going to produce the boon some people think it will. And the solution? More AI, unfortunately.
> And the solution? More AI, unfortunately.
I think the solution to using AI in coding is more testing, which unlocks even more AI.
The solution truly is more AI, yes.
> AI craze isn't going to produce the boon some people think it will.
What’s the boon you don’t think it will produce?
No. It's not more AI. The solution is designing, and sticking to, a development process that is more resilient to errors than what's currently happening. This isn't a novel idea. Code reviews weren't always part of the process; neither was VCS, nor bug trackers, etc.
The way AI is set up today, it's trying to replicate the (hopefully) good existing practices. Possibly faster. The real change comes from inventing better practices (something AI isn't capable of, at least not the kind of AI that's being sold to the programmers today).
What better practices do you mean? Are you saying we just need different, more agentic-friendly practices that ensure scaled reliability beyond what we can manually check? If so, I totally agree.
AI is 100% capable, fundamentally, of making new processes. Look, I mean, it's not like I think Opus 4.7 is all you need, but how can you argue with the fact that adoption since 4.5 has been an inflection point? That's kind of proof that reliability has reached a level where serious usage is possible. That's over a period of months. When you zoom out further, you see this was extremely predictable even a few years ago, despite the absolute hissy fits thrown on HN when CEOs began saying this.
Agentic coding is verifiable, and this implies there are very few practical limits to what it can do. Combine that with insanely active research on tackling the remaining issues (hallucinations — which are not a fundamentally unsolvable problem at a practical level — context rot, continual learning, etc.).
> What better practices do you mean?
I literally listed examples above... Code reviews weren't the norm until some time around 2010-ish. Then programmers realized that reviews help improve the code quality, and, eventually, this became so popular that today virtually everyone does it.
Anyways, I'll give an example from something that I've personally experienced / contributed to, which isn't as massive of a thing as code reviews, but is in the same general category.
Long ago, Git didn't have the --force-with-lease option. Few people used the `git rebase` command because of that (the only way to push the result was with --force, which could destroy someone else's work). At the company I worked for at the time, we extended Git with what was later implemented upstream as --force-with-lease. Our motivation was the need for linear history and some other stricter requirements on the repository history (such as: every commit must compile, retroactive modifications in response to tests added later, etc.).
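For anyone who hasn't run into the flag, this is roughly the difference it makes with today's upstream Git (a quick illustration, not our original in-house extension; the branch names are just placeholders):

```
# After rewriting local history, e.g.:
git rebase -i origin/main

# Plain --force overwrites the remote branch unconditionally, clobbering
# any commits a colleague may have pushed in the meantime:
git push --force origin feature

# --force-with-lease refuses the push if the remote branch has moved past
# what your local remote-tracking ref last saw, so you can't silently
# destroy someone else's work:
git push --force-with-lease origin feature
```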
That story is an example of how a process that, until then, was either prone to accidental loss of a programmer's work or resulted in poorly organized history was improved by inventing a new capability. This is also an example of something AI doesn't do, because, at its core, it's a program that tries to replicate the best existing tools and practices. It won't imagine a new Git feature because it has no idea what that feature could possibly be; its authors don't know either.
> opus 4.7 is all you need, but how can you argue with the fact that adoption since 4.5 has been an inflection point?
What did it invent?
Right, no, I understand what you mean; I asked to be sure, and you've confirmed my understanding.
I think we're talking past each other, because your comment is like 99% interesting and insightful, and I agree with it almost completely, but there is one part of your claim that I have an issue with, which is:
> It won't imagine a new Git feature because it has no idea what that feature could possibly be; its authors don't know either.
I left comments in other threads with a lot of detail, but this is a fairly common misconception. It is true in a sort of practical sense today, and I have had many experiences like yours in this respect, but the gist is: this is a world of RL with verifiable rewards; you are not bounded by human ability at all, and that is why we have the adoption, the funding, and the frothy excitement. It is not simply mimicking human coding. In the early stages it will, because human programming traces are used as a kind of bootstrap to get to an RL phase that has no such limit on performance. This is a very well-studied field, and it just isn't that much of a question of if anymore; it's not even really a question of when.
> What did it invent?
This is a perpetual question with constantly moving goalposts, so I've given up convincing anyone, but by now it's solving previously unsolved Erdős problems; not sure how convincing you find that (not Opus, though, but that hardly matters now).
The point I'm trying to make is: we aren't there yet, but it's a crazy idea to think that it isn't imminent, given all of the measurements and observations we have.
Additionally, my point about 4.5 being a turning point is adoption: you wouldn't see those adoption numbers if we were not accelerating rapidly from, say, 3.x-level performance along the scaling trend that we've known about for years now.