I feel like OP is still in the year 2025.

> The AI will have gone off the rails multiple times and you will only notice it later when you actually try to use the software.

Except that said AI can now use your software itself and find and fix bugs on its own, not to mention drive new features.

> Your agent might go “off the rails” and start doing something you don’t want it to do

This happens but far less often than it used to, and the case for full autonomous agents is getting stronger, not weaker.

> It is humanly impossible to build your own understanding of a codebase

This again feels outdated. I think we're moving towards humans no longer needing to understand a codebase, and letting AI drive it.

> I think we're moving towards humans no longer needing to understand a codebase, and letting AI drive it.

The AI companies are incentivized to push this kind of reckless slopmaxxing - the end result is that your business is totally dependent on them and your product's value entirely sourced from them. And a lot of people are buying it, but I think it's a silly fad.

You're right, understanding things is so 2025.

> I think we're moving towards humans no longer needing to understand a codebase, and letting AI drive it.

I can see this being true for non-critical software like entertainment, media, and so on.

Definitely not true for systems where the security stakes are high, like banking, aviation, defense, etc. AI will surely contribute, but not independently of human engineering understanding.

All of the fields you mentioned have a lot of strict compliance measures, and it is highly unlikely that AI will just be able to take over. Ironically, almost all aviation code is actually machine-generated using things like Simulink.

> Except that said AI can now use your software itself and find and fix bugs on its own, not to mention drive new features.

Anyone with sufficiently good taste in how to program effectively and architect will disagree with you on this. The short leash method is how you ensure good results when you're functioning outside of the training data. If you're even a modestly above average programmer this is afaik the only way to ensure fast, quality development with LLMs.

> This again feels outdated. I think we're moving towards humans no longer needing to understand a codebase, and letting AI drive it.

I think you are perhaps unaware of a world of programming where AI is still woefully inept. In every language with manual memory management, I have very consistently observed frequent issues with how it handles memory. Trust me, it's not as simple as sticking it in a loop with Valgrind.

> This happens but far less often than it used to, and the case for full autonomous agents is getting stronger, not weaker.

This is what I do not see. My journey, just a couple of weeks ago, with Claude Code + Opus 4.8. The task was not too complicated: 4 new API endpoints plus events streamed from the client over a websocket.

1. Multiple iterations on the API definitions to refine request/response models, the database schema, and the whole flow. A lot of corrections, removing contradictions, and manual changes in the document. Opus went off the rails all the time. The final document was 500+ lines.

2. API integration tests. Once again, back and forth. The AI was unable to create tests directly from the document, so it took 2 iterations. The first was to create placeholders with Given-When-Then comments, then review and correct them by hand. The second was to implement the tests. A lot of mistakes were corrected after review.

3. Implementation. CC got the API document, working tests (modifications blocked by a hook), 6+ "best practices" skills (most promptly ignored), "rubber duck" and "code simplifier" agents, and pre-cooked scripts to run the tests, run the linter, and check for compilation errors. Plan + execution + review, with multiple corrections on the way. Feature implemented, all tests passed.

4. Code review. On average, I found one issue per 20 lines of code. Not counting code style, these were things like: using an in-memory semaphore in a Kubernetes service (the deployment is described in CLAUDE.md), and 8 database calls to update the same record during a single request. One column at a time! Read-modify-save without a transaction. Mistakes in business logic, failure recovery, authorization.
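To make the database items in point 4 concrete, here is a minimal sketch of the shape a reviewer would usually expect instead: one `UPDATE` that sets every changed column, with the increment done in SQL rather than read-modify-save, all inside a single transaction. The `sessions` table, its columns, and `apply_event` are hypothetical names for illustration, and SQLite stands in for whatever database the real service uses. It does not address the in-memory semaphore, which needs coordination outside the process.

```python
import sqlite3


def apply_event(conn: sqlite3.Connection, session_id: int, delta: int, status: str) -> None:
    """Update all changed columns in one statement, inside one transaction."""
    # `with conn` commits on success and rolls back if an exception is raised.
    with conn:
        cur = conn.execute(
            # The increment happens in SQL, so there is no read-modify-save race.
            "UPDATE sessions "
            "SET event_count = event_count + ?, status = ? "
            "WHERE id = ?",
            (delta, status, session_id),
        )
        if cur.rowcount != 1:
            raise LookupError(f"session {session_id} not found")


# Hypothetical schema and data, only to make the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sessions ("
    "id INTEGER PRIMARY KEY, event_count INTEGER NOT NULL, status TEXT NOT NULL)"
)
conn.execute("INSERT INTO sessions VALUES (1, 0, 'open')")
conn.commit()

apply_event(conn, 1, 3, "streaming")
row = conn.execute("SELECT event_count, status FROM sessions WHERE id = 1").fetchone()
```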

The result: almost one workweek, $100+ in tokens, and one thought: was it worth the effort? P.S. I have a team of 2 developers. Just got a PR to review from one of them. 80% slop.

Same thing I'm seeing, all the "AI practitioners" at my company with their advanced workflows are just shipping mountains of slop, and end up either putting the actual work on the reviewers, or the poor soul that's on call when an incident occurs.

I feel like people that have built crazy AI workflows have developed a false sense of confidence that their guardrails are helping them ship clean/correct code with little review when it isn't the case at all. In reality, the models and harnesses are at a point where there's very little difference as long as your prompts are somewhat reasonable, and the quality of the code ultimately comes down to the level of care and effort the implementor puts into it.

I don't think the first people that are going to be replaced by AI are going to be the people who don't use it extensively. The first that will be replaced are going to be those that are using AI mindlessly, because at that point, what are you besides a very expensive human LLM interface? To be clear, I'm not "anti-AI", I use AI quite extensively (in a way that's similar to what's described in the article), I just think that it's being pushed in a completely unsustainable way and the industry is in a collective psychosis over its capabilities.

> The first that will be replaced are going to be those that are using AI mindlessly, because at that point, what are you besides a very expensive human LLM interface?

I think this archetype has a good chance of surviving. Not because of merit, but because they will be the only ones able and willing to work on projects taken over by AI slop.

I'm very much aligned with everything else you said.

> I think we're moving towards humans no longer needing to understand a codebase, and letting AI drive it.

Hard disagree. Even the best frontier models generate output that's not what I asked for. Sometimes I realize that I get lazy in my prompting and the lack of specificity winds up showing up in the output. Just the other day, a coworker built a huge feature using frontier models and it slipped an IDOR in.
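For anyone who hasn't run into one: an IDOR (insecure direct object reference) is a handler that looks a record up by a client-supplied ID without checking that the caller is allowed to see it. The sketch below is a generic illustration, not the coworker's actual bug; `Invoice`, `INVOICES`, `Forbidden`, and `get_invoice` are all hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Invoice:
    id: int
    owner_id: int
    total_cents: int


class Forbidden(Exception):
    """Raised when the caller may not access the requested object."""


# Hypothetical in-memory store standing in for a real database.
INVOICES = {
    1: Invoice(id=1, owner_id=10, total_cents=5000),
    2: Invoice(id=2, owner_id=20, total_cents=7500),
}


def get_invoice(current_user_id: int, invoice_id: int) -> Invoice:
    """Return the invoice only if it belongs to the authenticated user."""
    invoice = INVOICES.get(invoice_id)
    # The IDOR version of this function stops at the lookup above.
    # Missing and not-owned get the same error so IDs cannot be enumerated.
    if invoice is None or invoice.owner_id != current_user_id:
        raise Forbidden(f"invoice {invoice_id} is not accessible")
    return invoice
```

The check is a single line, which is exactly why it is easy to miss in a large generated changeset where every test exercises the happy path with the owner's own ID.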

I just don't see a world in which we completely cede control of the codebase to AI because it's still my ass on the line if I ship something that completely borks production. If I'm not reading code regularly, then I lose the ability to read code, and if I lose that ability, then I'm no longer a developer.

> Sometimes I realize that I get lazy in my prompting and the lack of specificity winds up showing up in the output.

I wouldn't blame your "lazy" prompting. Specification is just really hard. This is why we stopped doing waterfall software development. I think the current-day obsession with one-shotting software forgets why we had to stop trying to figure everything out up front.

I can't help but feel that this reads more as a reflection that you don't want to stop being a developer than as evidence that things aren't moving in the direction the GP said they are.

Maybe, it seems like a bad idea for so many reasons though. Take away tactile code review, insert a layer of prompts and tooling between developers and the codebase, and you've created the conditions to let all kinds of nefarious things happen in a codebase. A disgruntled employee updates agent prompts instructing the code review bot to ignore data exfiltration vulnerabilities (because if we aren't reviewing code, we're probably not reviewing prompts either), ships a backdoor, and you better hope that your network monitoring catches it.

If you are just shipping code blindly without reviewing anything then that's your fault. My company heavily uses AI (I'd say 90% of code is written with AI assistance) but we never ship anything that hasn't been reviewed by a human.

This is how we use it for code reviews:

- a skill tells the agent to automatically run a subset of tests and linting before each commit

- another skill tells it to review the entire changeset before creating a PR, this review has more extensive rules that can't easily be put into code (e.g. linter rules) based on PR comments humans have written. It also sometimes catches things that were missed from the original prompt/task.

- when the PR is created we run a few AI tools to do automated code and security reviews. CI runs at the same time.

- the agent waits for these to complete, and verifies and fixes any issues if they are valid

- after all that it's passed back to the author to review

- once they are happy it's passed to a teammate to review

So we are not handing off reviews to AI, we are using it to do much more extensive reviews, and automatically fix stupid stuff the AI or human might have done. So by the time you are asked to review a PR, it should be pretty much ready to go, you can focus on what it's actually changing instead of looking for slop.

> If you are just shipping code blindly without reviewing anything then that's your fault.

Did you miss or already forget the context of "humans no longer needing to understand a codebase, and letting AI drive it"? You're not doing that, either. You cannot "review" something you don't understand. You can "try it out" maybe.

The thread I responded to is about no longer needing to read code at all, not AI-assisted code-review. I definitely use AI-assisted code review. OP is arguing that one day we won't need to read code at all, which I disagree with.

> This again feels outdated. I think we're moving towards humans no longer needing to understand a codebase, and letting AI drive it.

Seems so, but that doesn't mean it's a good or correct direction. As of today, none of the existing models can meaningfully handle mid-size tasks on five services with 10k+ LOC each, plus infra (I'm really not interested in greenfield projects done over the weekend that were never touched by actual users). It doesn't make them useless, but it significantly reduces the scope of trustworthy operations models can handle (unless you don't care about outcomes).

The moment your spec, plan, and results of related codebase exploration go beyond 100k tokens (roughly 50% of available context), quality degradation becomes real. Threads/subagents can help, and you can argue that code reviews mitigate some issues, but that's transitioning from reliable automation to gambling without human oversight. Say you want to mitigate the risks of failures (correctly listed by others) - how would you do that if you don't understand your codebase? In my practice, the answer is: you start to learn what your agents created, discover shit they created, and steer them toward better, desired outcomes.
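As a back-of-the-envelope check of that budget: if 100k tokens is roughly 50% of the context, the window is about 200k tokens. The sketch below adds up the pieces and flags when they cross the halfway mark. The 4 characters per token ratio is a crude assumption (real tokenizers vary by model and content), and the function and constant names are made up for illustration.

```python
from typing import Dict, Tuple

CONTEXT_WINDOW_TOKENS = 200_000  # assumption implied by "100k is roughly 50%"
DEGRADATION_FRACTION = 0.5       # the point where quality reportedly starts to drop


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; a real tokenizer would be more accurate."""
    return int(len(text) / chars_per_token)


def check_budget(parts: Dict[str, str]) -> Tuple[int, bool]:
    """Return (estimated total tokens, whether the total exceeds the budget)."""
    total = sum(estimate_tokens(text) for text in parts.values())
    return total, total > CONTEXT_WINDOW_TOKENS * DEGRADATION_FRACTION


# Placeholder strings sized to stand in for real documents.
total, over = check_budget({
    "spec": "x" * 200_000,         # ~50k tokens
    "plan": "x" * 120_000,         # ~30k tokens
    "exploration": "x" * 100_000,  # ~25k tokens
})
```

With these sizes the estimate lands at about 105k tokens, just past the 100k mark, before the agent has written a single line of code.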

> As of today, none of the existing models can meaningfully handle mid-size tasks on five services with 10k+ LOC each

My FAANG's codebase is a few orders of magnitude larger, and agents do an excellent job of handling mid-sized tasks completely autonomously.

Whatever we're moving toward, I currently can't let any SOTA model + harness operate on more than ~10k changed SLOC at once, and even then only with very careful prompting I thoroughly understand, only on the simplest of problems, and only if I pause it at key points to correct some sort of nonsense thinking and put in a significant cleanup pass and am still willing to tolerate some bullshit. Tooling is impressive for sure, but it's not magic.
