I think this is an interesting idea, but I also somewhat suspect you've replaced a tedious problem with a harder, more tedious problem.

Take your idea further. Now I've got 100 agents, and 100 PRs, and some small percentage of them are decent. The task went from "implement a feature" to "review 100 PRs and select the best one".

Even assuming you can ditch 50 percent right off the bat as trash... Reviewing 50 potentially buggy implementations of a feature and selecting the best genuinely sounds worse than just writing the solution.

Worse... If you haven't solved the problem before anyways, you're woefully unqualified as a reviewer.

The theory of constraints is clear on this: speeding up a step that isn't the bottleneck doesn't improve the system; it just piles more work up in front of the bottleneck.

The idea that too little code is the problem is itself the problem. Code is a liability. Making more of it faster (and probabilistically) is a fantastically bad idea.

What if we replace the PR with a QA test rig? Then hire a bunch of QA monkeys to find bugs in the implementations and select the "bug-free" one.

There should be test cases run and coverage ensured. This is trivially automated. LLMs should also review the PRs, at least initially, using the test results as part of the input.
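For what it's worth, the mechanical half of that is straightforward to sketch. Here is a rough Python outline, assuming pytest with the pytest-cov plugin; the branch names and the review_with_llm stub are placeholders, not anyone's actual setup:

    import json
    import subprocess

    def test_report(branch: str) -> dict:
        """Check out a candidate branch, run its tests with coverage,
        and return a machine-readable summary."""
        subprocess.run(["git", "checkout", branch], check=True)
        # The suite is allowed to fail; a red run is still useful review input,
        # so no check=True here.
        result = subprocess.run(
            ["pytest", "--cov", "--cov-report=json", "-q"],
            capture_output=True, text=True,
        )
        with open("coverage.json") as f:
            coverage = json.load(f)["totals"]["percent_covered"]
        return {
            "branch": branch,
            "tests_passed": result.returncode == 0,
            "coverage_percent": coverage,
            "pytest_tail": result.stdout[-2000:],  # keep the review prompt small
        }

    def review_with_llm(report: dict) -> str:
        """Placeholder: send the diff plus this report to whatever model
        you use for review and return its critique."""
        raise NotImplementedError

    reports = [test_report(b) for b in ["agent-pr-1", "agent-pr-2"]]
    # Cheap filter before any model sees anything: red suites go straight out.
    survivors = [r for r in reports if r["tests_passed"]]

The point being that the test-and-coverage gate itself is cheap to automate; the LLM review sits on top of it.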

Who tests the tests? How do you know that the LLM-generated tests are actually asserting anything meaningful and cover the relevant edge cases?

The tests are part of the code that needs to be reviewed in the PR by a human. They don't solve the problem, they just add more lines to the reviewer's job.

So now either the agent is writing the tests, in which case you're right back to the same issue (which tests are actually worth running?) or your job is now just writing tests (bleh...).

And for the LLM review of the PR... Why do you assume it'll be worth any more than the original implementation? Or are we just recursing down a level again (if 100 LLMs review each of the 100 PRs... To infinity and beyond!)

This by definition is not trivially automated.

The LLMs can help with writing the tests, but you should verify that they're testing the critical aspects and that known edge cases are covered. A single review-promoted LLM can then run those tests across the PRs and provide a summary recommending acceptance of the best one. Or discard them all and do it manually; that initial pass should only have taken a few minutes, so the wastage is minimal in the grand scheme of things, provided a decent share of runs end in acceptance over time, compared with the 100% manual alternative and the time it sinks.
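As a sketch of that accept-or-discard step (purely illustrative: the scoring callable and the 0.8 bar are assumptions, plug in whatever model and threshold you actually trust):

    from typing import Callable, Optional

    ACCEPT_THRESHOLD = 0.8  # placeholder bar; tune to taste

    def pick_or_punt(reports: list[dict],
                     score_with_llm: Callable[[dict], float]) -> Optional[dict]:
        """Score each candidate's test report with one reviewer model, then
        either forward the best PR for human review or give up and do it by hand."""
        scored = sorted(((score_with_llm(r), r) for r in reports),
                        key=lambda pair: pair[0], reverse=True)
        if scored and scored[0][0] >= ACCEPT_THRESHOLD:
            return scored[0][1]   # the one PR a human actually looks at
        return None               # discard the lot; back to manual

If it returns None you've lost a few minutes; if it returns a report, a human still reviews that one PR, so the model is only a filter, not the approver.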

The linked article from Steve Yegge (https://sourcegraph.com/blog/revenge-of-the-junior-developer) provides a 'solution', which he thinks is also imminent - supervisor AI agents, where you might have 100+ coding agents creating PRs, but then a layer of supervisors that are specialized on evaluating quality, and the only PRs that a human being would see would be the 'best', as determined by the supervisor agent layer.

From my experience with AI agents, this feels intuitively possible - current agents seem to be OK (though not yet 'great') at critiquing solutions, and such supervisor agents could help keep the broader system in alignment.

>but then a layer of supervisors that are specialized on evaluating quality

Why would supervisor agents be any better than the original LLMs? Aren't they still prone to hallucinations and subject to the same limitations imposed by training data and model architecture?

It feels like it just adds another layer of complexity and says, "TODO: make this new supervisor layer magically solve the issue." But how, exactly? If we already know the secret sauce, why not bake it into the first layer from the start?

Similar to how human brains behave, it is easier to train a model to select the better solution from among many candidates than to check an individual solution for correctness [1], which is in turn an easier task to learn than writing a correct solution in the first place.

[1] Also, the differences in logic between candidate solutions can surface good ideas that subsets of the solutions missed.

Just add a CxO layer that monitors the supervisors! And the board of directors watches the CEO and the shareholders monitor the board of directors. It's agents all the way up!


LLMs are smarter in hindsight than going forward, sort of like humans! Only they don't have such flexible self-reflection loops, so they tend to fall into local minima more easily.

This reads like it could result in "the blind leading the blind". Unless the supervisor AI agents are deterministic, it can still be a crapshoot. Given the resources that Sourcegraph has, I'm still surprised they missed the most obvious thing, which is that "context is king": we need tooling that makes adding context to LLMs dead simple. Basically, we should be optimizing for the humans in the loop.

Agents have their place for trivial and non-critical fixes/features, but the reality is, unless agents can act in a deterministic manner across LLMs, you really are coding with a loaded gun. The worst part is, agents can really dull your senses over time.

I do believe in a future where we can trust agents 99% of the time, but the reality is that we are not training on the thought process needed for that to happen. That is, we are not focused on conversation-to-code training data. I would say 98% of my code is AI generated, and it is certainly not vibe coding. I don't have a term for it, but I am literally dictating to the LLM what I want done and having it fill in the pieces. Sometimes it misses the mark, sometimes it aligns, and sometimes it introduces whole new ideas that I had never thought of, which lead to a better solution. The instructions I provide are based on my domain knowledge, and I think people are missing the mark when they talk about vibe coding in a professional context.

Full Disclosure: I'm working on improving the "conversation to code" process, so my opinions are obviously biased, but I strongly believe we need to first focus on better capturing our thought process.

I'm skeptical that we would need determinism in a supervisor for it to be useful. I realize it's not exactly analogous, but the current human parallel, with senior/principal/architect-level SWEs reviewing code from less experienced devs (or even similarly- or more-experienced devs), is far from deterministic, but it certainly improves quality.

Think about how differently a current agent behaves when you say "here is the spec, implement a solution" vs "here is the spec, here is my solution, make refinements" - you get very different output, and I would argue that the 'check my work' approach tends to have better results.
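Concretely, the two framings are roughly the following (wording and file names are just illustrative):

    from pathlib import Path

    spec = Path("spec.md").read_text()
    draft = Path("my_solution.py").read_text()

    # Framing 1: generate from scratch.
    generate_prompt = f"Here is the spec:\n{spec}\nImplement a solution."

    # Framing 2: critique/refine an existing attempt -- the one argued above
    # to give better results.
    refine_prompt = (
        f"Here is the spec:\n{spec}\n"
        f"Here is my solution:\n{draft}\n"
        "Point out defects and make refinements; keep the overall approach."
    )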