I agree with that - with Stage we're not trying to replace reading code with AI summaries, but rather guiding the reviewer through reading code in the way that makes most sense and coming away with the best understanding
How do you handle the problem of AI misleading by design? For example, Claude already lies regularly, and quite convincingly, in exactly this situation: trying to convince you that what is actually broken isn't such a big deal after all, or similar.
How can this product possibly improve the status quo of AI constantly, without end, trying to 'squeak things by' during any and all human and automated review processes? That is, you are giving the AI which already cheats like hell a massive finger on the scale to cheat harder. How does this not immediately make all related problems worse?
The bulk of difficulty in reviewing AI outputs is escaping the framing they never stop trying to apply. It's never just some code. It's always some code that is 'supposed to look like something', alongside a ton of convincing prose promising that it _really_ does do that thing and a bunch of reasons why checking the specific things that would tell you it doesn't isn't something you should do (hiding evidence, etc).
99% of the problem is that the AI already has too much control over presentation when it is motivated about the result of eval. How does giving AI more tools to frame things in a narrative form of its choice and telling you what to look at help? I'm at a loss.
The quantity of code has never been a problem. Or prose. It's that all of it is engineered to mislead / hide things in ways that require a ton of effort to detect. You can't trust it and there's no equivalent of a social cost of 'being caught bullshitting' like you have with real human coworkers. This product seems like it takes that problem and turns the dial to 11.
Thanks for sharing this. I do agree with a lot of what you said, especially around trusting what it's actually telling you.
For me, I only run into problems with an agent misleading or lying to me when working on a large feature, where the agent has a strong incentive to lie and pretend the work is done. However, that same incentive doesn't seem to exist for a completely separate agent that is just generating a narrative of a pull request. Would love to hear what you think.
There is no separation. Incentive propagates through LLMs with approximately zero resistance. If the input tells a story, the output tends toward that story, reinforced.
The code/PR generator is heavily incentivized to spin by RL on human feedback - as soon as that spin comes into contact with your narrative-gen context, it's cooked. Any output that has actually seen the spin is tainted and starts spinning itself. And then there's also spin originating in the narrative gen itself... Hence, the examples read like straight advertisements, totally contaminated, shot through with messaging like:
- this is solid, very trustworthy
- you can trust that this is reliable logic with a sensible, comprehensible design
- the patterns are great and very professional and responsible
- etc
If the narrative reads like a glow-up photoshoot for the PR, something has gone extremely wrong. That is not conducive to fairly reviewing it. The work is presented as far better than it actually is, and even if there are no outright lies, the whole thing is a mischaracterization.
RL is a hell of a drug.
Anyway, this is the problem of AI output. You cannot trust that the impression it presents is the reality, or even a best attempt at reality. You have to carefully assemble your own view of the real state of things in parallel to whatever it gives you, which is a massive pain in the ass. And if you skip that, you just continually let defects/slop through.
The worst problem mucking things up is that RL insights that work on people also work on AI, because the AI is modelling human language patterns. Reviewing slop sucks because it's filled with (working) exploits against humans, and AI reviewers can't help because they are immediately subverted by the same exploits. So I guess it requires finding a way to strip out the exploits without changing the mechanical details. But that's hard, because the spin saturates 100% of the output at many levels of abstraction, including the mechanical details themselves.
But how do you know they’re not lying to you? What are your benchmarks for this? Experience? Anecdote? Data?
And I’m asking you in good faith - not trying to argue.
I’m thinking about these types of questions on a daily basis, and I love to see others thinking about them too.