for us it's (usually) very easy as I work on performance optimization. a non-negligible part of this is correctness and verifiability, so we already have some of that.
to give you an example: just recently I coded a feature that, for our shuffle operation, can report which channel the bytes flowed through (the PR giving us the plumbing underneath landed upstream recently). what this basically means is that you run the shuffle, you know you've shuffled X bytes (because you have stats on both ends), and then you need to attribute them to different layers. on the first iteration, the count was off. the agent went, debugged, fixed, iterated, and then it was 1.5% off. again, it went, iterated, ... and now we're fine.
part of the task description was that the breakdown must match the known amount of bytes we're shuffling, so the agent took this on as a self-verification point. so besides running our normal, boring unit tests, integration tests and end-to-end verification harnesses (which it not only has programmatic/CLI/API access to, but which are also documented in .md files for the projects), it could use this criterion on top to verify.
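a minimal sketch of what such a self-verification check could look like, in Python. this is not our actual code: the function names, the channel names and the numbers are all made up for illustration. the only idea taken from the above is that the per-channel breakdown has to add up to the total that the stats on both ends already agree on.

```python
from typing import Mapping


def attribution_error(total_bytes: int, per_channel: Mapping[str, int]) -> float:
    """Relative gap between the known total and the sum of the breakdown.

    `per_channel` maps a (hypothetical) channel name to the bytes attributed
    to it. Returns 0.0 when the breakdown accounts for every byte.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    attributed = sum(per_channel.values())
    return abs(attributed - total_bytes) / total_bytes


def breakdown_matches(total_bytes: int, per_channel: Mapping[str, int],
                      rel_tol: float = 0.0) -> bool:
    """True if the breakdown matches the total within `rel_tol`.

    The default tolerance of 0.0 demands an exact match.
    """
    return attribution_error(total_bytes, per_channel) <= rel_tol


# Hypothetical numbers: 15 of 1000 bytes are unaccounted for (1.5% off).
incomplete = {"channel_a": 600, "channel_b": 385}
# Same run once every byte is attributed.
complete = {"channel_a": 600, "channel_b": 400}
```

with these made-up numbers, `breakdown_matches(1000, incomplete)` fails and `breakdown_matches(1000, complete)` passes, which is the kind of pass/fail signal an agent can iterate against.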
looking at /usage, my API duration was 2h 43m, and on top of that:
claude-haiku-4-5: 2.7k input, 115.3k output, 16.3m cache read, 867.9k cache write ($3.30)
claude-opus-4-8: 46.9k input, 555.0k output, 166.6m cache read, 2.9m cache write ($115.77)
Definitely agree that performance optimization is a good use case for LLMs. Here you have both a measurable goal / objective function and guardrails against functional regressions. It kind of closes the loop in that regard.
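To make the "closed loop" concrete, here is a rough Python sketch under my own assumptions (the function name, the 5% threshold and the idea of comparing against a reference implementation are illustrative, not anything from the parent comment). A candidate is accepted only if it agrees with the reference on every test input (the guardrail) and is measurably faster (the objective).

```python
from typing import Any, Callable, Iterable


def accept_candidate(reference: Callable[[Any], Any],
                     candidate: Callable[[Any], Any],
                     inputs: Iterable[Any],
                     baseline_seconds: float,
                     candidate_seconds: float,
                     min_speedup: float = 1.05) -> bool:
    """Accept an optimization only if it is both correct and faster.

    Timings are passed in rather than measured here, so the decision
    itself stays deterministic and easy to test.
    """
    if baseline_seconds <= 0 or candidate_seconds <= 0:
        raise ValueError("timings must be positive")
    # Guardrail: any disagreement with the reference is a regression.
    for x in inputs:
        if candidate(x) != reference(x):
            return False
    # Objective: require at least `min_speedup` times the baseline speed.
    return baseline_seconds / candidate_seconds >= min_speedup
```

The guardrail is only as strong as the inputs you feed it, of course.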
One thing, however, is that a test suite is not usually exhaustive in the sense that any code that passes the tests is valid. Usually tests are more complementary in nature. Therefore you could still get code degradation.
> One thing, however, is that a test suite is not usually exhaustive in the sense that any code that passes the tests is valid. Usually tests are more complementary in nature.
Not in the world of AI - if your tests don't catch any known issues, the problem is the tests aren't comprehensive enough. There's no excuse at this point not to have an incredibly comprehensive test suite to go with your other agent feedback loop constraints.
> if your tests don't catch any known issues, the problem is the tests aren't comprehensive enough.
Maybe I misunderstand, but this seems like a fairly low bar if the test suite only covers existing bugs.
I'd argue that if you aren't going to look at the code, you actually need a fully comprehensive test suite, in the sense that if the tests pass, the code is correct and you don't have to look at it at all. The problem is that such a suite isn't very quick to create, it seems. Of course, if there is a way to do it quickly, in a way that is reproducible by others, I'd love to hear about it.
I don't mean just bugs, I mean any known issues. I test infra, I test UI, I test binary protocols, you name it. There is certainly no fast way to do it, even with AI (an AI-generated suite is better than nothing but not as good), and it's a serious investment, but it's worth it. Testing becomes a process of correctness checking that snowballs over time, making everything else easier and better (or else the tests need further adjustment!).