I evaluate how good models are now by how good they are at removing code.
It’s fairly simple (assuming the test harness and agents.md are well written): do iterations of trying to remove code, ensure it passes, then have a human review it. Less code to review that way.