I'm wondering why all these token-saving solutions focus their benchmarks exclusively on simple Q&A tasks. If their tools truly saved money on real, long-running programming tasks, they would surely have published those benchmark results instead of just Q&A tests, especially since a simple code editing benchmark with a hidden eval harness is very easy to design. Personally, I very rarely ask a coding agent questions without having it edit any code.
I did exactly that and it's all covered in the blog post. There's no hidden eval harness; it's in the same codebase as the CLI so others can reproduce and/or extend it as they see fit. It also includes code editing tasks and measures them too. The only asterisk on the code editing is that I didn't automate the reporting of accuracy. The test only uses Claude, and having it judge its own work seemed dubious. Having our existing parsers + policy checks verify Claude's output in a benchmark like this might look like we were cooking the books in our favor (i.e., we'd be testing and verifying with our own system, which we will obviously always get 100% on). Writing a whole new independent Terraform parser or test harness to verify the results was beyond the scope of what I was willing to do for this right now. So I opted for a "just assume Claude always gets it right" approach, and we reported only the token differences to get there.
Sorry, I missed the Open Items section. You're right about that: designing a good eval harness can be difficult and expensive. Maybe we need some kind of community project for agentic evals, where people can share eval harnesses and run logs.