I did exactly that and it's all covered in the blog post. There's no hidden eval harness; it's in the same codebase as the CLI, so others can reproduce and/or extend it as they see fit. It also includes code editing tasks and measures them too. The only asterisk on the code editing is that I didn't automate the reporting of accuracy. The test only uses Claude, and having it judge its own work seemed dubious. Having our existing parsers + policy checks verify Claude's output in a benchmark like this might also look like we were cooking the books in our favor (i.e., we'd be testing and verifying with our own system, which we would obviously always score 100% on). Writing a whole new independent Terraform parser or test harness to verify the results was beyond the scope of what I was willing to do for this right now. So I opted for "just assume Claude always gets it right" and reported only the token differences to get there.

Sorry, I missed the Open Items section. You're right about that: designing a good eval harness can be difficult and expensive. Maybe we need some kind of community project for agentic evals, where people can share eval harnesses and run logs.