1000% this. This was us internally testing whether our harness worked; the motivation was never to test them in depth, 1v1. We were just really shocked at the results, and there’s a lot more work to do here.

Can you run Claude Opus through the same Pydantic harness and add the cost to the benchmark result table? An isolated price is meaningless.