The problem is that this is very hard to replicate, and benchmarks focus on end-to-end (E2E) tests that go from a single prompt to the final solution.
They do not test how models perform when used interactively, which is how most of us use them.