I mean, asking these transformers to do maths has always been the wrong task. It's like we're now treating "it doesn't have x tools, written in traditional code, built in" as a failing.

Though I suppose we're testing their model + agent harness here as well. It really _should_ have all of those tools/reasoning available to accomplish a task like the above without issue.

It's only been the wrong task because they've been deficient at it and expensive to use, so we had workarounds. They are getting better at these tasks and cheaper (sometimes). It's fair to evaluate them on it even if more economical and accurate alternatives are available.