> They both completed the task perfectly except the "best" model (the bigger one) cost 5x more and took 3x longer...

Same here. I certainly don't have the same definition of success and failure either.

A more expensive model has *less* room to wander around than a cheaper one.

If Claude wanders around for 10 minutes before finding the most obvious solution, I count that as a failure.