> They both completed the task perfectly except the "best" model (the bigger one) cost 5x more and took 3x longer...
Same for me. I certainly don't have the same definition of success and failure either.
A more expensive model has *less* room to wander around than a cheaper model.
If Claude wanders around for 10 minutes before finding the most obvious solution, then I count it as a failure.