FrontierMath, GPQA Diamond, and BrowseComp are the benchmarks I noticed this on.
Are you maybe comparing the pro model to the non-pro model with thinking? Granted, it’s a bit confusing, but the pro model is 10 times more expensive and probably much larger as well.
Ah yes, okay that makes more sense!