Ran the same query and there is a ton of output, but it looks like it's reasoning through the ambiguity of the sentence. It still gets the right answer. And if we compare the FLOPs spent to reach the answer against Opus, I think it's still a net win.
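
A rough way to sanity-check the FLOPs claim, as a back-of-envelope sketch only. It assumes the common approximation of about 2 FLOPs per parameter per generated token for a dense decoder, and it ignores attention over the context and prompt processing. The function names and all numbers are hypothetical placeholders, not measured values; Opus's parameter count isn't public, so only the ratio idea matters here.

```python
def decode_flops(params: float, tokens_generated: int) -> float:
    """Approximate decode cost: ~2 FLOPs per parameter per generated token.

    Assumption: dense model; for a mixture-of-experts model, pass the
    active parameter count instead.
    """
    return 2.0 * params * tokens_generated


def break_even_tokens(small_params: float, large_params: float,
                      large_tokens: int) -> float:
    """Tokens the small model can generate before matching the large model's cost."""
    return (large_params / small_params) * large_tokens


# Hypothetical placeholder sizes, NOT real model specs.
small = 1e9    # small reasoning model
large = 1e11   # frontier-scale model, 100x larger
budget = break_even_tokens(small, large, large_tokens=50)
print(f"small model breaks even at {budget:.0f} tokens")
```

Under this approximation, a model 100x smaller can emit 100x as many reasoning tokens before it costs the same, which is the sense in which a long trace can still be a net win.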

My hunch is that Opus-scale models probably have shortcuts encoded into the model that handle these ambiguous cases, whereas this model has learned a program to reason through the edge case (crystallized vs. fluid intelligence). In other words: remembering the probability (frontier) vs. calculating it on the fly (vibethink).