Beats Opus 4.5 on reasoning you say?
Prompt: If A goes to B who then goes to C, can A send something to C?
Response:
We need to interpret best. The phrase "If A goes to B who then goes to C, can A send something to C?" could be a puzzle about the concept of sending something (like passing a ball) and the relationships.
Scenario: A gives something to B, and B passes it on to C. Question: Can A also give the same thing to C? Answer: Only if A can obtain a second copy (e.g., the thing was duplicated). Otherwise, after handing it to B, A no longer holds it and cannot “send” it unless a copy exists.
[Lots of other unnecessary commentary and "scenarios" that make even lesser sense]
I am a human and I don't know how to interpret this prompt.
Ran the same query and there is a ton of stuff, but it looks like it's reasoning through the ambiguity of the sentence. It still gets the right answer. Moreover, if we consider the FLOPs expended to get to the answer, and compare that to opus, I think it's still a net win.
My hunch is that Opus scale models probably have shortcuts encoded into the model that handle these ambiguities cases, wheres this model has learned a program to reason through the edge case (crystalized vs fluid intelligence). Remembering that probablity (frontier) vs calculating it on the fly (vibethink)
If A goes to B who then goes to C does C know A?
> Multi-level Quality Control.
> [...]
> LLM-based Query Quality Filtering. We utilize capable LLMs to assess query quality, filtering out samples with incomplete descriptions, unreasonable conditions, invalid logic, or an inability to effectively assess target knowledge points.