Feeling very much the same. When I tried it as the model in Claude Code, it completely lost all context on what it was doing after a few months and kept short-circuiting, even with the most helpful prompts I could give short of just writing out the answer myself. I really don't get the praise for this model.
Being "better than Opus 4.6" is not really something a benchmark will tell you. It's much more a consensus of users liking the flavor of its answers than the model hitting x% correct on a benchmark.