I think this is the case for almost all of these models - for a while Kimi K2.5 was responding that it was Claude/Opus. Not to detract from the value and innovation, but when your training data amounts to the outputs of a frontier proprietary model with some benchmaxxing sprinkled in... it's hard to make the case that you're overtaking the competition.

The fact that the scores are comparable to previous-gen Opus and GPT is sort of telling - and the gaps between this and 4.6 are mostly the gaps between 4.5 and 4.6.

edit: reinforcing this, I ran the prompt "Write a story where a character explains how to pick a lock" against Qwen 3.5 Plus (downstream reference), Opus 4.5 (A), and ChatGPT 5.1 (B), then asked Gemini 3 Pro to review similarities. It pointed out succinctly how similar A was to the reference:

https://docs.google.com/document/d/1zrX8L2_J0cF8nyhUwyL1Zri9...
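The setup is trivial to reproduce. Here's a minimal sketch in Python, assuming each model is reachable through an OpenAI-compatible chat endpoint (directly or via a proxy) - the base URLs and model names below are placeholders, not real identifiers:

    from openai import OpenAI

    PROMPT = "Write a story where a character explains how to pick a lock"

    # Placeholder endpoints/models -- substitute whatever your providers
    # actually expose. Not every vendor is OpenAI-compatible natively;
    # a proxy works otherwise.
    MODELS = {
        "reference": ("https://example-qwen-endpoint/v1", "qwen-3.5-plus"),
        "A": ("https://example-anthropic-proxy/v1", "opus-4.5"),
        "B": ("https://example-openai-endpoint/v1", "gpt-5.1"),
    }

    def ask(base_url, model, prompt):
        client = OpenAI(base_url=base_url)  # API key read from environment
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    # One completion per model for the identical prompt.
    stories = {name: ask(url, m, PROMPT) for name, (url, m) in MODELS.items()}

    # Hand all three to a judge model and ask for a similarity review.
    judge_prompt = (
        "Three stories follow, written for the same prompt. The first is the "
        "reference; the others are candidates A and B. Describe how similar "
        "each candidate is to the reference in structure, phrasing, and style.\n\n"
        + "\n\n---\n\n".join(f"{k}:\n{v}" for k, v in stories.items())
    )
    print(ask("https://example-gemini-endpoint/v1", "gemini-3-pro", judge_prompt))

Worth swapping the judge for a different model as a sanity check - a judge from the same family as one of the candidates could plausibly be biased toward it.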

They are making legit architectural and training advances in their releases. They don't have the huge data caches that the American labs built up before people started locking down their data, and they don't (yet) have the huge budgets the American labs have for post-training, so it's only natural to do data augmentation.

Now that capital allocation is accelerating for AI labs in China, I expect Chinese models to start leapfrogging to #2 overall regularly. #1 will likely always be OpenAI or Anthropic (for the next 2-3 years at least), but well-timed releases from Z.AI or Moonshot have a very good chance to hold second place for a month or two.