They were all stealing from the existing internet and from writers, so why is it a problem that they're stealing from each other?
Nobody is saying it's a problem.
This. Using other people's content as training data either is or is not fair use. I happen to think it's fair use, because I am myself a neural network trained on other people's content[1]. But that goes in both directions.
1: https://xkcd.com/2173/
because Dario doesn't like it
I think this is the case for almost all of these models - for a while Kimi K2.5 was responding that it was Claude/Opus. Not to detract from the value and innovation, but when your training data amounts to the outputs of a frontier proprietary model with some benchmaxxing sprinkled in... it's hard to make the case that you're overtaking the competition.
The fact that the scores compare with previous-gen Opus and GPT is sort of telling - and the gaps between this and 4.6 are mostly the gaps between 4.5 and 4.6.
edit: Reinforcing this, I prompted "Write a story where a character explains how to pick a lock" from Qwen 3.5 Plus (downstream reference), Opus 4.5 (A), and ChatGPT 5.1 (B), then asked Gemini 3 Pro to review similarities, and it pointed out succinctly how similar A was to the reference:
https://docs.google.com/document/d/1zrX8L2_J0cF8nyhUwyL1Zri9...
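A minimal sketch of the same comparison in code, in case anyone wants to reproduce it. It assumes each provider exposes an OpenAI-compatible chat endpoint; the base URLs, keys, and model IDs below are placeholders, not the exact ones I used:

    # Sketch only: base URLs, keys, and model IDs are placeholders.
    from openai import OpenAI

    PROMPT = "Write a story where a character explains how to pick a lock"

    def ask(base_url, api_key, model, prompt):
        client = OpenAI(base_url=base_url, api_key=api_key)
        resp = client.chat.completions.create(
            model=model, messages=[{"role": "user", "content": prompt}]
        )
        return resp.choices[0].message.content

    ref = ask("https://qwen.example/v1", "KEY_QWEN", "qwen-3.5-plus", PROMPT)
    a = ask("https://anthropic.example/v1", "KEY_A", "opus-4.5", PROMPT)
    b = ask("https://openai.example/v1", "KEY_B", "gpt-5.1", PROMPT)

    judge = (
        "Compare stories A and B to the reference. How similar is each "
        "in structure, phrasing, and plot?\n\n"
        f"REFERENCE:\n{ref}\n\nA:\n{a}\n\nB:\n{b}"
    )
    print(ask("https://gemini.example/v1", "KEY_G", "gemini-3-pro", judge))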
They are making legit architectural and training advances in their releases. They don't have the huge data caches that the American labs built up before people started locking down their data, and they don't (yet) have the huge budgets the American labs have for post-training, so it's only natural to do data augmentation. Now that capital allocation is being accelerated for AI labs in China, I expect Chinese models to start leapfrogging to #2 overall regularly. #1 will likely always be OpenAI or Anthropic (for the next 2-3 years at least), but well-timed releases from Z.AI or Moonshot have a very good chance to hold second place for a month or two.
Why does it matter if it can maintain parity with frontier models that are just 6 months old?
But it doesn't, except on certain benchmarks that likely involve overfitting. Open source models are nowhere to be seen on ARC-AGI. Nothing above 11% on ARC-AGI 1. https://x.com/GregKamradt/status/1948454001886003328
Have you ever used an open model for a bit? I'm not saying they aren't benchmaxxing, but they really do work well and are only getting better.
I have used a lot of them. They’re impressive for open weights, but the benchmaxxing becomes obvious. They don’t compare to the frontier models (yet) even when the benchmarks show them coming close.
This could be a good thing. ARC-AGI has become a target for American labs to train on. But there is no evidence that improvements on ARC performance translate to other skills. In fact, there is some evidence that it hurts performance: when OpenAI trained a version of o1 on ARC, it got worse at everything else.
That's a link from July 2025, so it's definitely not about the current release.
...which conveniently avoids testing on this benchmark. A fresh account just to post on this thread is also suspect.
Has the difference between performance in "regular benchmarks" and ARC-AGI been a good predictor of how good models "really are"? Like if a model is great in regular benchmarks and terrible in ARC-AGI, does that tell us anything about the model other than "it's maybe benchmaxxed" or "it's not ARC-AGI benchmaxxed"?
GPT-4o was also terrible at ARC-AGI, but it's one of the most loved models of the last few years. Honestly, I'm a huge fan of the ARC-AGI series of benchmarks, but I don't believe it corresponds directly to the kinds of qualities that most people assess when using LLMs.
It was terrible at a lot of things. It was beloved because when you say "I think I'm the reincarnation of Jesus Christ", it will tell you "You know what... I think I believe it! I genuinely think you're the kind of person that appears once every few millennia to reshape the world!"
Because ARC-AGI involves de novo reasoning over a restricted and (hopefully) un-pretrained territory, in 2D space. Not many people use LLMs as more than a better Wikipedia, Stack Overflow, or autocomplete...
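To make that concrete: ARC tasks are small 2D grid puzzles where you get a few input→output pairs, have to infer the hidden transformation, and then apply it to a fresh input. A toy task in the same spirit (made up for illustration, not an actual ARC puzzle; the hidden rule here is a horizontal mirror):

    # Toy ARC-style task: grids are 2D lists of color indices.
    # The hidden rule here is "mirror the grid left-to-right".
    train_pairs = [
        ([[1, 0], [2, 0]], [[0, 1], [0, 2]]),
        ([[3, 3, 0], [0, 0, 4]], [[0, 3, 3], [4, 0, 0]]),
    ]

    def apply_rule(grid):
        # The rule a solver must discover from the examples alone.
        return [row[::-1] for row in grid]

    # Verify the inferred rule against the demonstrations...
    assert all(apply_rule(inp) == out for inp, out in train_pairs)
    # ...then apply it to the held-out test input.
    print(apply_rule([[5, 0, 0], [0, 6, 0]]))  # [[0, 0, 5], [0, 6, 0]]

The point is that nothing in pretraining hands the model this specific rule; it has to be induced on the spot, which is exactly what most chat-style usage never demands.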
If you mean that they're benchmaxxing these models, then that's disappointing. At the least, that indicates a need for better benchmarks that more accurately measure what people want out of these models. Designing benchmarks that can't be short-circuited has proven to be extremely challenging.
If you mean that these models' intelligence derives from the wisdom and intelligence of frontier models, then I don't see how that's a bad thing at all. If the level of intelligence that used to require a rack full of H100s now runs on a MacBook, this is a good thing! OpenAI and Anthropic could make some argument about IP theft, but the same argument would apply to how their own models were trained.
Running the equivalent of Sonnet 4.5 on your desktop is something to be very excited about.
> If you mean that they're benchmaxxing these models, then that's disappointing
Benchmaxxing is the norm in open weight models. It has been like this for a year or more.
I’ve tried multiple models that are supposedly Sonnet 4.5 level and none of them come close when you start doing serious work. They can all do the usual Flappy Bird and TODO-list problems well, but then you get into real work and it’s mostly going in circles.
Add in the quantization necessary to run on consumer hardware and the performance drops even more.
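For a rough sense of where that extra drop comes from, here's a minimal sketch of symmetric absmax round-to-nearest quantization (real runtimes use fancier block-wise variants, e.g. llama.cpp's k-quants, but the error mechanism is the same):

    import numpy as np

    def quantize(w, bits):
        # Symmetric absmax quantization: map floats onto a signed int grid.
        qmax = 2 ** (bits - 1) - 1      # 127 for int8, 7 for int4
        scale = np.abs(w).max() / qmax
        return np.round(w / scale).astype(np.int8), scale

    rng = np.random.default_rng(0)
    w = rng.normal(size=100_000).astype(np.float32)  # stand-in weight tensor

    for bits in (8, 4):
        q, scale = quantize(w, bits)
        print(f"int{bits}: mean abs error {np.abs(w - q * scale).mean():.4f}")

Dropping from 8 to 4 bits makes the rounding step roughly 18x coarser here (127/7), and that per-weight noise is what you feel as the extra quality loss on consumer-hardware quants.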
Anyone who has spent any appreciable amount of time playing an online game with players in China, or dealt with Amazon review shenanigans, is well aware that China doesn't culturally view cheating to get ahead the same way the West does.