>like distillation attacks. I don't blame them; it's the obvious only strategy when you can't compete in compute
>distillation attacks are the only vector to keep up
That's demonstrably wrong. They invest in architectural improvements as well, for example DeepSeek's compressed attention. When you lack compute, you need fast training and fast inference, and distillation alone doesn't solve that. From what I understand, that kind of distillation "attack" (28 million exchanges) only slightly improves instruction tuning and reasoning traces. If the base model is crap, distilling Claude on a few million exchanges won't magically make your model as good as Chinese models currently are (or magically make inference faster on the limited hardware they have). Training the base model still needs a proper training run. Serving users at scale needs optimized architectures as well, especially with test-time compute and ever-growing context lengths. That's where architectural innovations are happening in Chinese labs when it comes to compute.
I explicitly called out the fact that there is plenty of innovation, but that we see lots of innovation in both Chinese and U.S. labs, and I don't think that there is a comparative difference there.