Can't each of these companies with IDE integrations slurp up the network traffic and distill Anthropic's models?
If you can listen to billions of tokens a day, you can basically capture all the magic.
The terms of service specifically prohibit this.
How much of the training set comes from websites with "no automated scraping" in their terms?
The companies stole that data from the world, so I don't see why we couldn't take it back.
It's a nice sentiment. The companies with the integrations are the ones that could take it back, but they don't have the incentive to break legal agreements and share with the world.
Meanwhile, the creative output of humanity is distilled into black boxes that benefit whoever can scrape the most and burn the most power. The impact is spread across everyone, so again there's little incentive among those who could create (likely legal) change.
That is not how training works…
That's how model distillation works.
DeepSeek is the most notable case, but distillation has been used a lot.
And the foundation model companies are scraping and exfiltrating each other's data.