Hacker News

echelon 4 days ago

Can't each of these companies with IDE integrations slurp up the network traffic and distill Anthropic's models?

If you can listen to billions of tokens a day, you can basically capture all the magic.

jasonjmcghee 4 days ago

ceejayoz 4 days ago

How much of the training set comes from websites with "no automated scraping" in their terms?

echelon 4 days ago

The companies stole that data from the world, so I don't see why we couldn't take it back.

TimeBearingDown 4 days ago

It's a nice sentiment. The companies with the integrations are the ones that could take it back, but they don't have the incentive to break legal agreements and share with the world.

Meanwhile the creative output of humanity is distilled into black boxes to benefit those who can scrape it the most and burn the most power, but this impact is distributed amongst everyone, so again there's little incentive among those who could create (likely legal) change.

adastra22 4 days ago

That is not how training works…

echelon 3 days ago

That's how model distillation works.

DeepSeek is the most notable case, but distillation has been used a lot.

And the foundation model companies are scraping and exfiltrating each other's data.