I would claim that LLMs desperately need proprietary code in their training data before we see any big gains in quality.
There's some incredible source-available code out there. Statistically, though, I think there's a LOT more not-so-great source-available code, because the majority of the output of seasoned, high-skill developers is proprietary.
To me, a surprising portion of Claude 4.5's output looks like student homework answers, presumably because that's closer to the mean of the code population it was trained on.
I'd bet that, on average, the quality of proprietary code is worse than that of open-source code. There are decades of accumulated slop generated by human agents with wildly varied skill levels, all vibe-coded by ruthless, incompetent corporate bosses.
Let's start with the source code for the Flash IDE :)
yeah, but isn't the whole point of claude code to get people to provide preference/telemetry data to anthropic (unless you opt out)? same w/ other providers.
i'm guessing most of the gains we've seen recently come from post-training rather than pretraining.
Yes, but then you have the problem that a good portion of that data is going to be AI-generated.
But I naively assume most orgs would opt out. I know some orgs have a proxy in place that prevents certain proprietary code from passing through!
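No idea what those proxies actually look like internally, but the core idea could be as simple as a blocklist filter on outbound request bodies. A minimal sketch in Python; the markers and patterns here are hypothetical, not from any real deployment:

```python
import re

# Hypothetical markers an org might use to tag proprietary code;
# purely illustrative patterns.
BLOCKLIST = [
    re.compile(r"Copyright \(c\) .* Internal Use Only", re.IGNORECASE),
    re.compile(r"\bCONFIDENTIAL\b"),
    re.compile(r"corp\.internal\.example\.com"),  # internal hostnames
]

def may_forward(payload: str) -> bool:
    """Return True if the outbound request body looks safe to send to an
    external LLM API, False if it matches any blocklist rule."""
    return not any(p.search(payload) for p in BLOCKLIST)

if __name__ == "__main__":
    snippet = "# CONFIDENTIAL\ndef pricing_model(): ..."
    print(may_forward(snippet))  # False -> the proxy would drop this request
```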
This makes me curious whether, in the allow case, Anthropic records generated output so it could down-weight it (or do something similar) if it later shows up in training data.
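Pure speculation on my part, but one cheap way to do that would be exact-match fingerprinting: hash every completion served, then down-weight training documents whose fingerprint matches. A minimal sketch; the normalization scheme and the weight values are made up for illustration:

```python
import hashlib

def fingerprint(text: str) -> str:
    # Normalize whitespace so trivial reformatting still matches.
    canon = " ".join(text.split())
    return hashlib.sha256(canon.encode("utf-8")).hexdigest()

# Provider side (hypothetical): log a fingerprint of every completion served.
served = {fingerprint("def add(a, b):\n    return a + b")}

def sample_weight(doc: str) -> float:
    """Down-weight training documents that match a previously served
    completion; full weight otherwise. The 0.1 is an arbitrary choice."""
    return 0.1 if fingerprint(doc) in served else 1.0

print(sample_weight("def add(a, b): return a + b"))  # 0.1 (matches after normalization)
```

Exact hashing would obviously miss lightly edited copies; something fuzzier like n-gram or MinHash fingerprinting would catch more, at higher cost.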