Hacker News

The best thing about this is that AI bots will read, train on and digest the million "how to write with AI" posts that are being written right now by some of the smartest coders in the world and the next gen AI will incorporate all of this, making them ironically unnecessary.

kimixa 2 months ago

None of this is new, it was pretty much all "best practice" for decades and so already in the training data for the first generation.

If the issue is SNR and the ratio of "good" vs "bad" practices in the input training corpus, I don't know if that's getting better.

klysm 2 months ago

They will also be reading all of the slop generated by the current and previous generations of LLMs

coldtea 2 months ago

The more generations of AI-produced crap AI consumes as training data, the worse it gets. This has been mathematically proven.
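A toy illustration of the effect claimed above, not a proof: assume each "generation" is nothing more than a Gaussian fitted to a small sample drawn from the previous generation's fit. Under that assumption the fitted spread tends to shrink toward zero as the recursion continues. The function name `simulate_generations` and all of its parameters are made up for this sketch, and real training pipelines mix in fresh and filtered data, which the sketch deliberately leaves out.

```python
import random
import statistics


def simulate_generations(generations=200, sample_size=10, seed=0):
    """Refit a Gaussian to its own samples, generation after generation.

    Returns the list of fitted standard deviations, starting with the
    original "human" distribution (mean 0, standard deviation 1).
    """
    rng = random.Random(seed)
    mean, stdev = 0.0, 1.0
    history = [stdev]
    for _ in range(generations):
        # Each generation sees only samples produced by the previous one.
        samples = [rng.gauss(mean, stdev) for _ in range(sample_size)]
        mean = statistics.fmean(samples)
        stdev = statistics.pstdev(samples)  # maximum-likelihood estimate
        history.append(stdev)
    return history


history = simulate_generations()
print(f"initial stdev: {history[0]:.3f}, final stdev: {history[-1]:.3g}")
```

The small `sample_size` is what drives the loss of diversity here; with larger samples or a steady supply of original data the shrinkage is much slower.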

jatora 2 months ago

Strange since, in practice, coding models have steadily improved without any backward movement every 3-4 months for 2 years now. It's as if there are rigorous methods of filtering and curation applied when building your training data.

coldtea 2 months ago

>Strange since, in practice, coding models have steadily improved without any backward movement every 3-4 months for 2 years now. It's as if there are rigorous methods of filtering and curation applied when building your training data.

It's as if what I wrote implies "all other things being equal", just like any technical claim.

All other things were not equal: the architectures were tweaked, the human data set is still not exhausted, and more money and energy were thrown at their performance since it's a pre-IPO game with huge VC stakes.

We've already seen a plateau nonetheless, compared to the earlier release-over-release performance improvements. Even the "without any backward movement every 3-4 months for 2 years now" is hardly defensible. Many saw a backward movement with GPT 4.1 vs 4.0, and similar issues with 4.5, for example. Even if those are isolated, they're hardly the 2 to 3.5 to 4.0 gains.

And no, there are absolutely no "rigorous methods of filtering and curation" that can separate the avalanche of AI slop from useful human output, at least not without diminishing the possible training data. The problem after all is not just to tell AI from human with automated curation (that's already impossible); the problem is to have enough valuable new human output, which becomes nearly a losing game as all aspects of "human" domains previously useful as training input (from code to papers) are tarnished by AI output.

jatora 2 months ago

1. No, you don't get to fall back on the technical claim approach. The bias in your phrasing was clear. Maybe that works for you but I won't just ignore obvious subtext and let you weasel out of this. And that's for the benefit of other readers, not you.

2. A plateau in coding performance? I don't think you even use these models for coding then if you make that claim. It is very clear models have continually improved. You can trust benchmarks to make that clear, or real world use, or better yet: both. You seem to not have the data from either.

3. No rigorous methods of filtering and curation that can separate AI slop from useful human output? Here you go:

a. Curation already works at scale. Modern training pipelines don’t rely on “AI vs human” detection. They filter by utility signals: correctness, novelty, coherence, task success, citation integrity, and cross-source consistency. These measurable properties do correlate with downstream model performance. Models trained on smaller, higher-quality corpora consistently outperform those trained on larger, noisier ones.

b. Human-generated “valuable” data is not shrinking. The claim assumes a fixed pool. In reality, high-value human data is expanding in areas that matter most: expert-labeled datasets, preference comparisons, multimodal demonstrations, tool-use traces, verified code with tests, and domain-expert feedback. These are explicitly created for training and are not polluted by passive AI spam.

c. Synthetic data is not a dead end—when constrained. Empirically, filtered and goal-conditioned synthetic data (self-play, distillation, adversarial generation) improves reasoning, math, coding, and tool use. The failure mode is unfiltered synthetic recursion—not synthetic data per se. This distinction is already operationalized in production systems.

d. Training value ≠ raw text volume. Scaling laws shifted: performance now tracks effective compute × data quality, not sheer token count. A smaller dataset with higher signal density produces better generalization than a massive, contaminated corpus. This is observed repeatedly in ablation studies.
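Point (a) and the "verified code with tests" mentioned in (b) can be pictured with a small sketch: candidate code samples are kept or dropped by whether they pass a test suite, with no attempt to guess who or what wrote them. This is a hypothetical illustration, not a description of any lab's pipeline; `passes_tests`, `filter_corpus`, the `solve` naming convention, and the sample corpus are all invented for the example.

```python
def passes_tests(source, tests):
    """Return True if `source` defines `solve` and it passes every test case."""
    namespace = {}
    try:
        # exec is acceptable only for this trusted toy input;
        # a real pipeline would run candidates in a sandbox.
        exec(source, namespace)
        solve = namespace["solve"]
        return all(solve(*args) == expected for args, expected in tests)
    except Exception:
        # Syntax errors, missing `solve`, and runtime failures all count as a fail.
        return False


def filter_corpus(candidates, tests):
    """Keep samples by measured correctness, ignoring their origin."""
    return [c for c in candidates if passes_tests(c["source"], tests)]


candidates = [
    {"origin": "human", "source": "def solve(a, b):\n    return a + b\n"},
    {"origin": "model", "source": "def solve(a, b):\n    return sum((a, b))\n"},
    {"origin": "human", "source": "def solve(a, b):\n    return a - b\n"},
    {"origin": "model", "source": "def solve(a, b) return a + b\n"},  # syntax error
]
tests = [((1, 2), 3), ((0, 0), 0), ((-1, 5), 4)]

kept = filter_corpus(candidates, tests)
print([c["origin"] for c in kept])
```

In this toy corpus one human-written and one model-written sample survive, and one of each is rejected, which is the sense in which a utility signal sidesteps "AI vs human" detection.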

----

Again, the above is not for you, as I believe you don't see beyond your cope (yet). It's for other readers who are intellectually curious.

chrisjj 2 months ago

> AI bots will read, train on and digest the million "how to write with AI" posts that are being written right now

Yes!

> by some of the smartest coders in the world

Hmm... How will it filter out those by the dumbest coders in the world?

Including those by parrots?

lz400 2 months ago

>Hmm... How will it filter out those by the dumbest coders in the world?

If you know, and I know, and the guys at OpenAI and Anthropic know... not a big leap that the models will know too? Many datasets are curated and labeled by humans.

chrisjj 2 months ago

> if you know, and I know,

We don't know.

> and the guys at openai and anthropic know... not a big leap that the models will know too?

The models don't "know" anything. They just regurgitate what they are fed.

"Child abuse images found in AI training data"

https://www.axios.com/2023/12/20/ai-training-data-child-abus...

> many datasets are curated and labeled by humans

Including these ones: "AI industry insiders launch site to poison the data that feeds them"

https://www.theregister.com/2026/01/11/industry_insiders_see...

chrisjj 2 months ago

> having a curated dataset of the works and posts of the top 200 coders in the world

I can't imagine many of the top 200 coders in the world giving their work to the parrots.

But show me the list of the top 200 coders in the world, and I might change my mind! :)

lz400 2 months ago

The top 200 that work at least partially in public. A good example is Mitchell Hashimoto. He works in open source, uses AI a lot and writes about it. Next-gen AI will learn from the lessons people like him share.

chrisjj 2 months ago

> uses AI a lot

https://en.wikipedia.org/wiki/Model_collapse

lz400 2 months ago

I mean, building a curated dataset of the works and posts of the top 200 coders in the world (at least the public ones) is not very difficult. I’m sure articles like the one in OP will be very easy to mark as “high value training data”. I think you’re letting your bias blind you.