Hacker News

They could do that years ago, it's just that nobody seems to do it. Just hook it up to curated semantic knowledge bases.

Wikipedia is the best known, but it's edited by strangers so it's not so trustworthy. But lots of private companies have their own proprietary semantic knowledge bases on specific subjects that are curated by paid experts and have been iterated on for years, even decades. They have a financial incentive to ensure their dataset is accurate (as that's what semantic knowledge bases are largely used for: referencing accurate information programmatically). So they are a lot more trustworthy than "I found a Reddit post that says..."

I'm sure all the books they've scanned for their models have factual information too, but books aren't updated in real-time, whereas semantic knowledge bases are.

justinator 4 days ago [ - ]

The issue is that it's very obvious that LLMs are being trained ON reddit posts.

mrweasel 3 days ago [ - ]

That's really the issue isn't it. Many of the LLMs are trained uncritically on very thing. All data is viewed as viable training data, but it's not. Reddit clearly have good data, but it's probably mostly garbage.