The comprehensive sources of good content, such as Wikipedia, and major news outlets, do seem to go into LLMs and come out the other side.
It's funny you mention Wikipedia. I wonder if (at least since the early days when Wikipedia was the big scary thing on the block) anyone has run the same sorts of searches for plagiarized material and "hallucinations" against Wikipedia. After all, Wikipedia explicitly forbids "original" research, which means all of the output of Wikipedia is by definition a regurgitation of someone else's work. Yes, you're supposed to cite everything, but between the number of things that are [Citation Needed] and the number of cites that don't seem to actually go to anything, there's almost certainly a good amount of "hallucinations" in there too (see also effectively the entire Scots Wikipedia: https://www.theregister.com/2020/08/26/scots_wikipedia_fake/). And that doesn't get into whether the sources behind the cited facts gave the editors permission to use their material in the first place. Of course, I would argue Wikipedia is sufficiently transformative (and facts aren't subject to copyright anyway) and is overall a net good despite its problems. But I would argue the same of the various LLMs and their outputs too.