It's funny: publishing work offline in books and magazines is perhaps more anonymous in the age of AI.

I pasted in a number of passages from books on my bookshelf. Predictably, stuff that I read for my English degree in university is largely in the training data and easily identifiable. Stuff from regional authors or is slightly adjacent to the cultural mainstream makes no impression.

To clarify, because a number of posts here sort of suggest the confusion:

the article here isn't about the LLM recognizing works that were in the training data. EG, The Old Man and the Sea off the shelf. It's about pegging the author of novel texts, like, say, some letter written by Hemmingway that gets discovered next week and was never before digitized.

It is for now.

But I'm sure the scanning operations will start scouring the earth even harder for any books unaffected by slop containing niche knowledge and text in order for their models to have an edge over the ones trained only on pirate collections and the Internet.

I wonder if secondhand bookshops and deceased estates are seeing bulk buyers of their stock suddenly appearing. Maybe broke governments/municipalities will start selling them entire libraries and archives to ingest.