Does Anna's Archive or a similar site host, say, the complete New York Times (pre-1930) as a full PDF download set? And every other newspaper too?
Tons of public domain sources are locked into websites like Newspapers.com or the nearly dead, and now completely unsearchable, old Google News Archive / Newspapers.
It would be nice if the massive pursuit of AI training data resulted in some fully-legal open source alternatives to these proprietary, outdated, or abandoned sites. I know some of it is available via the Internet Archive, etc., but something new with an AI-powered search and finding aid sounds so useful.
> complete New York Times (pre-1930)
https://archive.org/search?query=title%3ANew+York+Times&sort...
> as a full PDF download set
I imagine it's possible to achieve this through torrents from Anna's, but you'd have to search and compile the list of all individual PDFs.
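For the Internet Archive route linked above, compiling the list can be scripted. A rough sketch, assuming the official "internetarchive" Python package is installed and that the query string and destination folder below are placeholders rather than a vetted collection name:

    from internetarchive import search_items, download

    # Hypothetical query; real NYT scans may sit under different titles/collections.
    query = 'title:("New York Times") AND mediatype:texts'
    identifiers = [r["identifier"] for r in search_items(query)]
    print(f"Found {len(identifiers)} items")

    for ident in identifiers:
        # Grab only the PDF derivatives for each matching item.
        download(ident, glob_pattern="*.pdf", destdir="nyt_pdfs", verbose=True)

You'd still have to dedupe and sanity-check what comes back, but it beats clicking through issue by issue.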
> something new with an AI-powered search
With enough time and willingness, someone could run all the old NYT issues through optical character recognition to convert them to text, then make that text available to large language models for some kind of semantic search. Ideally, public cultural funds could support the effort as academic research.
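The pipeline isn't exotic. A minimal sketch, assuming pytesseract, pdf2image, and sentence-transformers are installed, and that "issue_1923_05_01.pdf" is a hypothetical scanned issue:

    from pdf2image import convert_from_path
    import pytesseract
    from sentence_transformers import SentenceTransformer, util

    # 1. OCR: render each page of the scanned PDF and extract text.
    pages = convert_from_path("issue_1923_05_01.pdf", dpi=300)
    page_texts = [pytesseract.image_to_string(img) for img in pages]

    # 2. Embed: turn each page (ideally each segmented article) into a vector.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(page_texts, convert_to_tensor=True)

    # 3. Semantic search: rank pages by similarity to a natural-language query.
    query = "coverage of transatlantic flight attempts"
    query_vec = model.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(query_vec, embeddings)[0]
    for idx in scores.argsort(descending=True)[:3]:
        print(f"page {int(idx) + 1}: score {float(scores[idx]):.3f}")

The hard parts are scale (decades of daily issues) and segmenting pages into individual articles, not the search itself.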
It just feels like the complete public domain New York Times should be a big deal. Why is it only available as individual issues on the Internet Archive? Why hasn't every single story been cut out individually and fully OCR'd, so that it shows up as a top hit on Google? And do that for every public domain newspaper around the country, too.