This already exists: https://eurollm.io/

How do people not know about it and keep making stuff from scratch?

I did not know about EuroLLM. I had a look at the paper describing it (https://arxiv.org/abs/2602.05879):

Specifically, we discard documents shorter than 200 characters (Xue et al., 2021a), and any page containing the phrase “lorem ipsum,” the word “javascript,” or curly brackets (Raffel et al., 2023)....

It is quite surprising (and a bit funny) to see every page that so much as mentions "javascript" thrown out.
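
To make it concrete, here is a rough sketch of what the quoted rules would amount to. This is only my reading of that one sentence, not the paper's actual pipeline: the function name keep_document is made up, and I am assuming case-insensitive substring matching and that "curly brackets" means any "{" or "}" character.

```python
MIN_CHARS = 200  # length threshold taken from the quoted passage


def keep_document(text: str) -> bool:
    """Return True if a page would survive the quoted heuristics.

    Hypothetical helper: the matching details below are assumptions,
    not something the quoted passage spells out.
    """
    # Discard documents shorter than 200 characters.
    if len(text) < MIN_CHARS:
        return False

    lowered = text.lower()  # assumption: matching ignores case

    # Discard placeholder pages.
    if "lorem ipsum" in lowered:
        return False

    # Discard any page mentioning "javascript" (assumption: plain substring match).
    if "javascript" in lowered:
        return False

    # Discard any page containing a curly bracket.
    if "{" in text or "}" in text:
        return False

    return True
```

Under these assumptions, a long and perfectly good tutorial about JavaScript, or any page with a single code snippet using braces, gets dropped along with the "please enable javascript" junk the rule is presumably aimed at.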