I did not know about EuroLLM. I had a look at the paper describing it (https://arxiv.org/abs/2602.05879):
Specifically, we discard documents shorter than 200 characters (Xue et al., 2021a), and any page containing the phrase “lorem ipsum,” the word “javascript,” or curly brackets (Raffel et al., 2023)....
It is quite surprising (and a bit funny) that every page containing the word “javascript” gets removed.
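
For concreteness, the heuristics in that quote could be sketched roughly as follows. This is only my reading of the quoted sentence, not the EuroLLM authors' code: the function name `keep_document`, the case-insensitive matching, and applying every rule to the whole document are assumptions.

```python
def keep_document(text: str, min_chars: int = 200) -> bool:
    """Return True if a document survives the heuristics in the quote.

    Hypothetical sketch: the 200-character threshold comes from the quoted
    sentence; case-insensitive matching and whole-document granularity
    are assumptions.
    """
    # Rule 1: discard documents shorter than 200 characters.
    if len(text) < min_chars:
        return False

    lowered = text.lower()

    # Rule 2: discard pages containing the placeholder phrase "lorem ipsum".
    if "lorem ipsum" in lowered:
        return False

    # Rule 3: discard pages containing the word "javascript".
    # A plain substring test, so any article that merely mentions
    # JavaScript is dropped as well.
    if "javascript" in lowered:
        return False

    # Rule 4: discard pages containing curly brackets.
    if "{" in text or "}" in text:
        return False

    return True
```

With a filter like this, a perfectly ordinary tutorial that mentions JavaScript once is dropped along with the "please enable JavaScript" error pages the rule was presumably aimed at.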