> These samples have very good scores overall, but they are useless. I am guessing it's not English text... I counted a few hundred examples mostly from LOC-PD and other few hundred in the OTA datasets. Imagine if I feed that crap to my LLM, what will it learn?
I'm pretty sure it's real text in Welsh. There might be typos from OCR, but yeah, that's what the language really looks like. I don't speak it, but it's easy to recognize.
Yeah, that seems like an important distinction
It looks like ROT13 text to me, I hope it's not Welsh. Don't want to offend anyone if that's their actual language :)
It's actually Welsh, and the funny thing is that one of the sentences in the example "gibberish" text (although with some further OCR errors) means:
"It will be easy for the knowledgeable to fix the few errors that remain [in the text]." ("Bydd yn rwydd iawn i'r cyfarwydd ddiwygio'r ychydig.")
Which is exactly what the OP is doing.