I've had a suspicion for a while that, since such a large portion of the Internet is in English and Chinese, other languages get a much larger share of their training material from books.

I wouldn't be surprised if Arabic in particular had this issue, and if it also had a disproportionate amount of religious text as source material.

I bet you'd see something similar with Hebrew.

I think therein lies another fun benchmark to show that LLMs don't generalize: ask the LLM to solve the same logic riddle in different languages (see the sketch below). If it can solve the riddle in some languages but not in others, that's a strong argument for straightforward memorization and next-token prediction rather than true generalization.
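
A toy sketch of what that benchmark could look like, in Python. Everything here is hypothetical: query_model is a stand-in for whatever LLM API you'd use (stubbed so the script runs), the non-English translations are elided placeholders you'd prepare by hand, and real grading would need per-language answer matching rather than a single English string:

    import random

    def query_model(prompt: str) -> str:
        # Hypothetical stand-in for a real LLM call; swap in your provider's API.
        return random.choice(["the goat", "the wolf"])

    # The same riddle, hand-translated into each target language
    # (translations elided here; they'd need human preparation and review).
    riddle_by_language = {
        "en": "A farmer must cross a river with a wolf, a goat, and a cabbage. "
              "The boat holds only one item at a time. What does he take first?",
        "ar": "<Arabic translation>",
        "he": "<Hebrew translation>",
        "zh": "<Chinese translation>",
    }

    expected = "the goat"  # in practice, normalize the expected answer per language

    pass_rate = {}
    for lang, riddle in riddle_by_language.items():
        # Sample repeatedly to smooth over decoding randomness.
        trials = [query_model(riddle) for _ in range(20)]
        pass_rate[lang] = sum(expected in t.lower() for t in trials) / len(trials)

    # A reliable pass rate in English but failures in, say, Arabic or Hebrew
    # would point toward memorization of the (mostly English) source text
    # rather than language-independent reasoning.
    print(pass_rate)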

I would expect that the "classics" have all been thoroughly discussed on the Internet in all major languages by now. But if you could retrain a model from scratch and control its input, there are probably many theories you could test about the model's ability to connect bits of insight together, e.g., include a fact in only one language's corpus and probe whether the model can surface it in another.
