Isn't it surprising that there were enough pre-1930 tokens to train an intelligent model? I was always under the impression that a large number of tokens is necessary to force the model to grok things and compress its learning into a somewhat intelligent model of the world, so to speak. But perhaps I'm underestimating how much digitized literature exists from that era.

one of my greatest hopes for the advancement of LLM technology is a great reduction in the amount of data needed for training. imagine a SOTA model trained exclusively on good prose, ah.