It wouldn't take years, let alone decades, of compute to train a language model that doesn't tokenize text first. It's not an 'unproven hypothesis'; it's already been done. It's just a good deal more cost-effective to tokenize, so those exercises haven't been much more than research novelties.
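To put a rough number on the cost argument: a byte-level model attends over one position per byte instead of one per token, and self-attention cost grows roughly quadratically with sequence length. Here's a minimal sketch (assuming the tiktoken package is installed; the encoding name and sample text are just illustrative) comparing the two sequence lengths for the same string:

```python
# Sketch: compare byte-level vs BPE-tokenized sequence lengths to illustrate
# why tokenizing is more cost-effective. Assumes `tiktoken` is installed;
# the encoding name and sample text are arbitrary illustration choices.
import tiktoken

text = (
    "Tokenization trades a fixed preprocessing step for much shorter "
    "sequences, which is where most of the training cost lives."
)

enc = tiktoken.get_encoding("cl100k_base")  # a common BPE vocabulary
num_tokens = len(enc.encode(text))
num_bytes = len(text.encode("utf-8"))       # a byte-level model sees one position per byte

ratio = num_bytes / num_tokens
print(f"bytes: {num_bytes}, BPE tokens: {num_tokens}, length ratio: {ratio:.1f}x")

# Attention cost scales roughly with the square of sequence length, so the
# same text costs on the order of ratio**2 more per attention layer when
# fed in byte by byte.
print(f"approx. attention-cost multiplier: {ratio**2:.0f}x")
```

English text typically works out to around 4 bytes per BPE token, so the byte-level version of the same context is roughly 4x longer and the attention layers are on the order of 16x more expensive; that gap is the cost-effectiveness point above.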
It didn't sound like tokenization is the only preprocessing step, but even with that, how "costly" would it be for a model comparable to GPT-4 or GPT-5?
Also, the comment was not limited to LLMs.
Note that the goal is comparable performance; in other words, to compare like for like.