If you understand "cost-effective" to mean the same thing as "feasible with today's tech", maybe. As in: if we fed it all the raw data, we'd need more powerful, more expensive hardware, and any training on the raw data set would take years or decades to complete.

But until it's actually been done, it's an unproven hypothesis at best.

It wouldn't take years or decades of compute to train a language model that doesn't tokenize text first. It's not an 'unproven hypothesis', because it's already been done: byte- and character-level models have been trained. It's just a good deal more cost-effective to tokenize, so those exercises haven't amounted to anything more than research novelties.
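
To put rough numbers on the cost argument, here's a small sketch, assuming the tiktoken package is installed; the example string and the cl100k_base encoding are just illustrative choices, not anything from the discussion above. It compares how many positions a tokenized model versus a byte-level model has to attend over for the same text:

    # Compare sequence lengths for the same text with and without tokenization.
    # Assumes `pip install tiktoken`; cl100k_base is the BPE vocabulary used by
    # GPT-4-class models.
    import tiktoken

    text = "Byte-level models work; they just pay for much longer sequences."

    enc = tiktoken.get_encoding("cl100k_base")
    n_tokens = len(enc.encode(text))       # positions a tokenized model processes
    n_bytes = len(text.encode("utf-8"))    # positions a byte-level model processes

    print(f"tokens: {n_tokens}, bytes: {n_bytes}, ratio: {n_bytes / n_tokens:.1f}x")

For ordinary English text the byte sequence comes out roughly 4x longer, and since the attention part of a transformer scales roughly with the square of sequence length, that's on the order of 16x more attention compute per layer. The model still trains, it just costs more for the same result, which is the whole point about cost-effectiveness.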

It didn't sound like tokenization is the only preprocessing step, but even considering just that one, how "costly" would skipping it be for a model comparable to GPT-4 or GPT-5?

Also, the comment wasn't only about LLMs.

Note that the goal is to get comparable performance, in other words to compare like for like.