It wouldn't take years, let alone decades, of compute to train a language model that doesn't tokenize text first. It's not an 'unproven hypothesis'; it's already been done. It's just a good deal more cost-effective to tokenize, so those exercises haven't been much more than research novelties.
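To put a rough number on the cost argument: a byte-level model attends over one position per byte instead of one per token, and self-attention cost grows roughly quadratically with sequence length. Here's a minimal sketch (assuming the tiktoken package is installed; the encoding name and sample text are just illustrative) comparing the two sequence lengths for the same string:

```python
# Sketch: compare byte-level vs BPE-tokenized sequence lengths to illustrate
# why tokenizing is more cost-effective. Assumes `tiktoken` is installed;
# the encoding name and sample text are arbitrary illustration choices.
import tiktoken

text = (
    "Tokenization trades a fixed preprocessing step for much shorter "
    "sequences, which is where most of the training cost lives."
)

enc = tiktoken.get_encoding("cl100k_base")  # a common BPE vocabulary
num_tokens = len(enc.encode(text))
num_bytes = len(text.encode("utf-8"))       # a byte-level model sees one position per byte

ratio = num_bytes / num_tokens
print(f"bytes: {num_bytes}, BPE tokens: {num_tokens}, length ratio: {ratio:.1f}x")

# Attention cost scales roughly with the square of sequence length, so the
# same text costs on the order of ratio**2 more per attention layer when
# fed in byte by byte.
print(f"approx. attention-cost multiplier: {ratio**2:.0f}x")
```

English text typically works out to around 4 bytes per BPE token, so the byte-level version of the same context is roughly 4x longer and the attention layers are on the order of 16x more expensive; that gap is the cost-effectiveness point above.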
It didn't sound like tokenization is the only preprocessing step, but even with that, how "costly" would it be for a model comparable to GPT-4 or GPT-5?
Also, the comment was not limited to LLMs.
Note that the goal is comparable performance; in other words, to compare like for like.