I don't follow the main argument (possibly due to my own limitations).
> The centerpiece of my argument is that the brute-force learning approaches that everyone rightfully touts as great achievements relied on case-specific, very carefully engineered front-ends to extract the right data from the cacophony of raw signals that the real world presents.
In nearly every one of the preceding examples, isn't the argument really about the boundaries that define the learning machine? Just because data preparation / formatting / sampling / serialization is more cost-effective to do outside the learning machine doesn't mean that boundary is necessary. One could build all of this directly inside the boundary of the learning machine and feed it the raw, messy, real-world signals (see the sketch below).
Also, humans have plentiful learning aids doing "tokenization", as anyone who has helped a child learn to count has experienced firsthand.
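To make the boundary point concrete, here's a minimal PyTorch sketch. Everything in it (the `RawSignalModel` name, layer sizes, input shape) is hypothetical and purely illustrative: the first layers of the network stand in for the hand-engineered front-end, so "preprocessing" lives inside the learning machine and the model consumes raw samples directly.

```python
import torch
import torch.nn as nn

class RawSignalModel(nn.Module):
    """Illustrative only: the learned 'front-end' replaces external preprocessing."""

    def __init__(self, n_classes: int = 10):
        super().__init__()
        # Learned front-end: strided 1-D convolutions over raw samples,
        # playing the role of external filtering / feature extraction.
        self.front_end = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=9, stride=4), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=9, stride=4), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(64, n_classes)
        )

    def forward(self, raw: torch.Tensor) -> torch.Tensor:
        # raw: (batch, 1, n_samples) -- unprocessed signal, no external prep
        return self.head(self.front_end(raw))

model = RawSignalModel()
logits = model(torch.randn(8, 1, 16000))  # e.g. one second of 16 kHz audio
print(logits.shape)  # torch.Size([8, 10])
```

Whether training such a model on raw signals is *cost-effective* is exactly the question, but the boundary itself is a design choice, not a necessity.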
If you understand "cost-effective" to mean the same thing as "feasible with today's tech", maybe. As in: if we fed it all the raw data, we'd need more powerful, more expensive hardware, and training on the raw data set would take years or decades to complete.
But until that has actually been done, it's an unproven hypothesis at best.
It wouldn't take years or decades of compute to train a language model that doesn't tokenize text first. It's not an 'unproven hypothesis', because it has already been done. It's just a good deal more cost-effective to tokenize, so those exercises remain little more than a research novelty.
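For what it's worth, byte-level models such as ByT5 have been trained end to end on raw UTF-8 bytes; the catch is sequence length. Here's a rough Python sketch of the cost gap (tiktoken with the cl100k_base vocabulary is just one convenient BPE example; the exact ratio depends on the text):

```python
# A byte-level model sees several times more positions than a subword model
# for the same English text, and self-attention cost grows roughly
# quadratically with sequence length. Requires `pip install tiktoken`.
import tiktoken

text = "The centerpiece of my argument is that brute-force learning works."

n_bytes = len(text.encode("utf-8"))          # positions a byte-level model sees
enc = tiktoken.get_encoding("cl100k_base")   # one common BPE vocabulary
n_tokens = len(enc.encode(text))             # positions a subword model sees

ratio = n_bytes / n_tokens
print(f"bytes: {n_bytes}, BPE tokens: {n_tokens}, ratio: {ratio:.1f}x")
# With O(n^2) attention, a ratio-times-longer sequence costs roughly
# ratio^2 as much per attention layer.
print(f"approx attention cost multiplier: {ratio**2:.0f}x")
```

For typical English text the ratio comes out around 4-5x, which is a big part of why tokenizing first is the cost-effective choice rather than a hard requirement.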
Tokenization didn't sound like the only preprocessing step, but even granting that, how "costly" would it be for a model comparable to GPT-4 or GPT-5?
Also, the comment was not about LLMs only.
Note that the goal is to get comparable performance, in other words to compare like for like.