I wouldn't use the word necessary.

IMO, we are probably talking about a 6x slow down (for typical english). You would need to be absolutely stupid not to implement some kind of optimisation along these lines.

Slower and maybe a little dumber; But it would work.

Not sure about “dumber” - it may be better than SOTA models at identifying which days of the week contain the letter “d”.

True, it would be better at some tasks.

My thinking is that for most tasks, a byte-orientated LLM still needs something like the wide "single activation per word" formatting that the tokeniser mostly provides. And it will likely waste its first and last few layers implementing a replacement tokeniser (and would probably do a much better job at it). It would also need to decode and encode unicode at the same time.

My estimate is that it might lose about 10% of its weights to these new tasks. Your 80B parameter model becomes as smart as a 72B parameter model - Measurably dumber, but not drastically so.