There are many approaches around this, the simplest being to treat raw bytes as tokens (cf. Google's ByT5[1]). Other byte-level models include BLT[2] from Meta and ByteFormer[3] from Apple.
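To make "bytes as tokens" concrete, here is a minimal sketch in Python. It is not the actual tokenizer of any of these models; the number of reserved special IDs (`NUM_SPECIAL`) and the `EOS_ID` value are assumptions chosen for illustration.

```python
# Byte-level "tokenizer" sketch: the vocabulary is just the 256 byte values,
# shifted past a few reserved special-token IDs (hypothetical layout).
NUM_SPECIAL = 3  # assumption: e.g. padding, end-of-sequence, unknown
EOS_ID = 1       # assumption: ID used to mark end of sequence


def encode(text: str) -> list[int]:
    """Map text to token IDs, one per UTF-8 byte, plus an end marker."""
    return [b + NUM_SPECIAL for b in text.encode("utf-8")] + [EOS_ID]


def decode(ids: list[int]) -> str:
    """Drop special IDs, undo the shift, and decode the bytes as UTF-8."""
    raw = bytes(i - NUM_SPECIAL for i in ids if i >= NUM_SPECIAL)
    return raw.decode("utf-8", errors="replace")
```

The upside is that no learned vocabulary is needed and any string round-trips. The cost is longer sequences: a non-ASCII character such as "é" becomes two tokens instead of one.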

[1]: https://arxiv.org/abs/2105.13626

[2]: https://arxiv.org/abs/2412.09871

[3]: https://arxiv.org/abs/2306.00238