Aren’t Unicode characters generally treated as 2 tokens to avoid a huge vocabulary?
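
To make the question concrete, here is a minimal sketch of what I have in mind. It assumes a byte-level tokenizer that falls back to one token per raw UTF-8 byte for any character without its own vocabulary entry. That fallback behavior is my assumption, not something checked against a specific model, and `utf8_byte_tokens` is a hypothetical helper, not part of any tokenizer library.

```python
def utf8_byte_tokens(text: str) -> list[list[int]]:
    """For each character, list the UTF-8 byte values it encodes to.

    Under the assumed byte-fallback scheme, each byte would become one token.
    """
    return [list(ch.encode("utf-8")) for ch in text]


# Sample characters written as escapes: ASCII, Latin-1 range, CJK, emoji.
samples = ["a", "\u00e9", "\u4e2d", "\U0001F600"]

for ch in samples:
    pieces = utf8_byte_tokens(ch)[0]
    print(f"U+{ord(ch):04X} -> {len(pieces)} byte token(s): {pieces}")
```

Under that assumption the count follows the UTF-8 length, which ranges from 1 to 4 bytes, so "2 tokens" would only hold for code points U+0080 through U+07FF. Tokenizers that merge frequent byte sequences can also end up with a single token for a common character.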