UTF-8 should be a universal tokenizer