We've been working on this problem for quite some time in my lab. We released a benchmark that brings together several "intrinsic evaluations" of tokenizers, none of which require model training, and we're currently investigating how performance on this benchmark correlates with performance on downstream tasks. Here it is: https://github.com/MeLeLBGU/tokenizers_intrinsic_benchmark - the repo links to the paper where we introduced it, which uses the benchmark to check how inference schemes interact with various token vocabularies. It's for English only, but both the work we cite and work citing us provide similar resources for other languages.
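
To give a feel for what a training-free "intrinsic evaluation" can look like (this is just an illustrative sketch, not the benchmark's own code - the actual metrics and corpora are defined in the repo), here's a minimal example computing subword fertility, i.e. average tokens per word, for a couple of vocabularies via the Hugging Face `transformers` API:

```python
# Illustrative sketch of one training-free intrinsic tokenizer metric:
# subword fertility = average number of tokens per whitespace word.
# The model names and toy corpus below are placeholders, not the
# benchmark's actual setup.
from transformers import AutoTokenizer

def fertility(tokenizer, texts):
    """Average number of tokens produced per whitespace-separated word."""
    n_tokens = 0
    n_words = 0
    for text in texts:
        n_tokens += len(tokenizer.tokenize(text))
        n_words += len(text.split())
    return n_tokens / n_words

if __name__ == "__main__":
    corpus = [
        "Tokenization quality varies across domains.",
        "Intrinsic metrics need no model training.",
    ]
    for name in ["bert-base-uncased", "gpt2"]:  # example vocabularies
        tok = AutoTokenizer.from_pretrained(name)
        print(f"{name}: fertility = {fertility(tok, corpus):.3f}")
```

Lower fertility roughly means the vocabulary compresses the text into fewer pieces; metrics like this can be computed on any corpus without training a model, which is what makes them cheap to run across many tokenizers.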