Apologies if this is a dumb question, but is there no "hello world"-ish sandbox for testing this theory? I can easily imagine that going head-to-head with R1 or the like would take a boatload of GPU, but for just testing tokenizers head-to-head, isn't there a smaller model that can be used in a bake-off?

We've been working on this problem for quite some time in my lab. We released a benchmark compiling several "intrinsic evaluations" that don't require model training, and we're currently investigating correlations between performance on this benchmark and on downstream tasks. Here it is - https://github.com/MeLeLBGU/tokenizers_intrinsic_benchmark - with a link to the paper that introduced it, where we used it to check how inference schemes work together with various token vocabs. It's for English, but the works we cite (and works citing us) provide similar benchmarks for other languages as well.

To expand on the other comment: if you look under the data folder in nanoGPT, you can see examples of how to prepare training data from various sources with various encoders. "shakespeare_char" is probably the most rudimentary, simply converting the characters of the input into integers.

e.g. https://github.com/karpathy/nanoGPT/blob/master/data/shakesp...
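The core of that prepare step is tiny. A minimal sketch of the same idea (character-to-integer mapping; not the exact nanoGPT script, which also pickles the vocab mappings to a meta file, and the file paths here are just placeholders):

```python
import numpy as np

# Load raw text (path is just an example).
with open("input.txt", "r", encoding="utf-8") as f:
    text = f.read()

# Character-level vocabulary: every unique character gets an integer id.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

def encode(s: str) -> list[int]:
    return [stoi[c] for c in s]

def decode(ids: list[int]) -> str:
    return "".join(itos[i] for i in ids)

# 90/10 train/val split, stored as flat uint16 arrays the way the
# nanoGPT-style prepare scripts do.
n = len(text)
train_ids = np.array(encode(text[: int(n * 0.9)]), dtype=np.uint16)
val_ids = np.array(encode(text[int(n * 0.9):]), dtype=np.uint16)
train_ids.tofile("train.bin")
val_ids.tofile("val.bin")
```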

You can use modded-nanogpt to test this sort of change. If you don't want to do anything really weird, you can just train whatever tokenizer you like on the training set, retokenize the input, and run the existing training code. One person did this earlier this year with a Tokenmonster vocab and got better downstream performance in less training time.
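If anyone wants to try it, the "swap the tokenizer, retokenize, retrain" workflow might look roughly like this. This sketch uses the Hugging Face tokenizers library as one example trainer (not what the Tokenmonster run used); the file names and vocab size are placeholders, and modded-nanogpt's expected data format may differ, so check its data pipeline before dropping the .bin files in:

```python
import numpy as np
from tokenizers import Tokenizer, models, pre_tokenizers, trainers, decoders

# 1. Train a byte-level BPE vocab on the training corpus only
#    (train.txt / val.txt and vocab_size=32768 are placeholders).
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()
trainer = trainers.BpeTrainer(vocab_size=32768, special_tokens=["<|endoftext|>"])
tokenizer.train(files=["train.txt"], trainer=trainer)
tokenizer.save("my_tokenizer.json")

# 2. Retokenize both splits with the new vocab and dump them as flat
#    uint16 arrays, the format nanoGPT-style training loops read.
for split in ["train", "val"]:
    with open(f"{split}.txt", "r", encoding="utf-8") as f:
        ids = tokenizer.encode(f.read()).ids
    np.array(ids, dtype=np.uint16).tofile(f"{split}.bin")

# 3. Point the existing training config at the new .bin files and train as
#    usual; nothing in the model code changes except vocab_size.
```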

Which exact attempt, with which Tokenmonster vocab, are you referring to? It's sometimes hard to conclude much from these efforts. For example, a smaller vocabulary is typically only useful for small models, where the compute cost of the softmax layer at the end of the decoder still factors into overall performance. Fixing the vocabulary size while scaling up the rest of the model makes this inefficiency disappear.
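A back-of-the-envelope illustration of that point (hypothetical configs, counting only parameter-multiply FLOPs and ignoring attention over the context):

```python
def output_layer_share(d_model: int, n_layer: int, vocab: int) -> float:
    """Fraction of per-token matmul FLOPs spent in the final vocab projection.

    Uses the usual ~12 * d_model^2 parameters per transformer block as a
    rough estimate; per-token FLOPs scale with parameter count.
    """
    blocks = 12 * n_layer * d_model**2
    unembed = d_model * vocab
    return unembed / (blocks + unembed)

# Small GPT-2-ish model: the vocab projection is a big slice of the compute.
print(f"{output_layer_share(d_model=768, n_layer=12, vocab=50257):.1%}")    # ~31%
# Much larger model, same vocab: the slice becomes negligible.
print(f"{output_layer_share(d_model=12288, n_layer=96, vocab=50257):.1%}")  # ~0.4%
```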

https://x.com/alexjc/status/1881410039639863622

I agree that it would be more useful to compare vocabs of identical size.

Thanks! It seems to me that the performance gain here was due to a smaller vocab size. That kind of change is almost guaranteed to backfire for larger models and larger datasets / lower loss requirements, so it probably isn't very useful. Generally, the historical trend has been toward larger vocabularies as the models themselves got larger.

Well, he says he achieved some downstream win at the same vocab size, but it didn't translate into a perplexity win, so he tried something else. Like I said, it's unfortunate.

I actually wonder if he could just claim a win by computing validation-set bits per byte (BPB) for both equally sized vocabs, instead of targeting the same perplexity threshold as the speedrun finish line lol
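For what it's worth, BPB is easy to compute from the usual cross-entropy loss, and it sidesteps the "different tokenizers give incomparable per-token perplexity" problem because it normalizes by the byte length of the raw validation text rather than by token count. A sketch with made-up illustrative numbers, assuming the loss is the mean cross-entropy in nats per token as PyTorch reports it:

```python
import math

def bits_per_byte(mean_loss_nats: float, n_tokens: int, n_bytes: int) -> float:
    """Convert mean per-token cross-entropy (nats) into bits per byte of raw text."""
    total_bits = mean_loss_nats * n_tokens / math.log(2)  # nats -> bits
    return total_bits / n_bytes

# Two hypothetical runs with different vocabs over the same validation text.
# Their per-token losses aren't comparable, but their BPB numbers are.
n_bytes = 10_000_000                                  # raw validation set size
print(bits_per_byte(2.90, 2_300_000, n_bytes))        # coarser vocab: fewer tokens, higher loss
print(bits_per_byte(2.60, 2_600_000, n_bytes))        # finer vocab: more tokens, lower loss
```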