You can use modded-nanogpt for testing this sort of change. If you don't want to do anything really weird, you can just train whatever tokenizer you like on the training set, then retokenize the input, then run the existing training code. One person did this earlier this year using a Tokenmonster vocab and got better downstream performance in less train time.
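Roughly, that workflow looks like the sketch below. This is just an illustration assuming the HuggingFace tokenizers library and a single plain-text training file; modded-nanogpt's real data pipeline reads pre-tokenized shards, so the file names and the uint16 output here are placeholders, not the repo's exact format.

    # Train a custom BPE vocab on the training text, then retokenize it for training.
    import numpy as np
    from tokenizers import Tokenizer, models, trainers, pre_tokenizers

    # 1) Train a byte-level BPE tokenizer on the same text the model will see.
    tok = Tokenizer(models.BPE(unk_token=None))
    tok.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
    trainer = trainers.BpeTrainer(vocab_size=32768, special_tokens=["<|endoftext|>"])
    tok.train(files=["train_text.txt"], trainer=trainer)  # hypothetical file name
    tok.save("custom_tokenizer.json")

    # 2) Retokenize the corpus into a flat uint16 token stream.
    with open("train_text.txt", encoding="utf-8") as f:
        ids = tok.encode(f.read()).ids
    np.array(ids, dtype=np.uint16).tofile("train_custom.bin")  # vocab must fit in uint16

    # 3) Point the existing training script at the new shard and the new vocab size,
    #    then train as before; only the embedding/head dimensions change.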

Which exact attempt with which Tokenmonster vocab are you referring to? It is often hard to conclude much from these efforts. For example, a smaller vocabulary typically only helps small models, where the compute cost of the softmax layer at the end of the decoder can still account for a meaningful share of the total compute. Keep the vocabulary size fixed while scaling up the rest of the model, and that inefficiency effectively disappears.
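To put rough numbers on that: the output (lm_head) matmul costs about 2 * d_model * vocab FLOPs per token, while the transformer body costs roughly 12 * n_layer * d_model^2 per token. The configurations below are illustrative, not measurements from any particular run.

    # Rough fraction of per-token compute spent on the output softmax projection.
    def head_fraction(n_layer: int, d_model: int, vocab: int) -> float:
        head = 2 * d_model * vocab              # final projection onto the vocab
        body = 12 * n_layer * d_model ** 2      # attention + MLP blocks (approximation)
        return head / (head + body)

    # Illustrative sizes: a GPT-2-small-ish model vs. a much larger decoder.
    print(f"{head_fraction(12, 768, 50304):.1%}")   # ~48%: the head really matters
    print(f"{head_fraction(80, 8192, 50304):.1%}")  # ~1%: the head is basically noise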

https://x.com/alexjc/status/1881410039639863622

I agree that it would be more useful to compare vocabs of identical size.

Thanks! It seems to me that the performance gain here came mostly from the smaller vocab size. That kind of change is almost guaranteed to backfire for larger models and larger datasets / lower loss targets, so it probably isn't very useful. Historically the trend has been toward larger vocabularies as the models themselves got larger.

Well, he says he achieved some downstream win at the same vocab size, but it didn't translate into a win in perplexity, so he tried something else. Like I said, it's unfortunate.

I actually wonder if he could just claim a win by computing validation-set BPB (bits per byte) for both equally sized vocabs, instead of targeting the same perplexity as the speedrun finish line lol
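For what it's worth, that conversion is trivial if you have the summed validation cross-entropy and the byte count of the validation text. A minimal sketch, with made-up numbers rather than anything from the speedrun:

    import math

    def bits_per_byte(total_nats: float, total_bytes: int) -> float:
        # Total cross-entropy over the validation split (in nats) per utf-8 byte, in bits.
        return total_nats / (math.log(2) * total_bytes)

    # A vocab that yields fewer tokens can show worse per-token perplexity yet still
    # win on BPB, which is why perplexity isn't comparable across different vocabs.
    print(bits_per_byte(total_nats=1.23e8, total_bytes=2.0e8))  # made-up numbers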