Co-author of the PathPiece paper here.

With regard to weighting the n-grams by length*frequency, I'm not sure that would actually be better. The SentencePiece unigram model does it that way (as I mentioned in another comment), and as a result unigram produces longer tokens on average. That is generally considered to be something of a drawback of unigram. Not that there is strong evidence either way, as with many things in tokenization.
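
For illustration only, here's a rough sketch (plain Python, not taken from PathPiece or SentencePiece; the function names and the toy corpus are mine) of the difference between seeding the candidate list by raw frequency and by length*frequency:

    from collections import Counter

    def candidate_ngrams(corpus: bytes, max_len: int = 8) -> Counter:
        """Count all byte n-grams up to max_len in a (small) corpus."""
        counts = Counter()
        for i in range(len(corpus)):
            for n in range(1, max_len + 1):
                if i + n <= len(corpus):
                    counts[corpus[i:i + n]] += 1
        return counts

    def top_by_frequency(counts: Counter, k: int):
        # Seed vocabulary from the k most frequent n-grams.
        return [g for g, _ in counts.most_common(k)]

    def top_by_length_times_frequency(counts: Counter, k: int):
        # Weighting each n-gram by len * count favors longer strings,
        # in the spirit of the unigram-style seeding discussed above.
        return [g for g, _ in sorted(counts.items(),
                                     key=lambda kv: len(kv[0]) * kv[1],
                                     reverse=True)[:k]]

    corpus = b"the quick brown fox jumps over the lazy dog " * 100
    counts = candidate_ngrams(corpus)
    print(top_by_frequency(counts, 10))
    print(top_by_length_times_frequency(counts, 10))

The length-weighted ranking pulls longer n-grams toward the top of the seed list, which is consistent with unigram producing longer tokens on average.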

Why do you think 2^18 initial n-grams is too few? That's 262,144, which is 5.3 times more than the largest vocab we train.

I think the ideal number of initial n-grams is simply one large enough that adding more initial n-grams has no effect on the output, because I don't expect to be very good at tuning two different knobs.