How does SentencePiece choose the initial vocabulary that then gets trimmed down into the final vocabulary with these desirable properties?

It appears to be the top n-grams scored by the product of frequency and length. Including the frequency weighting is a bit nonstandard among ablative methods.

See line 233: https://github.com/google/sentencepiece/blob/master/src/unig...

I would suspect the n-gram counts don't cross pre-token boundaries, but I don't have time to find that in the code right now.
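For concreteness, here's a rough Python paraphrase of what that seeding step seems to do. This is only a sketch, not the actual C++ implementation (which appears to use a suffix array); `max_len` and `seed_size` are made-up values, and counting within whitespace-split pre-tokens here just reflects my suspicion above rather than anything I've verified:

```python
from collections import Counter

def seed_vocab(sentences, max_len=8, seed_size=1000):
    """Sketch of seed-vocabulary selection: count substrings, score by freq * len."""
    counts = Counter()
    for sentence in sentences:
        # Counting within whitespace-split pre-tokens; whether the real trainer
        # crosses pre-token boundaries is the open question above.
        for token in sentence.split():
            for start in range(len(token)):
                for end in range(start + 1, min(start + max_len, len(token)) + 1):
                    counts[token[start:end]] += 1
    # Score each candidate by frequency * length and keep the top seed_size.
    scored = sorted(counts.items(), key=lambda kv: kv[1] * len(kv[0]), reverse=True)
    return [piece for piece, _ in scored[:seed_size]]

print(seed_vocab(["low lower lowest", "new newer newest"], max_len=6, seed_size=20))
```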

You can let pieces cross whitespace boundaries by setting the `--split_by_whitespace` flag to false (it's true by default).

https://github.com/google/sentencepiece/blob/master/doc/opti...
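Through the Python wrapper that would look something like this (a sketch from memory; the corpus path, model prefix, and vocab size are just placeholders):

```python
import sentencepiece as spm

# Train a unigram model whose pieces may span whitespace, by turning off
# the default split-by-whitespace behaviour. Paths and sizes are placeholders.
spm.SentencePieceTrainer.train(
    input="corpus.txt",          # one sentence per line
    model_prefix="unigram_ws",   # writes unigram_ws.model / unigram_ws.vocab
    vocab_size=8000,
    model_type="unigram",
    split_by_whitespace=False,   # default is True; False lets pieces cross spaces
)
```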

For anyone reading this in the future: I meant to say the length weighting is a bit nonstandard. Seed selection is usually done by frequency alone. Oops.

Just a minor nit: SentencePiece is a library, not a tokenization algorithm. It implements two tokenization algorithms, Unigram and BPE.

BPE builds vocabularies from the bottom up, so I assume you are talking about Unigram, which starts with a big vocabulary and trims it down.

The details of UnigramLM are here: https://arxiv.org/pdf/1804.10959, and the part about vocabulary seeding is in Section 3.2.

Basically, it just selects all substrings that appear in the corpus up to a certain length, and then keeps only the most frequent of those (the paper describes the seed as the union of all characters and the most frequent substrings) so the initial vocabulary stays a manageable size and training is faster.

If the library has two vocabulary learners, only one of which does the described thing, then isn't it unambiguous which implementation within the library the question refers to? And wouldn't it be ambiguous to instead say "how does Unigram do it" without referring to any particular implementation?

Anyway, the paper says "Frequent substrings can be enumerated in O(T) time and O(20T) space with the Enhanced Suffix Array algorithm (Nong et al., 2009)", which is hilariously underspecified, at least in part because a suffix array construction algorithm gives you sorted suffixes, not the top-k most frequent substrings; extracting those is a separate step the paper doesn't describe.
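To make that concrete, here's a naive sketch of the missing extraction step: build a suffix array (by brute-force sorting here, not the linear-time construction from Nong et al.), compute LCP values, enumerate each distinct substring up to some made-up `max_len` exactly once with its occurrence count, and only then take a top-k. None of the constants or names come from the paper or from SentencePiece:

```python
def top_substrings(text, max_len=8, top_k=20):
    """Naive demo: suffix array + LCP -> distinct substrings with counts -> top-k."""
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])   # brute-force suffix array
    # lcp[i] = longest common prefix of suffixes sa[i-1] and sa[i]
    lcp = [0] * n
    for i in range(1, n):
        a, b = text[sa[i - 1]:], text[sa[i]:]
        k = 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        lcp[i] = k
    scored = {}
    for i in range(n):
        limit = min(max_len, n - sa[i])
        # Substrings of length lcp[i]+1 .. limit occur here for the first time
        # in sorted-suffix order, so each distinct substring is handled once.
        for length in range(lcp[i] + 1, limit + 1):
            count = 1
            j = i + 1
            while j < n and lcp[j] >= length:       # consecutive suffixes share the prefix
                count += 1
                j += 1
            scored[text[sa[i]:sa[i] + length]] = count   # occurrence count
    # The top-k step the paper glosses over (could just as well score by count * length).
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

print(top_substrings("low lower lowest low", max_len=5, top_k=10))
```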