Hacker News

cschmidt 2 days ago

It appears to be the top n-grams scored by the product of frequency and length. Including the frequency weighting is a bit nonstandard among ablative methods.

See line 233: https://github.com/google/sentencepiece/blob/master/src/unig...
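To make the scoring concrete, here is a minimal sketch of ranking substrings by frequency times length. It is not the SentencePiece implementation (which I have not reproduced here); `top_ngrams`, `max_len`, and `k` are hypothetical names, and brute-force substring enumeration is assumed purely for illustration.

```python
from collections import Counter

def top_ngrams(pretokens, max_len=4, k=5):
    """Return the k substrings with the highest frequency * length score.

    Toy illustration only: counts every substring of up to max_len
    characters inside each pre-token, never across pre-tokens.
    """
    counts = Counter()
    for tok in pretokens:
        for i in range(len(tok)):
            for j in range(i + 1, min(i + max_len, len(tok)) + 1):
                counts[tok[i:j]] += 1
    # Score each candidate by the product of its count and its length.
    scored = {s: c * len(s) for s, c in counts.items()}
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))[:k]

print(top_ngrams(["abab", "ab"], max_len=2, k=2))
```

With the length factor, "ab" (count 3, length 2) outranks "a" and "b" (count 3, length 1 each); ranking by count alone would tie all three.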

I would suspect the n-gram counts don't cross pre-token boundaries, but I don't have time to find that in the code right now.

mcyc 2 days ago

You can let pieces cross whitespace boundaries by setting the `--split_by_whitespace` flag to false (it's true by default).

https://github.com/google/sentencepiece/blob/master/doc/opti...
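A toy sketch of what that option changes for candidate counting, assuming simple substring enumeration. This is not SentencePiece code: `count_ngrams` and its parameters are hypothetical, and the real tool also attaches the whitespace marker to pieces when splitting is on, which this sketch ignores.

```python
from collections import Counter

def count_ngrams(text, max_len=3, split_by_whitespace=True):
    """Count substrings of up to max_len characters.

    With split_by_whitespace=True, candidates stay inside
    whitespace-delimited units. With False, the whole text is one unit
    and spaces are shown as U+2581, so candidates may span a boundary.
    """
    if split_by_whitespace:
        units = text.split()
    else:
        units = [text.replace(" ", "\u2581")]
    counts = Counter()
    for unit in units:
        for i in range(len(unit)):
            for j in range(i + 1, min(i + max_len, len(unit)) + 1):
                counts[unit[i:j]] += 1
    return counts

strict = count_ngrams("a b", split_by_whitespace=True)
loose = count_ngrams("a b", split_by_whitespace=False)
print(sorted(strict))  # candidates confined to each unit
print(sorted(loose))   # includes candidates that span the space
```

Only the second call can produce a candidate that contains the boundary marker.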

cschmidt a day ago

For anyone reading this in the future: I meant to say the length weighting is a bit nonstandard. Scoring is usually by frequency alone. Oops.