If labs released the training corpus and source code for their tokenizers, producing reproducible results about tokenizers would be far easier. As it is, they release neither, so it is impossible to compare different algorithms run on the same data while also including the vocabularies in common use.
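To make the point concrete: a fair comparison needs both the algorithm and the corpus held fixed, because the same algorithm on different data yields a different vocabulary. A minimal toy sketch of BPE-style training (illustrative only, not any lab's actual trainer; the corpus strings are invented):

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Toy byte-pair-encoding trainer: start from characters,
    repeatedly merge the most frequent adjacent pair."""
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    # Vocabulary = remaining symbols plus all merged tokens.
    return {s for w in words for s in w} | {a + b for a, b in merges}

# Same algorithm, two different corpora -> different vocabularies,
# so vocab overlap alone cannot isolate the effect of the algorithm.
vocab_a = train_bpe("low low low lower lowest new newer newest", 5)
vocab_b = train_bpe("slow slow slow slower slowest new newer newest", 5)
overlap = len(vocab_a & vocab_b) / len(vocab_a | vocab_b)
```

Without the original corpus, any measured difference between a released vocabulary and one you train yourself conflates the data and the algorithm, which is exactly the comparison that cannot be done today.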