Thanks! It seems to me that the performance gain here came from the smaller vocab size. That kind of change is almost guaranteed to backfire for larger models and larger datasets / lower loss targets, so it probably isn't very useful. Historically, the trend has been toward larger vocabularies as the models themselves got larger.
Well, he says he got some downstream win at the same model size, but it didn't translate into a perplexity win, so he moved on to trying something else. Like I said, it's unfortunate.
I actually wonder if he could just claim a win by computing validation-set bits-per-byte (BPB) for both vocab sizes (at the same model size), since BPB is comparable across tokenizers, instead of targeting the same perplexity as the speedrun finish line lol
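For what it's worth, BPB is just the total validation NLL converted from nats to bits and divided by the raw byte count of the validation text, so it's invariant to how the text was tokenized, whereas per-token perplexity shifts whenever the token count changes. A minimal sketch with made-up illustrative numbers (not from the actual run):

```python
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Total validation loss converted to bits, normalized by raw byte
    count, so the number is comparable across different vocab sizes."""
    return total_nll_nats / (math.log(2) * total_bytes)

# Hypothetical numbers: the same validation text encoded with two
# tokenizers. Per-token perplexity is NOT comparable because the token
# counts differ; BPB normalizes by bytes, which both share.
val_bytes = 10_000_000  # raw size of the validation set in bytes

# smaller vocab -> more tokens, so a lower per-token loss can look
# "better" in perplexity without the model actually compressing better
small_vocab = {"tokens": 3_200_000, "mean_nll_nats": 2.95}
# larger vocab -> fewer tokens, higher per-token loss
large_vocab = {"tokens": 2_400_000, "mean_nll_nats": 3.30}

for name, run in [("small vocab", small_vocab), ("large vocab", large_vocab)]:
    total_nll = run["mean_nll_nats"] * run["tokens"]
    ppl = math.exp(run["mean_nll_nats"])
    bpb = bits_per_byte(total_nll, val_bytes)
    print(f"{name}: ppl={ppl:.2f} (not comparable), bpb={bpb:.3f} (comparable)")
```

With these made-up numbers the small vocab wins on perplexity but loses on BPB, which is exactly the kind of mirage a byte-normalized metric would expose.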