I was looking into a morpheme tokenization approach, but went even more radical and built a semantic primitive tokenizer [1]. The idea is that kill, killed, and killer all share the same semantic connection and tokens, e.g. [KILL], [KILL, BEFORE], [KILL, SOMEONE].

It’s based on semantic primitives (Wierzbicka’s Natural Semantic Metalanguage, NSM) and emoji (the fun idea that got me interested in this in the first place).
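
To make the kill/killed/killer example concrete, here is a toy sketch in Python. It is not the actual primoji code: the lookup table, the `UNK` fallback, and the `tokenize` function name are all made up for illustration, and a real tokenizer would presumably derive these decompositions rather than hard-code them.

```python
# Hypothetical word -> primitive table (an assumption for illustration only;
# the real primoji vocabulary and decomposition rules may differ).
PRIMITIVES = {
    "kill": ["KILL"],
    "killed": ["KILL", "BEFORE"],    # past tense expressed via a BEFORE primitive
    "killer": ["KILL", "SOMEONE"],   # agent noun expressed via a SOMEONE primitive
}


def tokenize(text):
    """Map each whitespace-separated word to its list of primitive tokens."""
    tokens = []
    for word in text.lower().split():
        # Words missing from the toy table fall back to a placeholder token.
        tokens.extend(PRIMITIVES.get(word, ["UNK"]))
    return tokens


print(tokenize("kill killed killer"))
```

All three words share the KILL token, so the model sees the relation between them directly instead of having to learn it from three unrelated token IDs.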

So far I’ve tested 6 iterations. The model trains and responds well with a 10k vocab, but the grammar came out rougher. I’m now working on the 8th iteration, mainly to improve grammar and language quality. It turns out the smaller vocab couldn’t be maintained: all the improvements put us back in the ballpark of a 32k vocab. Further testing is planned for this week.

[1] https://github.com/frane/primoji