I have been doing experiments along these lines. Initially I avoided the issue of word semantics altogether, because there is plenty of work to be done before you even get to the words themselves.

With BPE you get different encodings for the same word depending on the surrounding punctuation and whitespace, whenever it decides to merge the characters before and after a word into the word token itself.

The tokenizer I have been using breaks each word into the following components:

   prefix character
   prefix character repeat count
   index of lowercase(word)
   index of capitalization mode (0 = all lower, 1 = initial cap, 2 = all caps)
   suffix character
The encoder records the character before and after each word, so that information is double-encoded into the previous and next words; the decoder decreases the prefix count by one if it matches the suffix of the previous word. Anything that doesn't match a known word, or is capitalized oddly, gets character-encoded as a series of prefix/suffix characters with no word body. (A separate channel for characters would probably be better.)
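To make that concrete, here's a simplified sketch of the per-word encoding. This is not my exact code; the vocab, names, and edge-case handling are just illustrative:

    from __future__ import annotations
    from dataclasses import dataclass

    @dataclass
    class WordToken:
        prefix_char: str    # character immediately before the word (e.g. ' ' or '"')
        prefix_count: int   # run length of that character
        word_index: int     # index of lowercase(word) in the vocabulary
        cap_mode: int       # 0 = all lower, 1 = initial cap, 2 = all caps
        suffix_char: str    # character immediately after the word (e.g. '.' or '!')

    def capitalization_mode(word: str) -> int | None:
        if word.islower():
            return 0
        if word[0].isupper() and word[1:].islower():
            return 1
        if word.isupper():
            return 2
        return None  # weird capitalization -> fall back to character encoding

    def encode_word(prefix: str, word: str, suffix_char: str, vocab: dict[str, int]) -> WordToken | None:
        mode = capitalization_mode(word)
        idx = vocab.get(word.lower())
        if mode is None or idx is None:
            # Caller falls back to a run of prefix/suffix character tokens with no word body.
            return None
        prefix_char = prefix[-1] if prefix else " "
        prefix_count = len(prefix) - len(prefix.rstrip(prefix_char))
        return WordToken(prefix_char, prefix_count, idx, mode, suffix_char)

    vocab = {"hello": 42}
    print(encode_word('"', "Hello", "!", vocab))
    # WordToken(prefix_char='"', prefix_count=1, word_index=42, cap_mode=1, suffix_char='!')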

Each channel gets its own embedding table, and the embeddings from all the channels are summed before being passed into the transformer.
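Roughly, the embedding side looks something like this (a simplified PyTorch-style sketch, not my actual code; the sizes and names are made up):

    import torch
    import torch.nn as nn

    class ChannelEmbedding(nn.Module):
        def __init__(self, d_model: int, n_words: int, n_chars: int, max_repeat: int = 8):
            super().__init__()
            self.prefix_char = nn.Embedding(n_chars, d_model)
            self.prefix_count = nn.Embedding(max_repeat, d_model)
            self.word = nn.Embedding(n_words, d_model)
            self.cap_mode = nn.Embedding(3, d_model)  # lower / initial cap / all caps
            self.suffix_char = nn.Embedding(n_chars, d_model)

        def forward(self, ch: dict[str, torch.Tensor]) -> torch.Tensor:
            # Each tensor holds (batch, seq_len) channel indices; the summed result
            # is what gets fed into the transformer stack.
            return (self.prefix_char(ch["prefix_char"])
                    + self.prefix_count(ch["prefix_count"])
                    + self.word(ch["word"])
                    + self.cap_mode(ch["cap_mode"])
                    + self.suffix_char(ch["suffix_char"]))

Summing keeps the model dimension fixed no matter how many channels there are; concatenating and projecting would be the other obvious option.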

Decode is just a separate last-stage layer translating the final embedding back into channels. In hindsight it should have been a little more than that, because if the options for the next word were split between " YELLED!" and " whispered." my current system could theoretically produce " WHISPERED.". In practice it doesn't seem to do that much, but that means it has had to learn something to deal with it (I suspect by limiting variation). Adding a little smarts to the final detokenization would help: perhaps choose the word index first, then use the embedding of the matched word to filter the predicted embedding before calculating the other channels.
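Something like this is what I have in mind (a rough sketch only; the heads and shapes are hypothetical):

    import torch
    import torch.nn as nn

    class ChannelDecoder(nn.Module):
        def __init__(self, d_model: int, n_words: int, n_chars: int):
            super().__init__()
            self.word_head = nn.Linear(d_model, n_words)
            self.word_embed = nn.Embedding(n_words, d_model)
            self.cap_head = nn.Linear(d_model, 3)
            self.suffix_head = nn.Linear(d_model, n_chars)

        def forward(self, h: torch.Tensor):
            # 1. Commit to a word index from the final embedding h.
            word_idx = self.word_head(h).argmax(dim=-1)
            # 2. Fold the chosen word's embedding back in before predicting the
            #    remaining channels, so " YELLED!" pulls capitalization one way and
            #    " whispered." the other, instead of mixing the two.
            h = h + self.word_embed(word_idx)
            return word_idx, self.cap_head(h).argmax(-1), self.suffix_head(h).argmax(-1)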

I have not yet done anything on breaking words themselves up. I have been using TinyStories for training, so with so few unique words there hasn't been a need for it. I have given it a lot of thought though, and I think I would contest the 'Gold standard' encodings: I think a word like 'nodular' should be encoded as something like 'nodule' '<of or relating to modifier>' '<minus e>' 'ar'.

It's a little hard to imagine what this might look like for other languages, but I think there are probably insights to be had if you tried to make something that encoded English, Latin and Kanji equally well.

I'd be curious to know the total number of homonyms there are across languages, too. Just a single signal to say 'this one is a known homonym' would probably be beneficial. If the total number is low enough, giving them their own token range might work too.