Shameless plug: you may wish to do Lucene-style tokenizing using the Unicode standard (UAX #29 word boundaries): https://github.com/clipperhouse/uax29/tree/master/words

Got to admit, my initial impression is that this is pretty neat. I'll spend some time with it. Thanks for the link :)