This doesn't do the same thing though, since it's not Unicode aware.
>>> 'x\u2009 a'.split()
['x', 'a']
# incorrect; in bytes mode, `\S` doesn't know about unicode whitespace
>>> list(re.finditer(br'\S+', 'x\u2009 a'.encode()))
[<re.Match object; span=(0, 4), match=b'x\xe2\x80\x89'>, <re.Match object; span=(5, 6), match=b'a'>]
# correct, in unicode mode
>>> list(re.finditer(r'\S+', 'x\u2009 a'))
[<re.Match object; span=(0, 1), match='x'>, <re.Match object; span=(3, 4), match='a'>]
There's bound to be a way to turn a stream of bytes into a stream of Unicode code points (at least, I think that's what Python is doing for strings). Though I'm explicitly not volunteering to write the code for it.
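For what it's worth, the stdlib already has this: `codecs.iterdecode` turns an iterable of byte chunks into an iterable of strings, handling multi-byte characters split across chunk boundaries. A minimal sketch (the chunks here are made up to stand in for reading a file in binary mode):

```python
import codecs

# U+2009 is three bytes in UTF-8 (e2 80 89); here it is deliberately
# split across two chunks to show the decoder buffers partial sequences.
chunks = [b'x\xe2\x80', b'\x89 a']

# iterdecode yields decoded str pieces; join them back into one string.
decoded = ''.join(codecs.iterdecode(iter(chunks), 'utf-8'))

print(decoded.split())  # ['x', 'a'] -- str.split() knows U+2009 is whitespace
```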
Sure, but making one string from the file contents is surely much better than having a separate string per word in the original data.
... Ah, but I suppose the existing code hasn't avoided that anyway. (It's also creating regex match objects, but those get disposed each time through the loop.) I don't know that there's really a way around that. Given the file is barely a KB, I rather doubt that the illustrated techniques are going to move the needle.
In fact, it looks as though the entire data structure (whether a dict, Counter, etc.) should be a relatively small part of the total reported memory usage. The rest seems to be internal Python stuff.
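One way to check that claim is `tracemalloc`, which reports only the allocations Python itself makes, as opposed to the whole process footprint. A rough sketch (the sample text is made up; the real input file isn't shown here):

```python
import tracemalloc
from collections import Counter

tracemalloc.start()

# Build the word-count structure while tracing allocations.
text = 'the quick brown fox jumps over the lazy dog the end'
counts = Counter(text.split())

current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(counts.most_common(1))  # [('the', 3)]
# `current` counts just the traced Python objects; comparing it against
# the process RSS shows how much is interpreter overhead rather than data.
print(current)
```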
OP's .split_ascii() doesn't handle U+2009 either.
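That failure mode can be mimicked in Python with the `re.ASCII` flag, which restricts `\s` to ASCII whitespace (the `.split_ascii()` name is from the code under discussion, which isn't shown here, so this is only an analogy):

```python
import re

s = 'x\u2009 a'  # thin space (U+2009) followed by an ASCII space

# ASCII-only mode: \s matches only the plain space, so U+2009 sticks to 'x'.
print(re.split(r'\s+', s, flags=re.ASCII))  # ['x\u2009', 'a']

# Default (Unicode) mode treats both characters as whitespace.
print(re.split(r'\s+', s))                  # ['x', 'a']
```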
edit: OP's fully native C++ version using Pystd
Hmm? Which code are you looking at?