Cool exercise and thank you for the blog post.
I did a similar thing (for fun) for the tokenizer of a Swift-derived language, written in C++.
My approach was, however, very different from yours:
- No macros, no ASM, just explicit vectorization using std::simd
- No hand-rolled allocator, just std::vector and SoA (first sketch below)
- No hashing for keywords. They are short, so a single SIMD load/compare is often enough (second sketch below)
- All the lookup tables are generated at compile time from the token list using constexpr, to keep the code small and maintainable (third sketch below)
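
To give an idea of the SoA layout, it is something along these lines (the field names are illustrative, not my actual code):

```cpp
#include <cstdint>
#include <vector>

// Struct-of-arrays token storage: one plain std::vector per field,
// so each pass over the tokens only touches the data it needs.
struct TokenStream {
    std::vector<uint8_t>  kind;    // token kind
    std::vector<uint32_t> offset;  // byte offset into the source buffer
    std::vector<uint32_t> length;  // token length in bytes

    void push(uint8_t k, uint32_t off, uint32_t len) {
        kind.push_back(k);
        offset.push_back(off);
        length.push_back(len);
    }
};
```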
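The keyword check is roughly this, using <experimental/simd> as it ships with GCC/Clang; the zero-padded 16-byte buffers are just one way to make the vector load safe and are not necessarily how my tokenizer stages the data:

```cpp
#include <experimental/simd>
#include <array>
#include <cstring>
#include <string_view>

namespace stdx = std::experimental;

// Compare a candidate identifier against one keyword with a single
// 16-byte vector load/compare instead of hashing.
inline bool equals_keyword(std::string_view ident, std::string_view kw) {
    if (ident.size() != kw.size()) return false;

    // Copy both into zero-padded 16-byte buffers so the full-width load is safe.
    alignas(16) std::array<char, 16> a{}, b{};
    std::memcpy(a.data(), ident.data(), ident.size());
    std::memcpy(b.data(), kw.data(), kw.size());

    stdx::fixed_size_simd<char, 16> va, vb;
    va.copy_from(a.data(), stdx::element_aligned);
    vb.copy_from(b.data(), stdx::element_aligned);
    return stdx::all_of(va == vb);  // one vector compare decides the match
}
```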
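And the constexpr table generation looks roughly like this; the keyword list and the table shape are illustrative:

```cpp
#include <array>
#include <cstdint>
#include <string_view>

// The token list is the single source of truth.
inline constexpr std::array<std::string_view, 6> kKeywords = {
    "func", "var", "let", "if", "else", "return"};

// For each possible first byte, precompute a bitmask of which keywords
// can match, so the hot loop only runs a couple of SIMD compares.
constexpr std::array<uint32_t, 256> make_first_byte_candidates() {
    std::array<uint32_t, 256> table{};
    for (uint32_t i = 0; i < kKeywords.size(); ++i)
        table[static_cast<unsigned char>(kKeywords[i][0])] |= (1u << i);
    return table;
}

inline constexpr auto kCandidates = make_first_byte_candidates();  // built at compile time
```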
I was able to reach around 8 Mloc/s on server-grade hardware, single core.