Cool exercise and thank you for the blog post.

I did a similar thing (for fun) for the tokenizer of a Swift-derivative language, written in C++.

My approach was, however, very different from yours:

- No macros, no ASM, just explicit vectorization using std::simd (scanning sketch below)

- No hand-rolled allocator. Just std::vector and SoA (token storage sketch below).

- No hashing for keywords. They are short, so a single SIMD load/compare per keyword is usually enough (keyword matching sketch below)

- All the lookup tables are generated at compile time from the token list using constexpr, to keep the code small and maintainable (table generation sketch below).
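
The vectorized scanning looks roughly like this. It is a minimal sketch, not my actual code, assuming std::experimental::simd (the Parallelism TS v2 form that GCC ships) and an illustrative 32-byte chunk width, applied to the usual case of skipping over an identifier run:

```cpp
// Sketch: vectorized "skip over an identifier" using std::experimental::simd.
#include <experimental/simd>
#include <cstddef>
#include <cstdint>

namespace stdx = std::experimental;

using chunk = stdx::fixed_size_simd<std::uint8_t, 32>;

// Lane mask marking bytes that may appear inside an identifier.
inline auto is_ident_char(const chunk& c)
{
    auto lower = (c >= std::uint8_t('a')) & (c <= std::uint8_t('z'));
    auto upper = (c >= std::uint8_t('A')) & (c <= std::uint8_t('Z'));
    auto digit = (c >= std::uint8_t('0')) & (c <= std::uint8_t('9'));
    auto under = (c == std::uint8_t('_'));
    return lower | upper | digit | under;
}

// Scans forward from `pos` and returns the end of the identifier run.
std::size_t skip_identifier(const std::uint8_t* src, std::size_t len, std::size_t pos)
{
    while (pos + chunk::size() <= len) {
        chunk c(&src[pos], stdx::element_aligned);   // unaligned 32-byte load
        auto m = is_ident_char(c);
        if (!stdx::all_of(m))
            return pos + stdx::find_first_set(!m);   // first non-identifier byte
        pos += chunk::size();
    }
    while (pos < len) {                              // scalar tail for the last bytes
        std::uint8_t b = src[pos];
        bool ident = (b >= 'a' && b <= 'z') || (b >= 'A' && b <= 'Z')
                  || (b >= '0' && b <= '9') || b == '_';
        if (!ident) break;
        ++pos;
    }
    return pos;
}
```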
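
The SoA token storage is nothing fancy, just a few parallel std::vector columns. A minimal sketch with illustrative field names:

```cpp
// Sketch: structure-of-arrays token storage on top of plain std::vector.
#include <cstddef>
#include <cstdint>
#include <vector>

enum class TokenKind : std::uint8_t { Identifier, Keyword, Number, Punct, Eof };

// One contiguous vector per field instead of a std::vector<Token> of structs:
// passes that touch a single column (e.g. kinds) stay cache-friendly.
struct TokenStream {
    std::vector<TokenKind>     kinds;
    std::vector<std::uint32_t> offsets;   // byte offset into the source buffer
    std::vector<std::uint32_t> lengths;

    void reserve(std::size_t n) {         // one reservation per column
        kinds.reserve(n);
        offsets.reserve(n);
        lengths.reserve(n);
    }

    void push(TokenKind k, std::uint32_t off, std::uint32_t len) {
        kinds.push_back(k);
        offsets.push_back(off);
        lengths.push_back(len);
    }

    std::size_t size() const { return kinds.size(); }
};
```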
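
The keyword check is along these lines (illustrative keyword list and a hypothetical 16-byte pad width; the real list comes from the token definitions):

```cpp
// Sketch: keyword matching with one SIMD equality compare per keyword,
// instead of hashing. Keywords and the candidate are zero-padded to 16 bytes.
#include <experimental/simd>
#include <array>
#include <cstdint>
#include <cstring>
#include <string_view>

namespace stdx = std::experimental;
using kw_vec = stdx::fixed_size_simd<std::uint8_t, 16>;

// Hypothetical keyword list standing in for the real one.
constexpr std::array<std::string_view, 4> kKeywords = { "func", "let", "var", "return" };

inline bool is_keyword(std::string_view lexeme)
{
    if (lexeme.size() > 16)                      // keywords are short
        return false;

    std::uint8_t buf[16] = {};                   // zero-padded candidate
    std::memcpy(buf, lexeme.data(), lexeme.size());
    kw_vec cand(buf, stdx::element_aligned);

    for (std::string_view kw : kKeywords) {
        std::uint8_t pad[16] = {};               // zero-padded keyword slot
        std::memcpy(pad, kw.data(), kw.size());
        kw_vec k(pad, stdx::element_aligned);
        if (stdx::all_of(cand == k))             // single compare decides the match
            return true;
    }
    return false;
}
```

In real code the padded keyword slots don't need to be rebuilt on every call; they can be precomputed, which is where the compile-time tables below come in.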
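
And the constexpr table generation looks roughly like this (an illustrative single-character punctuation list standing in for the real token list):

```cpp
// Sketch: a 256-entry lookup table derived at compile time from a token list.
#include <array>
#include <cstdint>

enum class PunctKind : std::uint8_t { Unknown, LParen, RParen, LBrace, RBrace, Comma, Colon };

struct PunctToken { char ch; PunctKind kind; };

// Hypothetical token list; the real one drives the real tables.
constexpr std::array<PunctToken, 6> kPunctTokens = {{
    { '(', PunctKind::LParen }, { ')', PunctKind::RParen },
    { '{', PunctKind::LBrace }, { '}', PunctKind::RBrace },
    { ',', PunctKind::Comma  }, { ':', PunctKind::Colon  },
}};

// Maps a leading byte to the punctuation token it starts, or Unknown.
// Built once at compile time; editing the token list regenerates it.
constexpr std::array<PunctKind, 256> make_punct_table()
{
    std::array<PunctKind, 256> table{};          // all Unknown
    for (const PunctToken& t : kPunctTokens)
        table[static_cast<unsigned char>(t.ch)] = t.kind;
    return table;
}

constexpr auto kPunctTable = make_punct_table();

static_assert(kPunctTable[static_cast<unsigned char>('{')] == PunctKind::LBrace);
```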

I was able to reach around 8 Mloc/s on server-grade hardware, single core.