A beautiful algorithm.

Would there be any value in using SIMD to check the whole cache line you fetch for an exact match during the narrowing phase, as an early out?
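To make the question concrete, here is a rough sketch of what I have in mind. It is not the article's algorithm, just a plain binary search over a sorted array, and everything in it is my own assumption: 32-bit keys, 64-byte cache lines (so 16 keys per line), an array base that is 64-byte aligned so each 16-key block really is one line, and the hypothetical names `line_find_exact` and `search_with_line_check`. It uses SSE2 when available and falls back to a scalar loop otherwise.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

enum { KEYS_PER_LINE = 16 }; /* assumption: 64-byte line / 4-byte key */

/* Index of `key` within a full 16-key block, or -1 if absent. */
static int line_find_exact(const uint32_t *line, uint32_t key) {
#if defined(__SSE2__)
    const __m128i needle = _mm_set1_epi32((int)key);
    for (int i = 0; i < KEYS_PER_LINE; i += 4) {
        __m128i v = _mm_loadu_si128((const __m128i *)(line + i));
        /* One mask bit per 32-bit lane that equals the key. */
        int mask = _mm_movemask_ps(_mm_castsi128_ps(_mm_cmpeq_epi32(v, needle)));
        if (mask) {
            int lane = 0;
            while (!(mask & 1)) { mask >>= 1; lane++; }
            return i + lane;
        }
    }
    return -1;
#else
    for (int i = 0; i < KEYS_PER_LINE; i++)
        if (line[i] == key) return i;
    return -1;
#endif
}

/* Binary search over sorted a[0..n) that checks the whole block around
 * each probe. Returns an index holding `key`, or -1 if it is absent. */
ptrdiff_t search_with_line_check(const uint32_t *a, size_t n, uint32_t key) {
    size_t lo = 0, hi = n; /* candidate range [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        size_t start = mid - (mid % KEYS_PER_LINE);
        size_t end = start + KEYS_PER_LINE;
        if (end > n) end = n;

        int hit = -1;
        if (end - start == KEYS_PER_LINE) {
            hit = line_find_exact(a + start, key);
        } else {
            /* Partial tail block: scalar check. */
            for (size_t i = start; i < end; i++)
                if (a[i] == key) { hit = (int)(i - start); break; }
        }
        if (hit >= 0) return (ptrdiff_t)(start + (size_t)hit); /* early out */

        /* The whole block is ruled out, so narrow past it. */
        if (key < a[start]) hi = start;
        else if (key > a[end - 1]) lo = end;
        else return -1; /* key falls inside the block's range but is absent */
    }
    return -1;
}
```

The side effect I find interesting is that a miss on the block also tells you the key is either entirely left of it, entirely right of it, or absent, so each probe discards 16 keys rather than one. Whether the extra compares pay for themselves presumably depends on how often the early out actually fires, which I have not measured.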