Two-character SIMD filtering improved performance significantly:
ClickBench query Q20 sped up by 35%
Other queries which perform substring matching saw an overall improvement of ~10%
The geometric mean of all queries improved by 4.1%
The ClickBench dataset is ~70 GB IIRC, so I find it interesting that they measured such a substantial speedup while only using SSE4.1 (128-bit), so not even AVX2, much less AVX-512. I wonder what the results would be if the latter had been used.

And I also wonder if this is (partly) an artifact of more laser-focused utilization of a CPU core's ALUs and memory subsystem. E.g. crunching more work into a single instruction, or a pair of them, now leaves more room for other, unrelated instructions to be retired.
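For anyone curious what a two-character filter can look like, here is a rough sketch in C. To be clear, this is my own illustration and not the code from the article: the function name `find_two_char_filter` is made up, I'm assuming the filter checks the first two bytes of the needle, and it only needs SSE2 intrinsics (still 128-bit) plus the GCC/Clang builtin `__builtin_ctz`. The idea is to test 16 candidate start offsets per iteration and only run the full `memcmp` on offsets where both bytes match.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h> /* SSE2: _mm_cmpeq_epi8, _mm_movemask_epi8, ... */
#endif

/* Hypothetical helper (not from the article): returns the offset of the
 * first occurrence of needle in hay, or -1. Requires needle_len >= 2. */
static long find_two_char_filter(const char *hay, size_t hay_len,
                                 const char *needle, size_t needle_len)
{
    assert(needle_len >= 2);
    if (hay_len < needle_len)
        return -1;

    size_t last = hay_len - needle_len; /* last valid start offset */
    size_t i = 0;

#if defined(__SSE2__)
    /* Broadcast the first two needle bytes (assumed filter bytes). */
    const __m128i first  = _mm_set1_epi8(needle[0]);
    const __m128i second = _mm_set1_epi8(needle[1]);

    /* The two loads below read hay[i .. i+16], hence the i + 17 bound. */
    while (i + 17 <= hay_len && i <= last) {
        __m128i a = _mm_loadu_si128((const __m128i *)(hay + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(hay + i + 1));

        /* Bit k of mask is set iff hay[i+k] and hay[i+k+1] match the
         * first and second needle byte respectively. */
        __m128i eq = _mm_and_si128(_mm_cmpeq_epi8(a, first),
                                   _mm_cmpeq_epi8(b, second));
        unsigned mask = (unsigned)_mm_movemask_epi8(eq);

        while (mask != 0) {
            size_t pos = i + (size_t)__builtin_ctz(mask);
            if (pos > last)
                break; /* remaining candidates cannot fit the needle */
            if (memcmp(hay + pos, needle, needle_len) == 0)
                return (long)pos;
            mask &= mask - 1; /* clear the lowest set bit */
        }
        i += 16;
    }
#endif

    /* Scalar tail (and full fallback on targets without SSE2). */
    for (; i <= last; i++) {
        if (hay[i] == needle[0] && hay[i + 1] == needle[1] &&
            memcmp(hay + i, needle, needle_len) == 0)
            return (long)i;
    }
    return -1;
}
```

Requiring two bytes to match instead of one should cut down on false candidates that reach `memcmp`, which is presumably where most of the win comes from. Wider registers (AVX2, AVX-512) would test 32 or 64 offsets per iteration with the same structure.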