Seems like you can use the core intuition here of SIMD comparison of multiple elements on more than just the terminal scale.

The outline would be:

a) use a gather to grab multiple elements from 16 evenly spaced locations

b) compare these in parallel using a SIMD instruction

c) focus in on the correct block

d) if the block is small revert to linear search, else repeat the gather/compare cycle

Even though the gather instruction is reading from non-contiguous memory and reading more than you normally would need to with a binary sort, enabling a multiway compare and collapsing 4 levels of binary search should be a win on large tables.

You also may not be able to do a full 16-way comparison for all data types. Searching for float64 will limit you to 8 way comparisons (with AVX-512) but int32, float32 will allow 16 way comparisons.

Gather is extremely slow. Anyone aiming for efficiency will avoid gathers.

I bet you a binary search is in fact faster than any gather based methodology.