Since the CPU always accesses a full cache line (64 bytes) at a time, you might as well search the entire cache line (it’s practically free once the data is on-chip). So I’d like to try a ‘binary’ search that tests all the values in the ‘middle cache line’ and then goes left or right if none match. You can do the cache-line search as a single 512-bit SIMD instruction. A cache line is 64 bytes (i.e. 32 16-bit integers); such a search might well be almost 32x faster than simple binary search; at the very least it’ll do 32x fewer memory accesses, which will dominate in most realistic programs.
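A scalar sketch of the idea (function name and interface are mine): binary search over 64-byte blocks of `uint16_t` keys, scanning the whole middle block at each step. A real implementation would replace the inner loop with one 512-bit compare (e.g. `_mm512_cmpeq_epi16` plus a mask test); here the plain loop just shows the control flow.

```c
#include <stdint.h>
#include <stddef.h>

#define KEYS_PER_LINE 32  /* 64-byte cache line / 2-byte keys */

/* data: sorted ascending; n must be a multiple of KEYS_PER_LINE.
 * Returns the index of target, or -1 if absent. */
ptrdiff_t line_binary_search(const uint16_t *data, size_t n, uint16_t target)
{
    size_t lo = 0, hi = n / KEYS_PER_LINE;   /* block indices, [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const uint16_t *line = data + mid * KEYS_PER_LINE;
        /* scan the whole middle line: nearly free once it's in cache */
        for (size_t i = 0; i < KEYS_PER_LINE; i++)
            if (line[i] == target)
                return (ptrdiff_t)(mid * KEYS_PER_LINE + i);
        if (target < line[0])
            hi = mid;        /* go left  */
        else
            lo = mid + 1;    /* go right */
    }
    return -1;  /* not found */
}
```

Note that eliminating 32 candidates per step only removes log2(32) = 5 levels from the search, so the win comes from touching fewer distinct lines, not from a 32x step count reduction.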
Scanning the upper cache lines of your binary search tree (the sorted vector) for the target is unlikely to yield a hit. Instead you want to use the extra data in the line to shorten the search, which leads you to a B-tree or B+ tree.
For 4-byte keys and 4-byte child pointers (or indexes into an array), your inner nodes would have 7 keys, 8 child pointers and 1 next pointer, completely filling a 64-byte cache line, and your tree depth for 1 million entries would go down from ~20 to ~7, the top few levels of which are likely to remain cache resident.
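The byte accounting works out exactly: 7×4 + 8×4 + 1×4 = 64. A sketch of that node layout (field names are mine; array indexes stand in for the child pointers):

```c
#include <stdint.h>

/* One B-tree inner node packed into a single 64-byte cache line:
 * 7 separator keys + 8 children + 1 next pointer, all 4 bytes each. */
struct inner_node {
    uint32_t keys[7];   /* 28 bytes: sorted separator keys           */
    uint32_t child[8];  /* 32 bytes: child indexes for the 8 ranges  */
    uint32_t next;      /*  4 bytes: next node at this level         */
};

_Static_assert(sizeof(struct inner_node) == 64,
               "node fills exactly one cache line");
```

Since all fields are `uint32_t`, the compiler inserts no padding, so the static assert holds on any conventional ABI.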
With some thought, it's possible to use SIMD on B-tree nodes to speed up the search within the node, but it's all very data dependent.
Binary searching a sorted array is isomorphic to searching a binary search tree with implicit child pointers.
It seems to me like there should be a sort order that stores the items as a fully-dense, left-shifted binary tree from top to bottom (e.g. like the implicit heap in an in-place heap sort, but a binary search tree instead of a heap). Is there a name for this? Does it show any performance wins in practice?
There's Eytzinger order: https://algorithmica.org/en/eytzinger
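A minimal sketch of that layout (function names are mine): `b[1..n]` holds the sorted data rearranged so the root is `b[1]` and node `k`'s children are `b[2k]` and `b[2k+1]`, i.e. the same indexing as an implicit binary heap, but BST-ordered. An in-order traversal of the implicit tree that consumes the sorted input produces the layout.

```c
#include <stddef.h>

/* Fill b[1..n] in Eytzinger order from sorted a[0..n-1].
 * *pos tracks the next element of a to consume (in-order traversal);
 * call with k = 1 and *pos = 0. */
static void eytzinger_build(const int *a, int *b, size_t n,
                            size_t k, size_t *pos)
{
    if (k > n) return;
    eytzinger_build(a, b, n, 2 * k, pos);      /* left subtree  */
    b[k] = a[(*pos)++];                        /* this node     */
    eytzinger_build(a, b, n, 2 * k + 1, pos);  /* right subtree */
}

/* Descend by index doubling; returns the 1-based Eytzinger index
 * of target, or 0 if absent. */
size_t eytzinger_search(const int *b, size_t n, int target)
{
    size_t k = 1;
    while (k <= n) {
        if (b[k] == target) return k;
        k = 2 * k + (size_t)(b[k] < target);  /* branch-light descent */
    }
    return 0;
}
```

The appeal over a plain sorted array is that the first few levels of the tree occupy a handful of consecutive cache lines, so the hot top of the search stays cache resident; the linked article covers the prefetching tricks that make it fast in practice.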
Thanks for that name; that's the exact layout I was considering.
See also https://arxiv.org/abs/1509.05053