If you have an array of numbers with a known upper bound, such as enums with 8 possible values (representable with 3 bits), and a memory-bound operation on those numbers, e.g. for (int i = 0; i < n; i++) if (user_category[i] == 0) filtered.push_back(i), which is common in data warehouses, using my code can more than double performance by making more efficient use of the DRAM<->CPU bus: the packed values take fewer bytes, so less data has to cross the bus for the same number of elements.
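To make the idea concrete, here is a minimal sketch of 3-bit packing and the filter loop running over the packed data. It is not my actual implementation: the names pack3, get3 and filter_equal are hypothetical, invented for this illustration, and the layout (values stored back to back, least significant bits first, plus one padding byte) is just one reasonable assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helpers for illustration only.
constexpr unsigned kBits = 3;  // 8 possible values fit in 3 bits

// Pack values in [0, 8) back to back, 3 bits each, least significant bits first.
std::vector<std::uint8_t> pack3(const std::vector<std::uint8_t>& values) {
    // One extra padding byte so the two-byte access below never runs past the end.
    std::vector<std::uint8_t> packed((values.size() * kBits + 7) / 8 + 1, 0);
    for (std::size_t i = 0; i < values.size(); ++i) {
        assert(values[i] < 8 && "value does not fit in 3 bits");
        std::size_t bit = i * kBits;
        unsigned shifted = (values[i] & 0x7u) << (bit % 8);
        packed[bit / 8]     |= static_cast<std::uint8_t>(shifted & 0xFFu);
        packed[bit / 8 + 1] |= static_cast<std::uint8_t>(shifted >> 8);  // spill-over bits
    }
    return packed;
}

// Read the i-th 3-bit value; it may straddle two adjacent bytes.
inline unsigned get3(const std::vector<std::uint8_t>& packed, std::size_t i) {
    std::size_t bit = i * kBits;
    unsigned word = packed[bit / 8] |
                    (static_cast<unsigned>(packed[bit / 8 + 1]) << 8);
    return (word >> (bit % 8)) & 0x7u;
}

// Same filter as the loop above, but reading 3 bits per element instead of 8 or more.
std::vector<std::size_t> filter_equal(const std::vector<std::uint8_t>& packed,
                                      std::size_t n, unsigned target) {
    std::vector<std::size_t> filtered;
    for (std::size_t i = 0; i < n; ++i)
        if (get3(packed, i) == target) filtered.push_back(i);
    return filtered;
}
```

With one byte per value, n elements cost n bytes of memory traffic; packed at 3 bits they cost roughly 3n/8 bytes. This scalar version only shows the layout. Whether it is actually faster depends on how cheaply the unpacking can be done (for example, several values per word or SIMD), so the speedup figure above applies to my code, not to this sketch.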