ARMv8-A has nice scalar bit (un)packing instructions. I wonder if NEON is really an improvement over those, given that ARM cores tend to have few SIMD ports and NEON is only 128 bits wide.
I'm assuming you're referring to BFM/EXTR? NEON absolutely improves here.
The core I developed on (Neoverse V2) has 4 SIMD ports and 6 scalar integer ports. However, only 2 of those scalar ports support multicycle integer operations like the insert variant of BFM, which is essential for scalar packing.
More importantly, NEON processes 16 elements per instruction instead of 1.
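To make the "1 element per instruction" point concrete, here is a rough C sketch of the scalar path. The names `insert_bits` and `pack_word` are made up for illustration, and this is not the code I benchmarked. The mask-and-shift in `insert_bits` has the same effect as a bitfield insert; a compiler targeting AArch64 may lower it to BFI (an alias of BFM), but that is not guaranteed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: insert the low `width` bits of `value` into `word`
   at bit offset `lsb`, leaving all other bits of `word` unchanged.
   Same effect as a bitfield insert; the emitted instruction is up to the compiler. */
static inline uint64_t insert_bits(uint64_t word, uint64_t value,
                                   unsigned lsb, unsigned width)
{
    assert(width >= 1 && width <= 64 && lsb + width <= 64);
    uint64_t mask = (width == 64) ? ~UINT64_C(0)
                                  : ((UINT64_C(1) << width) - 1);
    return (word & ~(mask << lsb)) | ((value & mask) << lsb);
}

/* Hypothetical helper: pack `n` values of `width` bits each into one
   64-bit word, lowest element first. Each loop iteration handles
   exactly one element, which is the scalar bottleneck. */
static uint64_t pack_word(const uint8_t *src, size_t n, unsigned width)
{
    assert(n * width <= 64);
    uint64_t word = 0;
    for (size_t i = 0; i < n; i++)
        word = insert_bits(word, src[i], (unsigned)(i * width), width);
    return word;
}
```

Every packed element costs one dependent insert into `word`, so throughput is bounded by the ports that can execute it, whereas a NEON version works on a whole 128-bit register at a time.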