Hacker News

Yes because unaligned load is no problem with SSE/AVX. On my RISC-V OrangePi unaligned vector loads beyond byte-granularity fault so you have to take extra care.

AVX shift and shuffle is mostly limited to 128 bits unfortunately for historical reasons (even for 256-bit instructions) and hardware support for AVX512/AVX10 where they fixed that is a complete mess so it's hard to rely on when you care about backwards compatibility for consumer devices, e.g. in game development.

RISC-V vector has excellent mask/shuffle/permute but the performance in real silicon can be... questionable. See the timings for vrgather here for example: https://camel-cdr.github.io/rvv-bench-results/spacemit_a100/...

For working with packed data structures where fields are irregular/non-predictable/dependent on previous fields etc. unaligned load/store is a godsend. Last time I worked on a custom DB engine that used these patterns the generated x86 code was so much nicer than the one for our embedded ARM cores.