I think you can do this fairly efficiently with SSE on x86: SSE/AVX have shift and shuffle instructions. Encoding/decoding packed data might even be faster this way.

I'm not familiar with RISC-V but from what I've seen here, they're also trying to solve this similarly with vector or bit extraction instructions.

Yes, because unaligned loads are no problem with SSE/AVX. On my RISC-V OrangePi, unaligned vector loads beyond byte granularity fault, so you have to take extra care.

Unfortunately, AVX shift and shuffle are mostly limited to 128-bit lanes for historical reasons (even for 256-bit instructions). Hardware support for AVX512/AVX10, where they fixed that, is a complete mess, so it's hard to rely on when you care about backwards compatibility for consumer devices, e.g. in game development.

RISC-V vector has excellent mask/shuffle/permute instructions, but the performance in real silicon can be... questionable. See the timings for vrgather here, for example: https://camel-cdr.github.io/rvv-bench-results/spacemit_a100/...

For working with packed data structures where fields are irregular, unpredictable, dependent on previous fields, etc., unaligned load/store is a godsend. Last time I worked on a custom DB engine that used these patterns, the generated x86 code was so much nicer than the code for our embedded ARM cores.
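
To make the pattern concrete, here is a rough sketch in C of pulling a variable-width field out of a packed buffer with a single unaligned load. It is not the DB engine's actual code: `read_bits` is a made-up helper, and it assumes a little-endian host, fields of at most 57 bits, and at least 8 readable bytes from the byte where the field starts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not from any library): read `width` bits starting at
 * bit offset `bit_pos` from a packed buffer.
 * Assumptions: little-endian host, 1 <= width <= 57, and at least 8 readable
 * bytes starting at buf + bit_pos / 8. */
static uint64_t read_bits(const uint8_t *buf, size_t bit_pos, unsigned width)
{
    uint64_t word;

    assert(width >= 1 && width <= 57);

    /* memcpy is the portable way to express an unaligned load; the compiler
     * is free to lower it to a single load where the target allows that. */
    memcpy(&word, buf + (bit_pos >> 3), sizeof word);

    word >>= (bit_pos & 7);                      /* drop bits below the field */
    return word & ((UINT64_C(1) << width) - 1);  /* keep only `width` bits */
}
```

On x86 compilers typically turn the `memcpy` into one plain load. On targets that fault on unaligned access the same source still works, but it usually ends up as byte loads plus shifts, which is roughly where the code-quality gap comes from.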