Thank you so much for attempting a reproduction! (I posted this on Reddit and most commenters didn't even click the link)
For the baseline you need SIMDe headers: https://github.com/simd-everywhere/simde/tree/master/simde. These alias x86 intrinsics to ARM intrinsics. The baseline is based on the previous State-of-The-Art (https://arxiv.org/abs/1209.2137) which happens to be x86-based; using SIMDe to compile was the highest-integrity way I could think of to compare with the previous SOTA.
Note: M1 chips specifically have notoriously bad small-shift performance, so the benchmark results will be very bad on your machine. M3 partially fixed this, M4 fixed completely. My primary target is server-class rather than consumer-class hardware so I'm not too worried about this.
The benchmark results were cpy-pasted from the terminal. The README prose was AI generated from my rough notes (I'm confident when communicating with other experts/researchers, but less-so with communication to a general audience).
Here is a repro using GCE's C4A Axion instances (c4A-highcpu-72). Seems to beat Graviton? Maybe the title of the thread can be updated to a larger number :) ? I used the largest instance to avoid noisy neighbor issues.
Oh nice! Axion C4A and Graviton4 use the same core (Neoverse V2), so the performance difference is due to factors like clock speed and power management.
I used a geometric mean to calculate the top-line "86 GB/s" for NEON pack/unpack; so that's 91 GB/s for the C4A repro. Probably going to leave the title unmodified.
Super cool!
Pretty sure anyone going into this kind of post about simd would prefer your writing to llm