I tried to run the benchmark on my M1 Pro MacBook, but the "baseline" is written with x86 intrinsics and won't compile.

Are the benchmark results in the README real? (The README itself feels very AI-generated)

Looking at the makefile, it tries to link the x86 SSE "baseline" implementation and the NEON version into the same binary. A real headscratcher!

Edit: The SSE impl gets shimmed via simd-everywhere, and the benchmark results do seem legit (aside from being slightly apples-to-oranges, but that's unavoidable)

Thank you so much for attempting a reproduction! (I posted this on Reddit and most commenters didn't even click the link)

For the baseline you need the SIMDe headers (https://github.com/simd-everywhere/simde/tree/master/simde), which alias x86 intrinsics to ARM intrinsics. The baseline is based on the previous state of the art (https://arxiv.org/abs/1209.2137), which happens to be x86-based; compiling it with SIMDe was the highest-integrity way I could think of to compare against the previous SOTA.

Note: M1 chips specifically have notoriously bad small-shift performance, so the benchmark results will be very bad on your machine. The M3 partially fixed this and the M4 fixed it completely. My primary target is server-class rather than consumer-class hardware, so I'm not too worried about this.

The benchmark results were copy-pasted from the terminal. The README prose was AI-generated from my rough notes (I'm confident when communicating with other experts/researchers, but less so when communicating with a general audience).

Here is a repro using GCE's C4A Axion instances (c4a-highcpu-72). Seems to beat Graviton? Maybe the title of the thread can be updated to a larger number :) ? I used the largest instance to avoid noisy-neighbor issues.

  $ ./out/bytepack_eval
  Bytepack Bench — 16 KiB, reps=20000 (pinned if available)
  Throughput GB/s

  K  NEON pack   NEON unpack  Baseline pack   Baseline unpack
  1  94.77       84.05        45.01           63.12          
  2  123.63      94.74        52.70           66.63          
  3  94.62       83.89        45.32           68.43          
  4  112.68      77.91        58.10           78.20          
  5  86.96       80.02        44.32           60.77          
  6  93.50       92.08        51.22           67.20          
  7  87.10       79.53        43.94           57.95          
  8  90.49       92.36        68.99           83.88

Oh nice! Axion C4A and Graviton4 use the same core (Neoverse V2), so the performance difference is due to factors like clock speed and power management.

I used a geometric mean across the NEON pack and unpack results to calculate the top-line "86 GB/s", so the equivalent figure for the C4A repro is 91 GB/s. I'm probably going to leave the title unmodified.
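
For anyone who wants to check the 91 GB/s figure, here is a minimal Python sketch that takes the geometric mean over the 16 NEON pack and unpack numbers from the C4A table above. The variable and function names are made up for illustration, and pooling both columns into a single mean is an assumption about how the top-line number was computed.

```python
import math

# NEON throughput in GB/s from the C4A run above, K = 1..8.
c4a_neon_pack = [94.77, 123.63, 94.62, 112.68, 86.96, 93.50, 87.10, 90.49]
c4a_neon_unpack = [84.05, 94.74, 83.89, 77.91, 80.02, 92.08, 79.53, 92.36]


def geometric_mean(values):
    """Return the n-th root of the product, computed via logs for stability."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Assumption: the top-line figure pools pack and unpack into one mean.
combined = geometric_mean(c4a_neon_pack + c4a_neon_unpack)
print(f"C4A NEON geometric mean: {combined:.0f} GB/s")  # rounds to 91
```

Pack alone and unpack alone give different means, so the pooled value is the one comparable to the 86 GB/s headline.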

Super cool!

Pretty sure anyone going into this kind of post about SIMD would prefer your writing to an LLM's.

Maybe this could help you: https://github.com/simd-everywhere/simde/issues/1099

But this project isn't using simd-everywhere. I'd like to reproduce the results as documented in the README.

Look at the parent dir. I agree it is a bit confusing

Ah! Yup, that works, I can compile the binary. I get an "Illegal instruction" error when I run it but that's probably just because M1 doesn't support some of the NEON instructions. I retract my implicit AI-slop accusations.

Results from M1 Pro (after setting CPU=native in the makefile): https://gist.github.com/DavidBuchanan314/e3cde76e4dab2758ec4...
