Here is a repro on a GCE C4A Axion instance (c4a-highcpu-72). It seems to beat Graviton? Maybe the thread title can be updated to a larger number :) I used the largest instance to avoid noisy-neighbor issues.
$ ./out/bytepack_eval
Bytepack Bench — 16 KiB, reps=20000 (pinned if available)
Throughput GB/s
K  NEON pack  NEON unpack  Baseline pack  Baseline unpack
1      94.77        84.05          45.01            63.12
2     123.63        94.74          52.70            66.63
3      94.62        83.89          45.32            68.43
4     112.68        77.91          58.10            78.20
5      86.96        80.02          44.32            60.77
6      93.50        92.08          51.22            67.20
7      87.10        79.53          43.94            57.95
8      90.49        92.36          68.99            83.88
Oh nice! Axion C4A and Graviton4 use the same core (Neoverse V2), so the performance difference is most likely due to factors like clock speed and power management.
The top-line "86 GB/s" is a geometric mean over the NEON pack/unpack numbers; the same calculation gives about 91 GB/s for the C4A repro. I'm probably going to leave the title unmodified.
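For anyone who wants to check the 91 GB/s figure, here is a minimal Python sketch of the calculation. It assumes the geometric mean is taken over all 16 NEON values (pack and unpack pooled, K = 1..8), which is my reading of the comment rather than something the benchmark prints. The `geomean` helper is just an illustrative name.

```python
import math

# NEON throughput in GB/s, copied from the C4A output above (K = 1..8).
neon_pack = [94.77, 123.63, 94.62, 112.68, 86.96, 93.50, 87.10, 90.49]
neon_unpack = [84.05, 94.74, 83.89, 77.91, 80.02, 92.08, 79.53, 92.36]


def geomean(values):
    """Geometric mean, computed in log space to avoid a huge product."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Assumption: the top-line number pools the pack and unpack columns.
combined = geomean(neon_pack + neon_unpack)

print(f"NEON pack geomean:   {geomean(neon_pack):.1f} GB/s")
print(f"NEON unpack geomean: {geomean(neon_unpack):.1f} GB/s")
print(f"Combined geomean:    {combined:.0f} GB/s")  # rounds to 91
```

Since both columns have the same number of rows, pooling all 16 values gives the same result as taking the geometric mean of the two per-column geometric means.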