Here is a repro using GCE's C4A Axion instances (c4A-highcpu-72). Seems to beat Graviton? Maybe the title of the thread can be updated to a larger number :) ? I used the largest instance to avoid noisy neighbor issues.

  $ ./out/bytepack_eval
  Bytepack Bench — 16 KiB, reps=20000 (pinned if available)
  Throughput GB/s

  K  NEON pack   NEON unpack  Baseline pack   Baseline unpack
  1  94.77       84.05        45.01           63.12          
  2  123.63      94.74        52.70           66.63          
  3  94.62       83.89        45.32           68.43          
  4  112.68      77.91        58.10           78.20          
  5  86.96       80.02        44.32           60.77          
  6  93.50       92.08        51.22           67.20          
  7  87.10       79.53        43.94           57.95          
  8  90.49       92.36        68.99           83.88

Oh nice! Axion C4A and Graviton4 use the same core (Neoverse V2), so the performance difference is due to factors like clock speed and power management.

I used a geometric mean to calculate the top-line "86 GB/s" for NEON pack/unpack; so that's 91 GB/s for the C4A repro. Probably going to leave the title unmodified.