I was surprised by the earlier comparison on the omarchy website, because Apple M* chips work really well for data science work that doesn't require a GPU.
It may be explained by integer vs float performance, though I am too lazy to investigate. A weak data point, using the product of an N=6000 matrix with itself in NumPy:
- SER 8 8745, Linux: 280 ms -> 1.53 Tflops (single prec)
- my M2 MacBook Air: ~180 ms -> ~2.4 Tflops (single prec)
This is 2 minutes of benchmarking on the computers I have. It is an apples-to-oranges comparison (e.g. I use the default NumPy BLAS on each platform), but not completely irrelevant to what people will do without much effort. And floating point is what matters for LLMs, not integer computation (which is most likely what bottlenecks the Ruby test suite).
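For anyone who wants to reproduce this, a minimal sketch of the benchmark (my timing method may differ; the 2*N^3 flop count is the standard estimate for a dense matmul):

```python
import time
import numpy as np

N = 6000
a = np.random.rand(N, N).astype(np.float32)

a @ a  # warm-up run so one-time BLAS setup cost isn't measured
start = time.perf_counter()
a @ a
elapsed = time.perf_counter() - start

# A dense N x N matrix product costs roughly 2*N^3 floating point ops.
tflops = 2 * N**3 / elapsed / 1e12
print(f"{elapsed * 1000:.0f} ms -> {tflops:.2f} Tflop/s (single prec)")
```

Which BLAS NumPy links against (OpenBLAS on most Linux wheels, Accelerate on macOS) dominates the result, which is exactly the "default setup" comparison being made here.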
It's all about the memory bandwidth.
Apple M chips are slower at computation than AMD chips, but they have fast on-package soldered RAM with a wide memory interface, which is very useful on workloads that move a lot of data.
Strix Halo has a 256-bit LPDDR5X interface, twice as wide as a typical desktop chip's, roughly equal to the M4 Pro's and half that of the M4 Max.
You're most likely bottlenecked by memory bandwidth for an LLM.
The AMD Ryzen AI Max+ 395 gives you 256 GB/s. The M4 gives you 120 GB/s, and the M4 Pro gives you 273 GB/s. The M4 Max: 410 GB/s (14-core CPU/32-core GPU) or 546 GB/s (16-core CPU/40-core GPU).
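The back-of-envelope version of why bandwidth dominates: during decode, each generated token has to stream every model weight from RAM once, so tokens/s is capped at roughly bandwidth divided by model size. A sketch using the bandwidth figures above (the 8 GB model size is an assumed example, e.g. a quantized mid-size model):

```python
# Rough decode-speed ceiling: each token reads all weights once, so
# tokens/s <= memory bandwidth / model size in bytes.
model_bytes = 8e9  # assumed example: an ~8 GB quantized model

bandwidth_gbs = {
    "AMD Ryzen AI Max+ 395": 256,
    "M4": 120,
    "M4 Pro": 273,
    "M4 Max (40-core GPU)": 546,
}

ceilings = {chip: bw * 1e9 / model_bytes for chip, bw in bandwidth_gbs.items()}
for chip, tps in ceilings.items():
    print(f"{chip}: <= {tps:.0f} tokens/s")
```

This is a ceiling, not a prediction: prompt processing and long-context attention are compute-bound, which is the point made in the next comment.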
It’s both. If you’re using any real amount of context, you need compute too.
Yeah, memory bandwidth is often the limitation for floating point operations.