Benchmarks are here: https://kyuz0.github.io/amd-strix-halo-vllm-toolboxes/
Would love to see DeepSeek V4 flash/pro and MiniMax M3 benchmarks, but these are already pretty impressive. This is the first Strix Halo setup I've seen with some serious performance.
EDIT: Apologies - I think I misunderstood these benchmarks. It seems this is actually very slow compared to an M4 or M5 chip with a good amount of memory. Looking at the creator's video here: https://youtu.be/Cfl3TS7ME5s?t=734 it seems Strix Halo is much, much slower than my M4 MBP, which gets ~400 tok/s prefill and ~20 tok/s generation.
They are heavily bogged down by memory bandwidth, unfortunately. The Macs are on another level. If Apple decides to release dedicated AI hardware, it would dominate this space (consumer AI).
The pp speeds are really slow (50), so I think there's still room for improvement.
Ah yeah, after watching one of the creator's YouTube videos I realize these benchmarks combine prefill and decode, which isn't super helpful. It seems this struggles with the exact same bottleneck as all Strix Halo setups: memory bandwidth. It still looks significantly slower than Mac hardware with an equivalent amount of memory.
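To illustrate why a combined number hides the two phases, here is a rough sketch. The prompt and output lengths are made up, the rates are the M4 MBP figures quoted upthread (reading ~400 as tok/s), and "total tokens over total time" is only an assumption about how a combined figure might be computed, not how those benchmarks actually do it.

```python
def combined_throughput(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Total tokens divided by total wall time (an assumed definition)."""
    prefill_s = prompt_tokens / prefill_tps   # time spent processing the prompt
    decode_s = output_tokens / decode_tps     # time spent generating the answer
    return (prompt_tokens + output_tokens) / (prefill_s + decode_s)


# Hypothetical request: 4000-token prompt, 200-token answer.
# Rates taken from the M4 MBP numbers quoted in the thread.
print(combined_throughput(4000, 200, prefill_tps=400, decode_tps=20))  # 210.0
```

Under these assumptions the combined figure (210 tok/s) matches neither the 400 tok/s prefill nor the 20 tok/s generation, and it shifts with the prompt/answer ratio, so it is hard to compare across setups.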
How do the memory bandwidth specs of MacBooks compare to this?
I looked it up: 512 GB/s combined for the two-node AMD cluster, while a MacBook Pro with the base M5 has 153 GB/s. But you can get faster Macs with the M5 Pro or M5 Max.
Strix Halo, I believe, has 256 GB/s max memory bandwidth, the M5 Max has 614 GB/s, and the M3 Ultra goes up to 800 GB/s.
The Apple silicon chips basically beat everything in bandwidth. They have the highest number of memory controllers (i.e. channels) for a given capacity. That's the main party trick.
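For a feel of what those bandwidth numbers mean for generation speed, here is a back-of-the-envelope sketch. It assumes decode is purely bandwidth-bound and that every generated token reads all active weights from memory once. The 32 GB of active weights is a hypothetical figure, and the estimate ignores KV cache reads, compute limits and software overhead, so real numbers will be lower.

```python
def decode_ceiling_tps(bandwidth_gb_s, active_weights_gb):
    """Rough upper bound on decode tok/s: bandwidth / bytes read per token."""
    return bandwidth_gb_s / active_weights_gb


ACTIVE_WEIGHTS_GB = 32.0  # hypothetical: weights touched per generated token

# Bandwidth figures (GB/s) as quoted in the thread.
for name, bandwidth in [("Strix Halo", 256), ("M5 Max", 614), ("M3 Ultra", 800)]:
    ceiling = decode_ceiling_tps(bandwidth, ACTIVE_WEIGHTS_GB)
    print(f"{name}: at most ~{ceiling:.1f} tok/s")
```

The ceiling scales linearly with bandwidth, which is why the same model would be expected to generate roughly 2-3x faster on the higher-bandwidth Macs under these assumptions.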