https://llm-tracker.info/_TOORG/Strix-Halo has very comprehensive test results for running llama.cpp on Strix Halo. This one is particularly interesting:
> But when we switch to longer context, we see something interesting happen. WMMA + FA basically loses no performance at this longer context length!
> Vulkan + FA still has better pp but tg is significantly lower. More data points would be better, but seems like Vulkan performance may continue to decrease as context extends while the HIP+rocWMMA backend should perform better.
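If you want to reproduce that kind of sweep on your own hardware, something like the sketch below works. It is a minimal sketch, not lhl's actual harness: the build paths and the model file are placeholders, and it assumes you have two separate llama.cpp builds, one for Vulkan and one for HIP+rocWMMA.

```python
#!/usr/bin/env python3
"""Sweep llama-bench across prompt lengths for two backend builds.
Paths are placeholders: point them at your own Vulkan and HIP+rocWMMA
builds of llama.cpp and a local GGUF model."""
import subprocess

# Assumed paths to per-backend llama.cpp builds (placeholders).
BACKENDS = {
    "vulkan": "./build-vulkan/bin/llama-bench",
    "hip-rocwmma": "./build-hip/bin/llama-bench",
}
MODEL = "model.gguf"  # placeholder

for name, binary in BACKENDS.items():
    for n_prompt in (512, 4096, 16384):
        print(f"=== {name}, pp{n_prompt} ===")
        # -fa 1 enables flash attention; -p sets prompt tokens (pp),
        # -n sets generated tokens (tg).
        subprocess.run(
            [binary, "-m", MODEL, "-p", str(n_prompt), "-n", "128", "-fa", "1"],
            check=True,
        )
```

Running the same prompt lengths against each build is what surfaces the divergence quoted above: pp and tg throughput scale differently per backend as context grows.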
lhl has also been sharing these test results in https://forum.level1techs.com/t/strix-halo-ryzen-ai-max-395-..., and his latest comment provides a great summary of the current state:
> (What is bad is that basically every single model has a different optimal backend, and most of them have different optimal backends for pp (handling context) vs tg (new text)).
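In practice that pushes you toward keeping a per-model lookup of which build to launch. A toy sketch of the idea (every entry here is a hypothetical placeholder, not a measured result):

```python
# Hypothetical table: which backend build wins per model and per phase.
# Entries are illustrative placeholders, not benchmark results.
OPTIMAL_BACKEND = {
    "model-a": {"pp": "vulkan", "tg": "hip-rocwmma"},
    "model-b": {"pp": "hip-rocwmma", "tg": "hip-rocwmma"},
}

def pick_backend(model: str, prefer: str = "tg") -> str:
    """Pick a build for a model. A single llama.cpp process uses one
    backend for both pp and tg, so you must decide which phase matters
    more for your workload."""
    return OPTIMAL_BACKEND.get(model, {}).get(prefer, "vulkan")
```

The awkward part, as lhl's summary implies, is that pp and tg can disagree while one process can only use one backend, so you end up optimizing for whichever phase dominates your workload.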
Anyway, for me, the greatest thing about the Strix Halo + llama.cpp combo is that we can throw one or more eGPUs into the mix, as echoed in the Level1Techs video (https://youtu.be/ziZDzrDI7AM?t=485), which should help a lot with pp performance.
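llama.cpp can already split a model across several devices, so a minimal sketch of what that might look like, assuming a HIP build that enumerates both the Strix Halo iGPU and the eGPU (binary path, model, device order, and split ratio are all placeholders you would tune after checking --list-devices):

```python
import subprocess

# Placeholders throughout: adjust the binary path, model, and ratios.
subprocess.run(
    [
        "./build-hip/bin/llama-server",
        "-m", "model.gguf",
        "-ngl", "99",    # offload all layers to GPU
        "-sm", "layer",  # split the model by layer across devices
        "-ts", "1,2",    # tensor-split ratio, iGPU:eGPU (placeholder)
        "-mg", "1",      # index of the main GPU (here, guessing the eGPU)
    ],
    check=True,
)
```

The intuition is that pp is compute-bound, so routing the bulk of the layers to a faster discrete card should lift prompt processing even while the model weights mostly sit in the Strix Halo's large unified memory.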