>The AMD 395+ uses unified memory, like Apple, so nearly all of it is addressable by the GPU.
This is why they went with the “laptop” CPU. While it’s slightly slower than dedicated memory, it allows you to run the big models, at decent token speeds.
Geerling benchmarked LLM performance on the Framework Desktop and the results look pretty lackluster to me. First, the software seems really immature: he couldn't get ROCm or the NPU working. When he finally got the iGPU working with Vulkan, he could only generate 5 tok/s with Llama 3.1 70B (a ~40 GB model). That's intolerably slow for anything interactive like coding or chatting, but I suppose that's a matter of opinion.
https://github.com/geerlingguy/ollama-benchmark/issues/21
Ryzen AI Max is best with ~100B MoE models rather than large monolithic ones. For example, OpenAI's gpt-oss-120b runs at around 40 tok/s and beats Llama 3.1 70B on most/all benchmarks.
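A rough way to see why the MoE models do so much better here: token generation is mostly memory-bandwidth-bound, so the ceiling is roughly bandwidth divided by the bytes read per token (the active parameters, not the total). A back-of-envelope sketch follows; the ~256 GB/s bandwidth figure and the active-parameter counts are my own assumptions for illustration, not numbers from the benchmark above:

    # Rough decode-speed ceiling: memory bandwidth / bytes read per generated token.
    # All figures are assumptions for illustration, not benchmark results.
    BANDWIDTH_GB_S = 256  # assumed: Ryzen AI Max+ 395, 256-bit LPDDR5X-8000

    def decode_ceiling_tok_s(active_params_billion: float, bytes_per_param: float) -> float:
        bytes_per_token = active_params_billion * 1e9 * bytes_per_param
        return BANDWIDTH_GB_S * 1e9 / bytes_per_token

    # gpt-oss-120b: ~5.1B active params per token, ~4-bit (MXFP4) weights
    print(f"gpt-oss-120b ceiling: ~{decode_ceiling_tok_s(5.1, 0.5):.0f} tok/s")
    # Llama 3.1 70B: all 70B params touched per token, ~4-bit quant
    print(f"Llama 3.1 70B ceiling: ~{decode_ceiling_tok_s(70, 0.5):.0f} tok/s")

The measured ~40 tok/s and ~5 tok/s sit below those ceilings, as you'd expect, but the ratio between them is what makes the MoE usable and the dense 70B painful.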
Prompt processing speeds are pretty poor too imo. I was interested in one to be able to run some of the ~100B MoEs, but since they only manage 50-150 tok/s (depending on the model), it would take 5-ish minutes to process a 32k context, which would be unbearably slow for me. I've just looked at the results in that link and it's even worse for the 70Bs: nearly 20 minutes to process a 32k context.
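For concreteness, here's the arithmetic behind those wait times, using the prompt-processing rates quoted above (the ~30 tok/s figure for the 70B case is my rough read of the linked results, not a number stated there):

    # Time to prefill a 32k-token prompt at various prompt-processing rates.
    PROMPT_TOKENS = 32_000

    for pp_tok_s in (150, 100, 50, 30):  # 30 tok/s is a rough guess for the 70B case
        minutes = PROMPT_TOKENS / pp_tok_s / 60
        print(f"{pp_tok_s:>3} tok/s -> {minutes:4.1f} min before the first generated token")

That works out to roughly 3.6 to 10.7 minutes across the quoted MoE range and close to 18 minutes at 30 tok/s, which lines up with the "5-ish" and "nearly 20" minute figures.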
I guess that depends on your definition of "decent". For the smaller models that fit on a 16/24/32 GB Nvidia card, the chip is anywhere between 3x and 10x slower than, say, a 4080 Super or a 3090, which are relatively cheap used.
The biggest limitations are memory bandwidth, which caps token generation speed, and the fact that it's not a CUDA chip, meaning a longer time to first token for theoretically similar hardware specs.
Any model bigger than what fits in 32 GB of VRAM is - in my opinion - currently unusable on "consumer" hardware. Perhaps a tinybox with 144 GB of VRAM and close to 6 TB/s of memory bandwidth will get you a nice experience on consumer-grade hardware, but it's quite the investment (and power draw).
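To put numbers on that generation gap: spec-sheet memory bandwidth alone explains most of it, since decode is bandwidth-bound, while the wider end of the 3x-10x range comes from prompt processing, where compute and the CUDA software stack matter more. A rough comparison using spec-sheet figures (my own summary, not measurements from this thread):

    # Spec-sheet memory bandwidth (GB/s); token generation scales roughly with this.
    bandwidth_gb_s = {
        "Ryzen AI Max+ 395 (256-bit LPDDR5X-8000)": 256,
        "RTX 4080 Super": 736,
        "RTX 3090": 936,
    }

    base = bandwidth_gb_s["Ryzen AI Max+ 395 (256-bit LPDDR5X-8000)"]
    for name, bw in bandwidth_gb_s.items():
        print(f"{name}: {bw} GB/s ({bw / base:.1f}x)")

So for models that actually fit in the Nvidia cards' VRAM, bandwidth alone predicts roughly a 3-3.7x generation gap, with prompt processing pushing the practical difference toward the higher end.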
I think it depends on the use case; slow isn't that bad if you're asking questions infrequently. I downloaded a model a few weeks ago that was roughly 80 GB and ran it on my 3090 just to see how it was… and it was okay. Fast? Nope. But it did it. If the answers were materially better I'd be happy to wait a minute for the output, but they weren't. I'd like to try one of the really large ones, just to see how slow it is, but I need to clear some space to even download it.
unified and soldered :(
I understand it's faster but still...
Did they at least do an internal PSU if they went the Apple way, or does it come with a power brick twice the size of the case?
Edit: wait. They do have an internal PSU! Goodness!
For anyone else curious, they actually have a deep dive on the subject. You can also replace it with another since it's just FlexATX, albeit with some requirements.
https://community.frame.work/t/framework-desktop-deep-dive-p...
LPDDR5. It looks like this is the only RAM type that's going to work with Ryzen AI chips.
Framework offers Ryzen AI laptops with replaceable memory.
That RAM is slower than soldered-down LPDDR5 at the moment.
I'm curious how something like CUDIMM memory would perform under the same workloads.
I currently avoid machines with soldered memory, but if the memory could be replaced while keeping similar performance, that would change things.
The PCI lanes just aren't there yet, which is why they went with the soldered-memory approach to shorten the gap and increase the speed. The memory the Framework Desktop uses (LPDDR5X) doesn't come in slotted form.
You absolutely can (and should) build your own for slightly cheaper. Just find the fastest DDR5 CUDIMMs you can, paired with a mobo with the fastest memory bus.
> it allows you to run the big models, at decent token speeds
Without CUDA, since it's an AMD GPU. That's a big caveat depending on the tools you want to use.
There are solutions for that, though.
https://docs.scale-lang.com/stable/
https://github.com/vosen/ZLUDA
It's not perfect, but it's a start towards unification. In the end, though, we're at the same crossroads graphics drivers were at when Apple sunset OpenGL and Khronos announced Vulkan. CUDA has been around for a while and only recently has it gotten attention from the other chip makers. Thank goodness for open source and the collective minds that participate.
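One concrete bright spot beyond those compatibility layers: PyTorch's ROCm builds expose AMD GPUs through the familiar torch.cuda namespace (via HIP), so a lot of "CUDA-only" Python code runs unmodified, assuming ROCm actually works on your chip, which, per the benchmark discussion above, isn't yet a given on this one. A minimal sketch, assuming a ROCm-enabled PyTorch install:

    # Minimal check that a ROCm build of PyTorch is picking up the AMD GPU.
    # "cuda" here maps to the HIP/ROCm backend on such builds.
    import torch

    print(torch.cuda.is_available())   # True if the ROCm runtime sees the GPU
    print(torch.version.hip)           # HIP version string on ROCm builds, None on CUDA builds

    x = torch.randn(2048, 2048, device="cuda")
    y = x @ x                          # runs on the AMD GPU despite the "cuda" name
    print(y.device, float(y.sum()))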