I was baffled by the comparison to the M4 Max. Does this mean that recent AMD chips will be performing at the same level, and what does that mean for on-device LLMs? .. or am I misunderstanding this whole ordeal?
Yes, the Strix series of AMD uses a similar architecture as M series with massive memory bandwidth and big caches.
That results in significantly better performance.
Isn't this the desktop architecture that Torvalds suggested years ago?
I don't know, but it's primarily very expensive to manufacture and hard to make expandable. You can see people raging about the soldered RAM in this thread.
There's always tradeoffs and people propose many things. Selling those things as a product is another game entirely.
It basically looks like a games console. It's not a conceptually difficult architecture: "what if the GPU and the CPU had the same memory?". Good things indeed.
A faster and bigger SRAM cache is about as complicated a solution as adding moar boosters to your rocket. It works, but it's expensive. The RP2040 spends roughly 8x more die space on its RAM than on its dual CPU cores.
Do I misunderstand your message here or are you comparing this desktop machine to an embedded microcontroller from Raspberry Pi Limited?
This is how the Amiga worked 40 years ago...
Will we be able to get similar bandwidth with socketed ram with CAMM / LPCAMM modules in the near future?
Maybe, but due to the physics of signal integrity, socketed RAM will always be slower than RAM integrated onto the same PCB as whatever processing element is using it, so by the time CAMM / LPCAMM catches up, some newer integrated RAM solution will be faster yet.
This is a matter of physics. It can't be "fixed." Signal integrity is why classic GPU cards have GiBs of integrated RAM chips: GPUs with non-upgradeable RAM that people have been happily buying for years now.
Today, the RAM requirements of GPUs and their applications have become so large that the extra low-cost, slow, socketed RAM is now a false economy. Naturally, therefore, it's being eliminated as PCs evolve into big GPUs, with one flavor or other of traditional ISA processing elements attached.
It’s possible that Apple really did a disservice to soldered RAM by making it a key profit-increasing option for them, exploiting the inability of buyers to buy RAM elsewhere or upgrade later, but in turn making soldered RAM seem like a scam, when it does have fundamental advantages, as you point out.
Going from 64 GB to 128 GB of soldered RAM on the Framework Desktop costs €470, which doesn’t seem that much more expensive than fast socketed RAM. Going from 64 GB to 128 GB on a Mac Studio costs €1000.
Ask yourself this: what is the correct markup for delivering this nearly four years before everyone else? Because that's what Apple did, and why customers have been eagerly paying the cost.
Let us all know when you've computed that answer. I'll be interested, because I have no idea how to go about it.
I had 128GB of RAM in my desktop from nearly a decade ago. I'm not sure what exactly Apple invented here.
Yeah, it's not really about jamming more DIMMs into more sockets.
Of course it isn't... the point stands... Apple didn't actually invent anything in that regard.
Is the problem truly down to physics or is it down to the stovepiped and conservative attitudes of PC part manufacturers and their trade groups like JEDEC? (Not that consumers don't play a role here too).
The only essential part of sockets vs solder is the metal-metal contacts. The size of the modules and the distance from the CPU/GPU are all adjustable parameters if the will exists to change them.
> Is the problem truly down to physics
Yes. The "conservative attitudes" of JEDEC et al. are a consequence of physics and the capabilities of every party involved in dealing with it, from the RAM chip fabricators and PCB manufacturers, all the way to you, the consumer, and the price you're willing to pay for motherboards, power supplies, memory controllers, and yield costs incurred trying to build all of this stuff, such that you can sort by price, mail order some likely untested combination of affordable components and stick them together with a fair chance that it will all "work" within the power consumption envelope, thermal envelope, and failure rate you're likely to tolerate. Every iteration of the standards is another attempt to strike the right balance all the way up and down this chain, and at the root of everything is the physics of signal integrity, power consumption, thermals and component reliability.
As I said, consumers play a part here too. But I don't see the causal line from the physics to the stagnation, stovepiping, artificial market segmentation, and cartelization we see in the computer component industries.
Soldering RAM has always been around and it has its benefits. I'm not convinced of its necessity however. We're just now getting a new memory socket form factor but the need was emerging a decade ago.
> The only essential part of sockets vs solder is the metal-metal contacts.
Yeah... And that’s a pretty damn big difference. A connector is always going to result in worse signal integrity than a high-quality solder joint in the real world.
Is that really the long pole in the tent, though?
No doubt the most tightly integrated package can outperform a looser collection of components. But if we could shorten the distances, tighten the tolerances, and have the IC companies work on improving the whole landscape instead of just narrow, disjointed pieces slowly one at a time, then would the unsoldered connections still cause a massive performance loss or just a minor one?
Yes. Signal integrity is so finicky at the frequencies DRAM operates at that it starts to matter whether the plated holes that complete the circuit are drilled all the way through the board or stopped halfway, because signals permeate into the stubs of the holes and reflect back into the trace, causing interference. Adding a connector between RAM and CPU is like extending that long pole in the tent in the middle by inserting a stack of elephants into what is already shaped like an engine crankshaft found in a crashed wreck of a car.
Besides, no one strictly needs mid-life upgradable RAM. You just want to be able to upgrade RAM after purchase because it's cheaper upfront, and because it leaves less room for supply-side price gouging. Those aren't technical reasons; nothing technical stops you from optioning 2TB of RAM at purchase and being done for 10 years.
In the past, at least, RAM upgrades weren't just about filling in the slots you couldn't afford to fill on day one. RAM modules also got denser and faster over time. This meant that after waiting a couple of years you could add more and better RAM to your system than was even physically possible to install upfront.
Part of the reason I have doubts about the physical necessity here is because PCI Express (x16) is roughly keeping up with GDDR in terms of bandwidth. Of course they are not completely apples-to-apples comparable, but it proves at least that it's possible to have a high-bandwidth unsoldered interface. I will admit though that what I can find indicates that signal integrity is the biggest issue each new generation of PCIe has to overcome.
It's possible that the best solution for discrete PC components will be to move what we today call RAM onto the CPU package (which is also very likely to become a CPU+GPU package) and then keep PCIe x16 around to provide another tier of fast but upgradeable storage.
I am personally dealing with PCIe signal integrity issues at work right now, so I can say yes, it’s incredibly finicky once you start going outside of the simple “slot below CPU” normal situation. And I only care about Gen 3 speeds right now.
But in general yes, PCIe vs RAM bandwidth is like comparing apples to watermelons. One’s bigger than the other and they’re both fruits, but they’re not the same thing.
Generally people don’t talk about random-access PCIe latency because it generally doesn’t matter. You’re looking at a best-case 3x latency penalty for PCIe vs RAM, usually more like an order of magnitude or more. PCIe is really designed for maximum throughput, not minimum latency. If you make the same tradeoffs with RAM you can start tipping the scale the other way - but people really care about random access latency in RAM (almost like it’s in the name) so that generally doesn’t happen outside of specific scenarios. 500ns 16000MT/s RAM won’t sell (and would be a massive pain - you’d probably need to 1.5x bus width to achieve that, which means more pins on the CPU, which means larger packages, which means more motherboard real estate taken and more trace length/signal integrity concerns, and you’d need to somehow convince everyone to use your new larger DIMM...).
You can also add more memory channels to effectively double/quadruple/sextuple memory bandwidth, but again, package constraints + signal integrity increase costs substantially. My Threadripper Pro system does ~340GB/s and ~65ns latency (real world) with 8 memory channels - but the die is huge, CPUs are expensive as hell, and motherboards are also expensive as hell. And for the first ~9 months after release the motherboards all struggled heavily with various RAM configurations.
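For a rough sense of the channel math: each DDR5 channel is 64 bits (8 bytes) wide, so peak bandwidth scales linearly with channel count. A quick sketch, with DDR5-5600 assumed purely for illustration:

    # Theoretical peak bandwidth = channels * bus width (bytes) * transfer rate.
    # DDR5-5600 is an assumption for illustration; each DDR5 channel is 64 bits wide.
    def peak_gb_s(channels, mt_per_s, bus_bytes=8):
        return channels * bus_bytes * mt_per_s / 1e3  # MT/s * bytes -> GB/s

    print(peak_gb_s(2, 5600))  # typical dual-channel desktop: ~89.6 GB/s
    print(peak_gb_s(8, 5600))  # 8-channel workstation: ~358 GB/s theoretical,
                               # in the same ballpark as ~340 GB/s measured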
> The only essential part of sockets vs solder is the metal-metal contacts
And at GHz speeds that matters more than you may think.
Perhaps it's time to introduce L4 Cache and a new Slot CPU design where RAM/L4 is incorporated into the CPU package? The original Slot CPUs that Intel and AMD released in the late 90s were to address similar issues with L2 cache.
How much higher bandwidth, percentage wise, can one expect from integrated DRAM vs socketed DRAM? 10%?
Intel's Arrow Lake platform launched in fall 2024 is the first to support CUDIMMs (clock redriver on each memory module) and as a result is the first desktop CPU to officially support 6400MT/s without overclocking (albeit only reaching that speed for single-rank modules with only one module per channel). Apple's M1 Pro and M1 Max processors launched in fall 2021 used 6400MT/s LPDDR5.
Intel's Lunar Lake low-power laptop processors launched in fall 2024 use on-package LPDDR5x running at 8533MT/s, as do Apple's M4 Pro and M4 Max.
So at the moment, soldered DRAM offers 33% more bandwidth for the same bus width, and is the only way to get more than a 128-bit bus width in anything smaller than a desktop workstation.
Smartphones are already moving beyond 9600MT/s for their RAM, in part because they typically only use a 64-bit bus width. GPUs are at 30000MT/s with GDDR7 memory.
I was surprised by the previous comparison on the omarchy website, because Apple M-series chips work really well for data science work that doesn't require a GPU.
It may be explained by integer vs float performance, though I am too lazy to investigate. A weak data point, using the matrix product of an N=6000 matrix with itself in numpy:
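Roughly this sort of thing, timed with whatever BLAS the platform's numpy ships with (the exact numbers will of course differ per machine):

    # Time a dense N=6000 matrix product and convert to GFLOP/s.
    import time
    import numpy as np

    N = 6000
    a = np.random.rand(N, N)

    start = time.perf_counter()
    c = a @ a
    elapsed = time.perf_counter() - start

    # A dense matmul is ~2*N^3 floating-point operations.
    print(f"{elapsed:.2f} s, ~{2 * N**3 / elapsed / 1e9:.0f} GFLOP/s")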
This is 2 mins of benchmarking on the computers I have. It is not an apples-to-apples comparison (e.g. I use the default numpy BLAS on each platform), but it's not completely irrelevant to what people will do without much effort. And floating point is what matters for LLMs, not integer computation (which is what the Ruby test suite is most likely bottlenecked by).
It's all about the memory bandwidth.
Apple M chips are slower on computation than AMD chips, but they have soldered on-package fast RAM with a wide memory interface, which is very useful on workloads that handle lots of data.
Strix halo has a 256-bit LPDDR5X interface, twice as wide as the typical desktop chip, roughly equal to the M4 Pro and half of that of the M4 Max.
You're most likely bottlenecked by memory bandwidth for a LLM.
The AMD AI MAX 395+ gives you 256GB/sec. The M4 gives you 120GB/s, and the M4 Pro gives you 273GB/s. The M4 Max: 410GB/s (14‑core CPU/32‑core GPU) or 546GB/s (16‑core CPU/40‑core GPU).
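Those figures fall straight out of transfer rate x bus width, and they also give a crude ceiling on local LLM token rates, since a dense model streams essentially all of its weights for every generated token. A back-of-the-envelope sketch (bus widths and speeds are the commonly quoted specs; the ~40GB quantized 70B model is just an example):

    # Peak bandwidth = transfer rate (MT/s) * bus width (bits) / 8.
    def gb_s(mt_per_s, bus_bits):
        return mt_per_s * bus_bits / 8 / 1e3

    print(gb_s(8000, 256))  # AMD AI MAX 395+ (Strix Halo): 256 GB/s
    print(gb_s(7500, 128))  # M4:                           120 GB/s
    print(gb_s(8533, 256))  # M4 Pro:                      ~273 GB/s
    print(gb_s(8533, 384))  # M4 Max (binned):             ~410 GB/s
    print(gb_s(8533, 512))  # M4 Max (full):               ~546 GB/s

    # Rough token-rate ceiling: bandwidth / bytes read per token.
    # Assume a ~40 GB quantized 70B model purely for illustration.
    for bw in (256, 546):
        print(f"{bw} GB/s -> ~{bw / 40:.0f} tok/s upper bound")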
It’s both. If you’re using any real amount of context, you need compute too.
Yeah, memory bandwidth is often the limitation for floating point operations.
An M4 Max has double the memory bandwidth and should run away with similarly optimized benchmarks.
An M4 Pro is the more appropriate comparison. I don't know why he's doing price comparisons to a Mac Studio when you can get a 64GB M4 Pro Mac Mini (the closest price/performance comparison point) for much less.
> don't know why he's doing price comparisons to a Mac Studio when you can get a 64GB M4 Pro Mac Mini (the closest price/performance comparison point) for much less.
Where?
An M4 Pro Mac Mini is priced higher than the Framework here in Canada...
I think DHH compares them because they are both the latest, top-of-the-line chips. I think DHH's benchmarks show that they have different performance characteristics. But DHH's favorite benchmark favors whatever runs native Linux and Docker.
For local LLMs, the higher memory bandwidth of the M4 Max makes it much more performant.
Arstechnica has more benchmarks for non-llm things https://arstechnica.com/gadgets/2025/08/review-framework-des...
After the App Store fight, DHH's favorite is whatever is not Apple lol. TBF it just opened his eyes to alternatives, and now he's happy off that platform.
How long until he clashes with the GPL and discovers the BSDs?
Why would that happen? The GPL doesn't conflict at all with anything 37signals does, nor with the Rails ecosystem...
Not now, but after listening to podcasts with him I think he's someone who would tackle hard stuff like drivers or DSP (so-called math-genius-level coding) as soon as it becomes more accessible to him through AI-assisted coding.
There is a chance to build a real macOS/iOS alternative without a JVM abstraction layer on top like Android has. The reason it hasn't happened yet is the GPL firewall around the Linux kernel, imo.
What app store fight?
https://37signals.com/podcast/this-again-apple/
Not in perf/watt but perf, yes.
Depends on the benchmark, I think. In this case it's probably close. Apple is cagey when it comes to power draw or clock metrics, but I believe the M4 Max has been seen drawing around 50W in loaded scenarios. Meanwhile, Phoronix clocked the 395+ as drawing an average of 91 watts during their benchmarks. If it's ~twice as fast, that works out to similar performance per watt. Needless to say it's at least not a dramatic difference the way it was when the M1 came out.
edit: Though the M4 Max may be more power hungry than I'm giving it credit for; it's hard to say, because I can't figure out if some of these power draw metrics from random Internet posts actually isolate the M4 itself. It looks like when the GPU is loaded it goes much, much higher.
https://old.reddit.com/r/macbookpro/comments/1hkhtpp/m4_max_...
Macs have faster memory access, so no, Macs are faster for LLMs.
It's not baffling once you realize TSMC is the main defining factor for all these chips; Apple Silicon is simply not that special in the grand scheme of things.
Why do you think TSMC's production being in Taiwan is basically a national security issue for the U.S. at this point?
> Apple Silicon is simply not that special in the grand scheme of things
Apple Silicon might not be that special from an architecture perspective (although treating integrated GPUs as appropriate for workloads other than low end laptops was a break with industry trends), but it’s very special from an economic perspective. The Apple Silicon unit volumes from iPhones have financed TSMC’s rise to semiconductor process dominance and, it would appear, permanently dethroned Intel.
Apple was just the highest bidder for getting the latest TSMC process. They wouldn't have had a problem getting other customers to buy up that capacity. And Intel's missteps counted for a substantial part of the process dominance you refer to. So I'd argue that Apple isn't that special here either.
Until Apple forced other chip makers to respond, nobody else was making high end phone processors. And their A series processors are competitive with and have transistor counts comparable to most mobile and desktop PC processors (and have for years). So the alternate reality where Apple isn't a TSMC customer means that TSMC is no longer manufacturing several hundred million high transistor count processors per year. In my opinion, it’s pretty likely TSMC isn’t able to achieve or maintain process dominance without that.
Update: To give an idea of the scales involved here, Apple had iPhone revenue in 2024 of about $200B. At an average selling price of $1k, we get 200 million units. That's a ballpark estimate; they don't release unit volumes, AFAIK. This link from IDC[1] has the global PC market in 2024 at about 267 million units. Apple also has iPads and Macs, so their unit processor volume is roughly comparable to the entire PC market. But, and this is hugely important: every single processor that Apple ships is comparable in performance (and, thus, transistor count) to high end PC processors. So their transistor volume probably exceeds the entire PC CPU market. And the majority of it is fabbed on TSMC's leading process node in any given year.
[1]: https://my.idc.com/getdoc.jsp?containerId=prUS53061925
Exactly. This is why competition is good. Intel really didn't have a reason to push as hard.
I don't think there is a laptop that comes close to the battery life, or the performance while on battery, of the M1 MacBook Pro.
I hate Apple, but there is obviously something special about it.
I'm pretty sure many of the Windows laptops with the Qualcomm Snapdragon Elite chip have the same or better battery life and comparable performance in a similar form factor. There are many videos online of comparisons.