The top Mac Studio has six Thunderbolt 5 ports, each of which is a PCIe 4.0 x4 link. Each is an 8GB/sec link in each direction, which is a lot. Going from x16 down to x4 has less than a 10% hit on games: https://www.reddit.com/r/buildapc/comments/sbegpb/gpu_in_pci...
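For anyone who wants to check the 8GB/sec figure, here is a rough sketch of the arithmetic. It assumes only the published per-lane signalling rates and the 128b/130b line encoding used by PCIe 3.0 through 5.0, and it ignores packet and tunnelling overhead, so real throughput will be somewhat lower. The function name `pcie_gb_per_s` is made up for illustration.

```python
# Rough per-direction PCIe throughput from per-lane signalling rates.
# PCIe 3.0, 4.0 and 5.0 all use 128b/130b line encoding.
# Packet/protocol overhead is NOT modelled, so these are upper bounds.
GT_PER_S = {3: 8.0, 4: 16.0, 5: 32.0}  # gigatransfers per second, per lane


def pcie_gb_per_s(gen: int, lanes: int) -> float:
    """Approximate usable GB/s in one direction for a PCIe link."""
    raw_gbit = GT_PER_S[gen] * lanes      # raw Gbit/s across all lanes
    payload_gbit = raw_gbit * 128 / 130   # strip 128b/130b encoding overhead
    return payload_gbit / 8               # bits -> bytes


if __name__ == "__main__":
    for gen, lanes in [(4, 4), (4, 8), (4, 16), (3, 8)]:
        rate = pcie_gb_per_s(gen, lanes)
        print(f"PCIe {gen}.0 x{lanes}: {rate:.2f} GB/s per direction")
```

By this estimate PCIe 4.0 x4 and PCIe 3.0 x8 come out the same, which is presumably why benchmark write-ups tend to group those two configurations together.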
Your example uses a GTX 1080, which is a very old GPU. A current flagship consumer GPU will take a harder hit on low-bandwidth PCIe.
Here’s more recent HW: https://www.pugetsystems.com/labs/articles/impact-of-gpu-pci...
This is an RTX4080.
“In the more common situations of reducing PCI-e bandwidth to PCI-e 4.0 x8 from 4.0 x16, there was little change in content creation performance: There was only an average decrease in scores of 3% for Video Editing and motion graphics. In more extreme situations (such as running at 4.0 x4 / 3.0 x8), this changed to an average performance reduction of 10%.”
A 10% performance reduction seems like a lot to be leaving on the table.
Not really.
The article is nearly 3 years old and the 4080 was not even top of the line at the time of writing.
Still, a 10% difference is considerable, almost a gen-to-gen difference.
PCIe 4.0 x4 is going to be a huge bottleneck; even recent SSDs have more throughput (they use PCIe 5.0), never mind GPUs.
Gaming isn't what people are using Mac Studios for. Thunderbolt also isn't a substitute for OCuLink.
Sure, but it’s probably reflective of the fact that GPUs generally aren’t PCIe-bandwidth bound. Also, TB5 and Oculink2 both use PCIe 4.0 x4 links.
Oculink is generally faster than TB5 despite them both using PCIe 4.0, because Oculink provides direct PCIe access whereas Thunderbolt has to route all PCIe traffic through its controller. The benchmarks show that the overhead introduced by the TB5 controller slows down GPU performance.
It's not just the controllers; the Thunderbolt protocol itself imposes different speed limits. The bit rates used by Thunderbolt aren't the same as PCIe, and PCIe traffic gets encapsulated in Thunderbolt packets.
Apple Silicon has an integrated Thunderbolt controller, so that should have less latency than PCs that use a discrete Thunderbolt controller.
Many recent laptop CPUs from Intel and AMD have integrated Thunderbolt controllers (i.e. USB 4), so that has not been a difference for a long time.
Maybe; I'm unable to find any benchmarks that specifically compare PCs with TB to Macs to test this. But there is certainly still overhead with TB no matter what, and therefore it'll never be as fast as Oculink.
That's just blatantly wrong, the performance loss of GPUs is very well documented and gets worse as you go towards higher end models. We're talking 30%+ loss of performance here.
Um, I have an M3 Ultra 512GB on my desk for development. Love me some Baldur’s Gate 3, everything turned up to 11…
Yeah 80GB/s total I/O bandwidth is a lot for a Mac, but desktop PCs have been doing 1TB/s (128x PCIe5) for years (Threadripper etc).
Sure. And lots of people need all that I/O. But my point is that it’s not like the Mac Studio has no I/O. The outgoing Mac Pro only has 24 total lanes of PCIe 4.0 going to the switch chip that’s connected to all the PCIe slots. The advent of externally routed PCIe is a development of the last few years that may have factored into the change in form factor.
- GPU is integrated into the SoC
- Surprisingly, it is possible to plug a drive into a TB/USB port
…so what do you actually need PCIe for?
High-end Macs have moved to PCIe 5.0 speeds in their internal drives. Thunderbolt 5 is not fast enough to get the same performance from external ones.
Thunderbolt is also too slow for higher-end networks. A single port is already insufficient for 100-gigabit speeds.
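A back-of-the-envelope sketch of those two claims. It assumes, as stated elsewhere in the thread, that one Thunderbolt 5 port carries PCIe 4.0 x4 worth of PCIe traffic, and it only accounts for 128b/130b line encoding, not tunnelling overhead. The names `PORT_GBIT`, `DEMANDS_GBIT` and `fraction_served` are hypothetical.

```python
# How much of each demand fits through a single port's PCIe tunnel?
# Assumption (from the thread): the tunnel is equivalent to PCIe 4.0 x4.
# Tunnelling/protocol overhead is ignored, so this is an optimistic bound.


def pcie_gbit_per_s(gt_per_s_per_lane: float, lanes: int) -> float:
    """Payload Gbit/s in one direction after 128b/130b encoding."""
    return gt_per_s_per_lane * lanes * 128 / 130


PORT_GBIT = pcie_gbit_per_s(16.0, 4)  # PCIe 4.0 x4

DEMANDS_GBIT = {
    "PCIe 5.0 x4 SSD": pcie_gbit_per_s(32.0, 4),
    "100 Gb/s Ethernet": 100.0,
}


def fraction_served(demand_gbit: float, port_gbit: float = PORT_GBIT) -> float:
    """Share of the demanded rate that one port could carry (capped at 1.0)."""
    return min(1.0, port_gbit / demand_gbit)


if __name__ == "__main__":
    print(f"One port: about {PORT_GBIT:.0f} Gbit/s of PCIe payload")
    for name, demand in DEMANDS_GBIT.items():
        print(f"{name}: {fraction_served(demand):.0%} of line rate")
```

Under these assumptions a single port covers roughly half of a PCIe 5.0 x4 drive and somewhat under two thirds of a 100-gigabit link.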
When people talk about 100-gigabit networks for Macs, I’m really curious what kind of network you run at home and how much money you spent on it. Even at work I’m generally seeing 10-gigabit network ports, with 100-gigabit+ only in data centers where Macs don’t have a presence.
Local AI is probably the most common application these days.
Apple recently added support for InfiniBand over Thunderbolt. And now almost all decent Mac Studio configurations have sold out. Those two may be connected.
> Apple recently added support for InfiniBand over Thunderbolt.
TIL:
* https://developer.apple.com/documentation/technotes/tn3205-l...
Or maybe I forgot:
* https://news.ycombinator.com/item?id=46248644
100 Gb/s Ethernet is likely to be expensive, but dual-port 25 Gb/s Ethernet NICs are not much more expensive than dual-port 10 Gb/s NICs, so whenever you are not using the Ethernet ports already included by a motherboard it may be worthwhile to go to a higher speed than 10 Gb/s.
If you use dual-port NICs, you do not need a high-speed switch, which may be expensive; you can connect the computers directly into a network and configure them as either Ethernet bridges or IP routers.
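A minimal sketch of what a switchless layout costs in forwarding hops, assuming each machine's two ports are cabled to its two neighbours in a ring. The ring layout and the function names are illustrative assumptions; the comment above does not prescribe a topology.

```python
from itertools import combinations

# Assumed layout: n machines, each with a dual-port NIC, cabled in a ring.
# Traffic between non-adjacent machines is forwarded by the machines in between
# (acting as bridges or routers).


def ring_hops(a: int, b: int, n: int) -> int:
    """Fewest links between nodes a and b in a ring of n nodes (0-indexed)."""
    d = (a - b) % n
    return min(d, n - d)


def worst_case_hops(n: int) -> int:
    """Largest hop count between any two distinct nodes in the ring."""
    return max(ring_hops(a, b, n) for a, b in combinations(range(n), 2))


if __name__ == "__main__":
    for n in (2, 3, 4, 6, 8):
        print(f"{n} machines: worst case {worst_case_hops(n)} link(s)")
```

With three machines every pair is directly connected; from four machines up, some traffic has to pass through an intermediate machine.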
I work in media production and I have the same thought constantly. Hell I curse in church as far as my industry is concerned because I find 2.5 to be fine for most of us. 10 absolutely.
100gbps is going to be for mesh networks supporting clusters (4 Mac Studios let's just say) - not for LAN type networks (unless it's in an actual datacenter).
I suppose the throughput is not the key, latency is. When you split an operation that normally ran within one machine between two machines, anything that crosses the boundary becomes orders of magnitude slower. Even with careful structuring, there are limits to how little and how rarely you can send data between nodes.
I suppose that splitting an LLM workload is pretty sensitive to that.
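One way to make the latency point concrete is the usual first-order model: time per message is a fixed latency plus payload size divided by bandwidth. The sketch below uses that model. The payload size, link speed and latency in the example are placeholder assumptions, not measurements of any real interconnect.

```python
# First-order cost model for one message crossing a machine boundary:
#   time = fixed latency + payload / bandwidth
# All example numbers below are hypothetical placeholders, not measurements.


def transfer_time_s(payload_bytes: float, bandwidth_bytes_per_s: float,
                    latency_s: float) -> float:
    """Estimated time to deliver one message."""
    return latency_s + payload_bytes / bandwidth_bytes_per_s


def latency_share(payload_bytes: float, bandwidth_bytes_per_s: float,
                  latency_s: float) -> float:
    """Fraction of the total transfer time spent on fixed latency."""
    total = transfer_time_s(payload_bytes, bandwidth_bytes_per_s, latency_s)
    return latency_s / total


if __name__ == "__main__":
    payload = 16 * 1024   # assumed small per-step message, in bytes
    latency = 10e-6       # assumed one-way latency, in seconds
    for gb_per_s in (1, 8, 16):
        share = latency_share(payload, gb_per_s * 1e9, latency)
        print(f"{gb_per_s:>2} GB/s link: {share:.0%} of the time is latency")
```

For small, frequent messages the fixed term dominates, so in this model adding bandwidth barely changes the total; only lower latency or fewer boundary crossings helps.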
To have lots of them plugged together: high-end audio cards, electronics integrations, disks, without having cables all over the place.
Things that aren’t graphics cards, such as very high-bandwidth video capture cards and any other equipment that needs a lot of lanes of PCIe data at low latency.
But what about a second GPU?
Multi-GPU was tried by the whole industry, including Apple (most notably with the trash can Mac Pro). Despite significant investment, it was ultimately a failure for consumer workloads like gaming, and was relegated to the datacenter and some very high-end workstations, depending on the workload.
Multi-GPU has recently experienced a resurgence due to the discovery of new workloads with broader appeal (LLMs), but that's too new to have significantly influenced hardware architectures, and LLM inference isn't the most natural thing to scale across many GPUs. Everybody's still competing with more or less the architectures they had on hand when LLMs arrived, with new low-precision matrix math units squeezed in wherever room can be made. It's not at all clear yet what the long-term outcome will be in terms of the balance between local vs cloud compute for inference, whether there will be any local training/fine-tuning at all, and which use cases are ultimately profitable in the long run. All of that influences whether it would be worthwhile for Apple to abandon their current client-first architecture that standardizes on a single integrated GPU and omits/rejects the complexity of multi-GPU setups.
Video capture
I/O expansion
Networking