I was impressed by the lack of dominance of Thunderbolt:
"Next I tested llama.cpp running AI models over 2.5 gigabit Ethernet versus Thunderbolt 5"
Results from that graph showed only a ~10% benefit from TB5 vs. Ethernet.
Note: The M3 Studios support 10Gbps Ethernet, but that wasn't tested; the comparison used 2.5Gbps Ethernet.
If 2.5G Ethernet was only ~10% slower than TB5, how would 10G Ethernet have fared?
Also, TB5 has to be wired so that every machine is connected directly to every other machine over TB, limiting you to 4 Macs.
By comparison, with Ethernet you could use a hub & spoke configuration with an Ethernet switch, theoretically letting you use more than 4 machines.
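Rough port-and-cable math for that full-mesh requirement (just a sketch, assuming one direct TB5 cable per pair of machines):

    # Full TB5 mesh: every Mac needs a direct cable to every other Mac.
    def full_mesh(n_machines: int) -> tuple[int, int]:
        ports_per_machine = n_machines - 1           # one TB5 port per peer
        cables = n_machines * (n_machines - 1) // 2  # each cable is shared by two machines
        return ports_per_machine, cables

    for n in (2, 3, 4, 5, 8):
        ports, cables = full_mesh(n)
        print(f"{n} machines: {ports} TB5 ports each, {cables} cables")

    # With an Ethernet switch (hub & spoke), each machine needs only 1 port,
    # so the cluster size is bounded by switch ports instead.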
Based on past experience with llama.cpp RPC, 10G Ethernet would only marginally speed things up; lower latency helps much more, but even then there are diminishing returns with that kind of layer split.
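Quick back-of-the-envelope for why that is (the hidden size, precision, and round-trip latency below are assumptions for illustration, not numbers from the video):

    # Per-token traffic crossing one layer-split boundary in a pipeline split.
    # Assumed: hidden size 8192 (70B-class model), fp16 activations,
    # 0.5 ms network round trip per hop.
    hidden_size = 8192
    bytes_per_token = hidden_size * 2   # fp16 activations for one token
    rtt_s = 0.5e-3                      # assumed fixed round-trip latency

    for name, gbps in [("2.5GbE", 2.5), ("10GbE", 10), ("TB5 80Gb/s", 80)]:
        transfer_s = bytes_per_token * 8 / (gbps * 1e9)
        print(f"{name:11s} transfer {transfer_s * 1e6:5.1f} us, "
              f"per-token hop cost ~{(transfer_s + rtt_s) * 1e6:6.1f} us")

    # Transfer time is ~52 us at 2.5GbE and ~1.6 us on TB5, but the fixed
    # latency term dominates either way, hence the diminishing returns.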
This video tests the setup using 10Gbps Ethernet: https://www.youtube.com/watch?v=4l4UWZGxvoc
That's llama.cpp, which didn't scale nearly as well in those tests, presumably because it's not optimized yet.
RDMA is always going to have lower overhead than Ethernet, isn't it?
Possibly RDMA over Thunderbolt. But for RoCE (RDMA over Converged Ethernet), obviously not, because it's sitting on top of Ethernet. It could still have higher throughput once you factor in the CPU time spent running custom protocols that a smart NIC could instead just DMA, but the protocol overhead is still definitively higher.
What do you think "Ethernet's overhead" is?
Header and FCS, interpacket gap, and preamble. What do you think "Ethernet overhead" is?
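For the on-the-wire sense of the term, those fixed per-frame costs add up to 38 bytes, which is only a couple of percent at standard MTU (assuming untagged frames and the minimum interpacket gap):

    # Fixed per-frame Ethernet cost on the wire: untagged frame, minimum IPG.
    preamble_sfd = 8       # 7-byte preamble + 1-byte start frame delimiter
    header = 14            # dest MAC + src MAC + EtherType
    fcs = 4                # frame check sequence
    interpacket_gap = 12   # minimum gap, in byte times

    overhead = preamble_sfd + header + fcs + interpacket_gap   # 38 bytes
    for payload in (46, 1500, 9000):   # minimum payload, standard MTU, jumbo
        share = overhead / (payload + overhead)
        print(f"payload {payload:5d} B: {share:.1%} of wire bytes are overhead")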
I meant in microseconds, sorry if that wasn't clear, given that the discussion I replied to was about RPC latency.
That's a very nebulous metric. Microseconds of overhead depend on a lot of runtime factors and a lot of hardware options and design decisions that I'm just not privy to.