Not sure what was unexpected about the multi GPU part.

It's well known that most LLM frameworks, including llama.cpp, split models by layers, which creates a sequential dependency, so in a multi-GPU setup all but one GPU sits idle at any moment unless there are n_gpu users/tasks running in parallel. It's also known that some GPUs are faster at "prompt processing" and others at "token generation", which is why combining Radeon and NVIDIA cards can sometimes pay off. Reportedly the inter-layer transfer sizes are in the kilobyte range and PCIe x1 is plenty, or something like that.
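
A toy sketch of the layer-split case for a single request; the layer count, sizes, and device assignment here are made up, not llama.cpp's actual scheme:

    import numpy as np

    rng = np.random.default_rng(0)
    HIDDEN, N_LAYERS = 64, 8

    # Toy "model": each layer is just a matmul; the first half of the layers
    # lives on GPU 0, the second half on GPU 1 (layer-split style).
    weights = [rng.standard_normal((HIDDEN, HIDDEN)) * 0.1 for _ in range(N_LAYERS)]
    device_of_layer = [0] * (N_LAYERS // 2) + [1] * (N_LAYERS // 2)

    def forward_one_request(x):
        busy = []                      # which GPU is doing work at each step
        for w, dev in zip(weights, device_of_layer):
            busy.append(dev)
            x = np.tanh(x @ w)         # layer i+1 can't start until layer i is done
        return x, busy

    _, busy = forward_one_request(rng.standard_normal(HIDDEN))
    print(busy)  # [0, 0, 0, 0, 1, 1, 1, 1] -> the GPUs take turns, never overlap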

It takes an appropriate backend with "tensor parallel" support, which splits the neural network parallel to the direction of data flow. That mode also benefits substantially from a good interconnect between GPUs, like PCIe x16, NVLink/Infinity Fabric bridges, and/or inter-GPU DMA over PCIe (called GPU P2P, GPUDirect, or some lingo like that).
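
A minimal sketch of what the tensor-parallel split means for one linear layer (a column split plus a gather of the partial results; real backends shard per attention head / MLP block and use NCCL collectives, so treat this as illustrative only):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal((4, 512))        # activations: (batch, hidden)
    W = rng.standard_normal((512, 2048))     # full weight of one linear layer

    # Column-parallel split: each "GPU" owns half of the output columns and can
    # run its half of the matmul at the same time as the other.
    W0, W1 = W[:, :1024], W[:, 1024:]
    y0 = x @ W0                              # GPU 0's share
    y1 = x @ W1                              # GPU 1's share, computed concurrently

    # Stitching the halves back together is a collective (all-gather/all-reduce);
    # that per-layer communication is why the interconnect matters so much here.
    y = np.concatenate([y0, y1], axis=1)
    assert np.allclose(y, x @ W)             # identical result to the unsplit layer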

Absent those, I've read that people can sometimes watch GPU utilization spikes walk across the GPUs in nvtop-style tools.

Looking for ways to break up an LLM's work into multiple tasks that can run concurrently would be interesting, maybe by creating one "manager" and a few "delegated engineer" personalities. Or simulating multiple brain domains, such as the speech center, visual cortex, language center, etc., communicating in tokens might be an interesting way to work around this problem.

There are some technical implementations that make this more efficient, like EXO [1]. Jeff Geerling recently reviewed a 4x Mac Studio cluster with RDMA support, and you can see that EXO has a noticeable advantage [2].

[1] https://github.com/exo-explore/exo [2] https://www.youtube.com/watch?v=x4_RsUxRjKU

At this point I'd consider a cluster of top-specced Mac Studios to be worthwhile in production. I'd just need to host them properly in a rack in a co-lo data center.

Honestly, I can genuinely see the value if you want to host something internally for sensitive and important information. I really hope the M5 Ultra with matmul accelerators knocks this out of the park. With the way RAM prices are trending, a Mac Studio cluster will only become more enticing.

> Looking for ways to break up an LLM's work into multiple tasks that can run concurrently would be interesting, maybe by creating one "manager" and a few "delegated engineer" personalities.

This is pretty much what "agents" are for. The manager model constructs prompts and contexts that the delegated models can work on in parallel, returning results when they're done.
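
In code it's basically fan-out/fan-in. A hypothetical sketch against an OpenAI-compatible local server (the base_url, model name, and subtasks are placeholders; the "manager" prompt that would write the subtasks is omitted to keep it short):

    import asyncio
    from openai import AsyncOpenAI

    # Assumes some local OpenAI-compatible server (llama.cpp server, vLLM, ...)
    client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="unused")

    async def engineer(task: str) -> str:
        # Each delegated task is an independent request, so the backend can batch
        # them and keep more than one GPU (or pipeline stage) busy at once.
        resp = await client.chat.completions.create(
            model="local-model",                       # placeholder name
            messages=[{"role": "user", "content": task}],
        )
        return resp.choices[0].message.content

    async def manager() -> str:
        # In the real setup this list would come from a "manager" LLM call.
        subtasks = ["review module A", "review module B", "review module C"]
        results = await asyncio.gather(*(engineer(t) for t in subtasks))
        return "\n\n".join(results)

    print(asyncio.run(manager()))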

> Reportedly the inter-layer transfer sizes are in the kilobyte range and PCIe x1 is plenty, or something like that.

Not an expert, but napkin math tells me that more often than not this will be on the order of megabytes, not kilobytes, since it scales with sequence length.

Example: Qwen3 30B has a hidden state size of 5120; even if activations are quantized to 8 bits, that's 5120 bytes per token, so it passes the MB boundary at just a little over 200 tokens. Still not much of an issue when a single PCIe 4.0 lane is ~2 GB/s.
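
Spelling the napkin math out (same 5120 hidden size as above; the 2k-token prompt and the PCIe 4.0 assumption are mine):

    hidden_size   = 5120                 # per-token activation width (from above)
    bytes_per_val = 1                    # 8-bit activations
    prompt_tokens = 2048                 # assumed prompt length during prefill

    transfer_bytes = hidden_size * bytes_per_val * prompt_tokens
    pcie_x1_bps    = 2e9                 # ~2 GB/s for a single PCIe 4.0 lane

    print(f"{transfer_bytes / 1e6:.1f} MB per layer-split crossing")   # ~10.5 MB
    print(f"{transfer_bytes / pcie_x1_bps * 1e3:.1f} ms over x1")      # ~5.2 ms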

I think device-to-device latency is more of an issue here, but I don't know enough to assert that with confidence.

For each token generated, you only send one token's worth of activations between layers; the previous tokens are already in the KV cache.
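
A toy single-head attention decode step showing that (dimensions and weights are arbitrary; no batching, no positional encoding):

    import numpy as np

    rng = np.random.default_rng(0)
    d = 64
    Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
    k_cache, v_cache = [], []            # filled during prefill / earlier steps

    def decode_step(x_new):
        # Only the newest token's activations are computed (and, in a layer
        # split, shipped across the GPU boundary); old tokens sit in the cache.
        q = x_new @ Wq
        k_cache.append(x_new @ Wk)
        v_cache.append(x_new @ Wv)
        K, V = np.stack(k_cache), np.stack(v_cache)
        scores = q @ K.T / np.sqrt(d)
        attn = np.exp(scores - scores.max())
        attn /= attn.sum()
        return attn @ V                  # one token's worth of output, shape (d,)

    for _ in range(5):
        out = decode_step(rng.standard_normal(d))
    print(out.shape)                     # (64,) per step, regardless of history length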

> Not sure what was unexpected about the multi GPU part. It's well known that most LLM frameworks, including llama.cpp, split models by layers, which creates a sequential dependency, so in a multi-GPU setup all but one GPU sits idle at any moment

Oh, I thought the point of transformers was being able to split the load vertically to avoid sequential dependencies. Is that true just for training, or not at all?

Just for training and for processing the existing context (the prefill phase). But when doing inference, token t has to be sampled before token t+1 can be, so it's still sequential.
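
A toy sketch of that structure (the "model" here is fake and there's no KV cache, so it re-runs the whole sequence each step; only the data dependency is the point):

    import numpy as np

    rng = np.random.default_rng(0)
    VOCAB = 32

    def toy_model(tokens):
        # Prefill: every position goes through the layers as one big batch of
        # rows, which is why that phase parallelizes (and splits) so well.
        return rng.standard_normal((len(tokens), VOCAB))   # fake per-token logits

    def generate(prompt, n_new):
        out = list(prompt)
        logits = toy_model(out)                   # prefill: parallel over tokens
        for _ in range(n_new):
            next_tok = int(logits[-1].argmax())   # "sample" token t from the last position
            out.append(next_tok)
            logits = toy_model(out)               # token t+1 can't start until t exists,
        return out                                # so decode is inherently one-at-a-time

    print(generate([1, 2, 3], n_new=4))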