> Neural accelerators to get prompt prefill time down.
Apple Neural Engine is a thing already, with support for multiply-accumulate on INT8 and FP16. AI inference frameworks need to add support for it.
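For what it's worth, the hook already exists on the Core ML side: you can ask the runtime to prefer the Neural Engine when loading a model. A rough sketch with coremltools, assuming an already-converted model package (the file name and input are made up):

```python
# Rough sketch: load a Core ML model and ask the runtime to prefer the ANE.
# "llm_block.mlpackage" and the "input_ids" input are hypothetical placeholders.
import numpy as np
import coremltools as ct

model = ct.models.MLModel(
    "llm_block.mlpackage",
    # CPU_AND_NE restricts scheduling to CPU + Neural Engine (no GPU);
    # Core ML still decides which ops actually land on the ANE.
    compute_units=ct.ComputeUnit.CPU_AND_NE,
)

out = model.predict({"input_ids": np.array([[1, 2, 3, 4]], dtype=np.int32)})
```

The gap being pointed at is that inference frameworks would have to route their work through this kind of Core ML interface, since (as noted downthread) there's no native ANE programming model.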
> this setup can support up to 4 Mac devices because each Mac must be connected to every other Mac!!
Do you really need a fully connected mesh? Doesn't Thunderbolt just show up as a network connection that RDMA runs on top of?
> Do you really need a fully connected mesh? Doesn't Thunderbolt just show up as a network connection that RDMA runs on top of?
If you daisy chain four nodes, then traffic between nodes #1 and #4 eats up all of nodes #2 and #3's bandwidth, and you pay a big latency penalty. So, absent a switch, the fully connected mesh is the only way to have fast access to all the memory.
Obviously don't daisy chain; that wastes ports badly. But if you connect 4 nodes into a loop it works fine: the average path is only 1.33 hops, so relaying adds just 33% extra traffic. And what specifically are the latency numbers you have in mind?
If you have 3 links per box, then you can set up 8 nodes with a max distance of 2 hops and an average distance of 1.57 hops. That's not too bad. It's pretty close to having 2 links each to a big switch.
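Those hop counts check out with a quick BFS. The 8-node layout below (a ring plus opposite-corner links, i.e. the Wagner graph) is just one possible 3-links-per-box topology that hits those numbers; the comment doesn't say which graph was meant:

```python
# Quick check of the hop-count claims with plain BFS over candidate topologies.
from collections import deque
from itertools import combinations

def hop_stats(nodes, edges):
    """Return (max, average) shortest-path hop count over all node pairs."""
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)

    def bfs(src):
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        return dist

    pair_dists = [bfs(a)[b] for a, b in combinations(list(nodes), 2)]
    return max(pair_dists), sum(pair_dists) / len(pair_dists)

# 4-node loop, 2 links per box: avg 1.33 hops, i.e. ~33% relayed traffic.
ring4 = [(i, (i + 1) % 4) for i in range(4)]
print(hop_stats(range(4), ring4))      # (2, 1.333...)

# 8 nodes, 3 links per box: ring plus opposite-corner links (Wagner graph).
wagner8 = [(i, (i + 1) % 8) for i in range(8)] + [(i, i + 4) for i in range(4)]
print(hop_stats(range(8), wagner8))    # (2, 1.571...)
```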
Can’t you make bandwidth reservations and optimise data location to prefer comms between directly connected nodes over one or two-hop paths?
Sure, one could think of some kind of pipeline parallelism, where each node only needs a fast transfer to the next stage of the model; that would boost throughput but not increase model size.
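The appeal for a ring is that each pipeline stage only ever talks to its neighbor. A minimal sketch of that pattern (the transport callbacks are placeholders for whatever actually moves activations between Macs, e.g. RDMA over Thunderbolt):

```python
# Minimal sketch of pipeline parallelism over a ring: each node holds one
# contiguous slice of layers and only ships activations to the next node,
# so no node ever needs a direct link to every peer.
# send_to_next / recv_from_prev are placeholder transport callbacks.
def run_stage(node_id, layers, recv_from_prev, send_to_next, first_input=None):
    # Node 0 starts from the input batch; later stages block on the previous node.
    x = first_input if node_id == 0 else recv_from_prev()
    for layer in layers:      # only this node's slice of the model runs here
        x = layer(x)
    send_to_next(x)           # one neighbor-to-neighbor transfer per micro-batch
```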
Might be helpful if they actually provided a programming model for the ANE that isn't ONNX. The ANE not having a native development model just means software support will not be great.
ONNX supports CoreML, is that how?
They were talking about neural accelerators (a hardware block in the GPU, not the ANE): https://releases.drawthings.ai/p/metal-flashattention-v25-w-...
> Apple Neural Engine is a thing already, with support for multiply-accumulate on INT8 and FP16. AI inference frameworks need to add support for it.
Or, Apple could pay for the engineers to add it.
Apple already paid software engineers to add TensorFlow support for the ANE hardware.
How much of an improvement can be expected here? It seems to me that, in general, most of the potential gets realized pretty quickly on Apple platforms.