I’m curious how this technique works, or not, with unified memory architectures such as Apple’s M series. It seems like it relies on overlapping processes to speed things up, but I would assume that having everything unified in main memory, so you don’t have to transfer everything back and forth to the GPU, would also have some advantages. Can someone wiser explain this to me?

For FP16-native training of 100B+ models, you will probably still be offloading to swap unless you've got a $150,000 RDMA Mac Studio cluster: the weights, gradients, and optimizer state alone run into the terabytes, well past the unified memory on any single machine. And even if you could fit it all in memory, the workload would be deeply compute-constrained anyway.
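
To make that concrete, here's a rough back-of-envelope sketch (my own assumptions, not anything from the article): with a fairly standard mixed-precision Adam setup you carry something like 16 bytes of training state per parameter, so 100B parameters is on the order of 1.6 TB before activations, versus ~192 GB of unified memory on the largest Mac Studio.

```python
# Back-of-envelope estimate of training-state memory for mixed-precision Adam.
# Assumed per-parameter breakdown (illustrative, not a universal rule):
#   FP16 weights (2 B) + FP16 gradients (2 B)
#   + FP32 master weights (4 B) + FP32 Adam momentum (4 B) + FP32 Adam variance (4 B)
BYTES_PER_PARAM = 2 + 2 + 4 + 4 + 4  # = 16 bytes; activations/KV caches excluded

def training_memory_tb(num_params: float, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Estimate training-state memory (weights + grads + optimizer) in terabytes."""
    return num_params * bytes_per_param / 1e12

if __name__ == "__main__":
    for billions in (100, 175, 400):
        tb = training_memory_tb(billions * 1e9)
        # A 192 GB Mac Studio holds ~0.19 TB of unified memory, so each of
        # these would need many machines or aggressive offloading to swap.
        print(f"{billions:>4}B params -> ~{tb:.1f} TB of training state")
```

Swap the byte counts for whatever optimizer and precision scheme you actually use; the point is just that the totals dwarf any single box's memory, unified or not.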