One thing that's not addressed here is that the bigger you scale your shared-memory cluster, the closer you get to a 100% chance that some node fucks up and corrupts your global memory space.
Currently the fastest way to get data from node A to node B is to RDMA it, which means that any node can inject anything into your memory space.
I'm not really sure how Theseus guards against that.
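To put rough numbers on the scaling point, here's a back-of-the-envelope sketch. It assumes each node misbehaves independently with the same probability over some time window, which real clusters won't strictly satisfy. The function name and the 0.1% per-node figure are made up for illustration, not measurements from any real system.

```python
def p_any_node_fails(p_node: float, n_nodes: int) -> float:
    """Chance that at least one of n_nodes fails, assuming independent,
    identical per-node failure probability p_node (a simplification)."""
    return 1.0 - (1.0 - p_node) ** n_nodes


if __name__ == "__main__":
    # Hypothetical per-node failure probability of 0.1% per window.
    for n in (10, 100, 1_000, 10_000):
        print(f"{n:>6} nodes: {p_any_node_fails(0.001, n):.4f}")
```

The point is just the shape of the curve: with a fixed per-node probability, the chance that at least one node goes bad climbs toward 1 as the node count grows.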
I’m not sure any system prevents RDMA from ruining your day :(
Back in grad school I remember we did something fairly simple but clearly illegal and wedged the machine so bad the out of band management also went down!
> wedged the machine so bad the out of band management also went down!
Now that's living the dream of a shared cluster!
This is hazy now, but I do remember a massive outage of a Lustre cluster, which I think happened because a dodgy node was injecting crap into everyone's memory space via the old Lustre fast filesystem kernel driver. I think they switched to NFS export nodes after that (for the render farm and desktops, at least).