The underlying problem here is real and legitimately difficult. Shunting data around a cluster (ideally even as parts of it fall over) to minimise overall time, in an application-independent fashion, is a definable dataflow problem and also a serious discrete optimisation challenge. The more compute you spend on working out where to move the data, the less you have left over for the application, and it's also tricky to work out what the data access patterns actually are. It's very like choosing how much of the runtime budget to spend on a JIT compiler.
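
To make that trade-off concrete, here's a minimal sketch of the kind of budgeted placement decision I mean. Everything in it (the cost model, the constants, `plan_migrations`) is made up for illustration and has nothing to do with how the actual project works:

```python
import time
from dataclasses import dataclass

# Hypothetical numbers, purely illustrative.
NETWORK_BYTES_PER_SEC = 1e9   # assumed cross-node bandwidth
PLANNING_BUDGET_SEC = 0.01    # cap on compute spent deciding where data goes

@dataclass(frozen=True)
class Block:
    id: int
    size: int  # bytes

def migration_gain(block, target, access_counts):
    """Estimated seconds saved by moving `block` to `target`, based on how
    often `target` was observed reading it remotely, minus the transfer cost."""
    remote_reads = access_counts.get((block.id, target), 0)
    time_saved = remote_reads * block.size / NETWORK_BYTES_PER_SEC
    transfer_cost = block.size / NETWORK_BYTES_PER_SEC
    return time_saved - transfer_cost

def plan_migrations(blocks, nodes, access_counts):
    """Greedy planner that stops as soon as its time budget is spent: the
    JIT-style trade-off of planner time versus application time."""
    deadline = time.monotonic() + PLANNING_BUDGET_SEC
    moves = []
    for block in blocks:
        if time.monotonic() > deadline:
            break  # budget exhausted; run with whatever placement we have
        best = max(nodes, key=lambda n: migration_gain(block, n, access_counts))
        if migration_gain(block, best, access_counts) > 0:
            moves.append((block.id, best))
    return moves

# Example: node "b" keeps reading block 0 remotely, so the planner moves it there.
blocks = [Block(id=0, size=64 * 2**20), Block(id=1, size=64 * 2**20)]
print(plan_migrations(blocks, ["a", "b"], {(0, "b"): 40}))  # -> [(0, 'b')]
```

A real system would of course need a far better cost model and fault handling, but the knob I care about is PLANNING_BUDGET_SEC: how much of the machine you burn on the optimiser itself.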

This _should_ break down as: running already carefully optimised programs on the runtime makes things worse, while running less-carefully-structured ones makes things better, and many programs out there turn out to be either quite naive or obsessively optimised for an architecture that hasn't existed for decades. I'd expect this runtime to be difficult to build but highly valuable if it succeeds. Interesting project, thanks for posting it.