I don't understand. Since it's not using the parallel interface, only one operation can happen at a time, so in this case it would literally be sequential execution with extra overhead. What would one hope to achieve by doing things lazily, when each lazy operation is immediately followed by its evaluation?
The parallel interface, which is async, is probably what you're looking for.
Let's look at the subtraction in this case.
If evaluation is lazy, then the subtraction operator gets fed two unevaluated matrix multiplies.
If it's a dumb subtraction operator, this gives us no benefit. Eventually it evaluates both and then subtracts. And it has some extra overhead like you said.
But if it's a smart subtraction operator, it can realize that both operands are the same expression, and then it can return all 0s without evaluating anything.
And even better than just skipping the matrix math, "all 0s" can be a stub object that takes O(1) time to set up. And then .abs().max() will be instant too.
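To make that concrete, here's a rough sketch in plain Python. Every name in it (Expr, Matrix, MatMul, Sub, Abs, Zeros, stats) is made up for illustration and is not any real library's API. It also assumes finite values, so that x - x really is 0. Matrices are plain nested lists to keep it self-contained.

```python
# Hypothetical sketch: none of these classes come from a real library.
stats = {"matmuls": 0}  # how many matrix multiplies were actually computed


class Expr:
    """An unevaluated expression; operators build a tree instead of computing."""

    def __matmul__(self, other):
        return MatMul(self, other)

    def __sub__(self, other):
        # The "smart" part: structurally identical operands cancel
        # (assuming finite values), so return an O(1) stub and never
        # evaluate either side.
        if self.key() == other.key():
            return Zeros()
        return Sub(self, other)

    def abs(self):
        return Abs(self)

    def max(self):
        # Evaluation is forced only here, when a concrete number is needed.
        return max(max(row) for row in self.evaluate())


class Matrix(Expr):
    def __init__(self, rows):
        self.rows = rows

    def key(self):
        return ("matrix", id(self))  # two leaves match only if same object

    def evaluate(self):
        return self.rows


class MatMul(Expr):
    def __init__(self, left, right):
        self.left, self.right = left, right

    def key(self):
        return ("matmul", self.left.key(), self.right.key())

    def evaluate(self):
        stats["matmuls"] += 1
        a, b = self.left.evaluate(), self.right.evaluate()
        return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
                for row in a]


class Sub(Expr):
    def __init__(self, left, right):
        self.left, self.right = left, right

    def key(self):
        return ("sub", self.left.key(), self.right.key())

    def evaluate(self):
        a, b = self.left.evaluate(), self.right.evaluate()
        return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


class Abs(Expr):
    def __init__(self, operand):
        self.operand = operand

    def key(self):
        return ("abs", self.operand.key())

    def evaluate(self):
        return [[abs(x) for x in row] for row in self.operand.evaluate()]


class Zeros(Expr):
    """Stub for an all-zero result; holds no data at all."""

    def key(self):
        return ("zeros",)

    def abs(self):
        return self  # |0| is still all zeros

    def max(self):
        return 0  # answered without touching any matrix


a = Matrix([[1, 2], [3, 4]])
b = Matrix([[5, 6], [7, 8]])
result = ((a @ b) - (a @ b)).abs().max()
print(result, stats["matmuls"])  # prints "0 0": no multiply was ever run
```

If the two sides differ, say (a @ b) - (b @ a), the same code falls back to the dumb path and evaluates both multiplies when .max() is called. A real implementation would also want Zeros to carry a shape, which this sketch skips.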
I see now, thank you. I was stuck on the "lazy evaluation" part, rather than the optimization part they were actually suggesting.
The Python commands are encountered sequentially. One could imagine a library where the Python commands build up the computation under the hood instead of running it right away. Then, the library would be able to take advantage of situations like this one (or, more practically, reorder multiplications and/or avoid unnecessary temporaries).
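As a small illustration of the "reorder multiplications" point, here is a sketch of the decision such a library could make for a three-matrix product once it knows the shapes up front. The function names (matmul_cost, best_order) are invented for this example, and the cost model is just the textbook m * k * n count of scalar multiplies, not what any particular library uses.

```python
def matmul_cost(shape_a, shape_b):
    """Return (scalar multiplies, result shape) for an (m, k) @ (k, n) product."""
    (m, k), (k2, n) = shape_a, shape_b
    if k != k2:
        raise ValueError("inner dimensions do not match")
    return m * k * n, (m, n)


def best_order(a, b, c):
    """Pick the cheaper way to parenthesize A @ B @ C, given only the shapes."""
    # Option 1: (A @ B) @ C
    cost_ab, shape_ab = matmul_cost(a, b)
    cost_ab_c, _ = matmul_cost(shape_ab, c)
    left = cost_ab + cost_ab_c

    # Option 2: A @ (B @ C)
    cost_bc, shape_bc = matmul_cost(b, c)
    cost_a_bc, _ = matmul_cost(a, shape_bc)
    right = cost_bc + cost_a_bc

    if left <= right:
        return "(A @ B) @ C", left
    return "A @ (B @ C)", right


# Two 1000x1000 matrices times a column vector: multiplying into the
# vector first avoids the 1000x1000 @ 1000x1000 product entirely.
print(best_order((1000, 1000), (1000, 1000), (1000, 1)))
# prints ('A @ (B @ C)', 2000000)
```

Python itself evaluates A @ B @ C left to right, so eager execution would pay about a billion multiplies here, while a library that sees the whole expression before running anything could pick the two-million-multiply order.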