What's the bottleneck? Is it serializing to/from Python objects over and over for the ML ops? I thought PyTorch was pretty good with this: tensors are views, the computation graph can be executed in parallel, & you're just calling a bunch of fast linear algebra libraries under the hood, etc.
If it avoids excessive copying & supports parallel computation, surely it's fine?
If your model is small enough that the overhead of Python starts dominating the execution time, I mean... does performance even matter that much then? And if it's large enough, surely the things I mentioned outweigh the costs?
PyTorch started off with an eager execution model. This means every op you call from Python is dispatched and launched one at a time, bouncing back through the Python interpreter between kernels. torch.compile was introduced to avoid this bottleneck.
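Roughly, the difference looks like this (a minimal sketch; the toy function and shapes are made up, and the benefit only shows up when launch overhead is comparable to kernel runtime):

```python
import torch

# A toy model: in eager mode, each of these ops is dispatched from Python
# one at a time, with a round trip through the interpreter in between.
def f(x, w):
    y = x @ w            # GEMM
    y = torch.relu(y)    # pointwise
    return y * 2 + 1     # two more pointwise ops

# torch.compile traces the function and can fuse the pointwise ops / launch
# kernels without bouncing back to Python between every op.
compiled_f = torch.compile(f)

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(1024, 1024, device=device)
w = torch.randn(1024, 1024, device=device)

print(torch.allclose(f(x, w), compiled_f(x, w), atol=1e-5))
```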
Ah, I always forget that there are intermediate ops that aren't just matrix multiplies in ML.
A single Python interpreter stack frame dispatching into a 10^4 × 10^4 GEMM BLAS kernel is not a bottleneck, but going through 10^8 Python interpreter stack frames to do pointwise addition/broadcast ops would be a bottleneck.
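A crude way to see the asymmetry (sizes scaled down from the numbers above so it runs in a reasonable time; purely illustrative):

```python
import time
import torch

x = torch.randn(4096, 4096)
w = torch.randn(4096, 4096)

# One Python call dispatching a single large GEMM: interpreter overhead is noise.
t0 = time.perf_counter()
y = x @ w
t1 = time.perf_counter()

# Many Python calls, each dispatching a tiny pointwise add: the per-call
# overhead (dispatch, allocation, bookkeeping) starts to dominate.
a = torch.randn(8)
b = torch.randn(8)
t2 = time.perf_counter()
for _ in range(100_000):
    a = a + b
t3 = time.perf_counter()

print(f"one big GEMM: {t1 - t0:.3f}s, 100k tiny adds: {t3 - t2:.3f}s")
```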
Does pytorch overload common broadcast operations though? I was under the impression that it did as well. I guess this is what `torch.compile` attempts to solve?
Yep, this is one issue. There are lots of limitations to what you can compile in this way though, and your python code rapidly resembles a lower level language and not just scripting. There are also overheads associated with handling distributed collectives from Python, multiprocessing for data loader workers in Python, and baked-in assumptions in the lower-level libraries that introduce overhead if you can't go in and fix them yourself (in which case you could be coding in C++ anyway).
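For example, something as mundane as data-dependent control flow is enough to break whole-graph capture (a toy illustration of the kind of limitation I mean; exact behaviour depends on the PyTorch version):

```python
import torch

def f(x):
    # Branching on a tensor's value forces a sync back to Python, so Dynamo
    # can't capture this function as a single graph.
    if x.sum() > 0:
        return x * 2
    return x - 1

compiled = torch.compile(f, fullgraph=True)
try:
    compiled(torch.randn(8))
except Exception as e:
    print("graph capture failed:", type(e).__name__)
```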
> your python code rapidly resembles a lower level language and not just scripting
I thought the point of numeric processing frameworks & languages in general is that if you can express things as common math equations, then geniuses will go in and implement the hyper-optimal solutions for you, because they're extremely common. If anything, it should resemble scripting even more, because you want to match the structured way of expressing things as much as possible, so the 'compiler' (or in this case the backend C libraries) can do the lifting for you.
Yeah, that's not the reality. You often hear people say that neural nets are just linear algebra. That isn't really true anymore if you're going for peak performance: there's also a lot of data handling (e.g. tensor movement, KV caching) and distributed communication that needs to happen too.
Ah, I see. My foray into ML in recent times has mostly concentrated on theoretical models (transformers obviously, but also Mamba, SSMs, etc.) and kernel generation frameworks (such as ThunderKittens and Triton), not really on the systems architecture level.
I've implemented KV caching in C++ and seen it implemented in Python; I see your point.
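For anyone following along, this is a stripped-down sketch of what the Python side tends to look like (hypothetical class; no batching, paging, or attention included):

```python
import torch

class KVCache:
    """Toy per-layer KV cache: preallocate once, copy new keys/values in each step."""

    def __init__(self, max_seq_len: int, n_heads: int, head_dim: int, device: str = "cpu"):
        self.k = torch.zeros(max_seq_len, n_heads, head_dim, device=device)
        self.v = torch.zeros(max_seq_len, n_heads, head_dim, device=device)
        self.length = 0

    def append(self, k_new: torch.Tensor, v_new: torch.Tensor):
        t = k_new.shape[0]
        # Slice-assignments like these are exactly the "data handling" that is
        # trivial in C++ but costs a Python-side dispatch per call in eager mode.
        self.k[self.length:self.length + t] = k_new
        self.v[self.length:self.length + t] = v_new
        self.length += t
        return self.k[:self.length], self.v[:self.length]

# One decode step appends one token's keys/values and returns the cache so far.
cache = KVCache(max_seq_len=2048, n_heads=8, head_dim=64)
k_all, v_all = cache.append(torch.randn(1, 8, 64), torch.randn(1, 8, 64))
```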
No large-scale training & inference either; that's a cool area, where the model can't even fit onto a single GPU. I can see how memory communication can become a significant issue, since you'd have to manage it from Python if you're orchestrating the kernels from Python. (Though you technically could just push all that responsibility down to the lower levels yet again... not a good idea, and it pollutes responsibilities.)
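E.g. even just overlapping host-to-device copies with compute ends up as Python-side stream bookkeeping (a rough, CUDA-only sketch; names and sizes are made up):

```python
import torch

if torch.cuda.is_available():
    copy_stream = torch.cuda.Stream()
    # Page-locked host memory so the host->device copy can actually be asynchronous.
    host_batch = torch.randn(4096, 4096, pin_memory=True)
    weight = torch.randn(4096, 4096, device="cuda")

    # Issue the copy on a side stream...
    with torch.cuda.stream(copy_stream):
        device_batch = host_batch.to("cuda", non_blocking=True)

    # ...other work could run on the default stream here; then make compute wait for the copy.
    torch.cuda.current_stream().wait_stream(copy_stream)
    out = device_batch @ weight
```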