This is hell for a lot of ML containers, which have gigabytes of CUDA and PyTorch in them. Before, at least, you could keep your code contained to its own layer. But if I understand this correctly, every code revision now duplicates gigabytes of the same damn bloated crap.
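The conventional pattern, for reference, orders the Dockerfile so that a source edit only invalidates a tiny final layer. A minimal sketch, with illustrative file names:

    FROM python:3.11-slim
    # Heavy dependencies first: this layer is rebuilt (and re-pushed)
    # only when requirements.txt itself changes.
    COPY requirements.txt .
    RUN pip install -r requirements.txt
    # Code last: a source edit invalidates only this small layer,
    # leaving the multi-GB dependency layer above it cached.
    COPY src/ ./src/
    CMD ["python", "src/main.py"]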

It's even worse when you end up installing PyTorch as a separate package in some other layer: with regular Docker, identical files aren't shared between layers at all.

If you have problems with 13 GB (I believe) of Docker layers ... how do you deal with terabytes or petabytes of AI training data?

Training on petabytes of data is only one application of PyTorch, and one that's going to use tens of thousands of containers anyway, but...

Inference, development cycles, any of the application domains of PyTorch that don't involve training frontier models... all of those are complicated by excessive container layers.

But mostly, dev really sucks when a small code change means writing out an extra 10 GB.

Going to self-promote one last time here: I've built a fix for this, at least on the registry/image-export side, at https://clipper.dev. Docker (and Docker Hub) can't share large files between layers, but I can.

You don't even need megabytes of training data for some ML applications. AI is the sexy thing nowadays, but neural networks (Torch is an NN library) are generally useful even for small regression and classification problems.

For some problems you might even be able to get away with a single-digit number of training points (the classic example of this regime being Physics-Informed Neural Networks).

Yeah, our handful of models we just commit to the git repo--usually only a few MB.

The image still ends up being like 6-8 GiB, though. IIRC, PyTorch had a hard dependency on the CUDA libs, which pulled in a bunch of hardware-specific kernel binaries. The models ran on CPU and didn't even need CUDA, but it was incredibly hard to remove them--there was some PyTorch init code that expected the CUDA crap to exist even on CPU-only installs.
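For what it's worth, the CPU-only wheels avoid most of that nowadays; something like this (a sketch using PyTorch's official CPU wheel index) keeps CUDA out of the image entirely:

    FROM python:3.11-slim
    # Install torch from the CPU-only wheel index: no CUDA runtime,
    # no hardware-specific kernel binaries, a fraction of the size.
    RUN pip install torch --index-url https://download.pytorch.org/whl/cpu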

The training data is on a separate drive; or the training data isn't that large for this use case; or they aren't training.

You don’t train on petabytes on your laptop.