I'm also looking at using OCI at $DAY_JOB for model distribution across fleets of machines, so it's good to see it getting some traction elsewhere.

OCI has some benefits over other systems: tiered caching/pull-through is already pretty battle-tested, as is signing, so it beats more naive distribution methods on reliability, performance, and trust.

If combined with eStargz or zstd:chunked it's also pretty nice for distributed systems, as long as you can slice things up into files in such a way that not every machine needs to pull the full model weights.
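To make that concrete: the stargz-snapshotter library has an estargz.Build helper that converts a plain tar layer to eStargz and lets you mark files to be fetched eagerly, with everything else pulled lazily on first access. A rough sketch (the file names and the prioritized list here are made up for illustration):

```go
package main

import (
	"io"
	"log"
	"os"

	"github.com/containerd/stargz-snapshotter/estargz"
)

func main() {
	// A plain (uncompressed) tar layer containing the model files.
	f, err := os.Open("model-layer.tar")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	fi, err := f.Stat()
	if err != nil {
		log.Fatal(err)
	}

	// Convert to eStargz. Prioritized files are laid out to be fetched
	// eagerly; everything else is pulled lazily on first access.
	blob, err := estargz.Build(
		io.NewSectionReader(f, 0, fi.Size()),
		estargz.WithPrioritizedFiles([]string{"config.json", "tokenizer.json"}),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer blob.Close()

	out, err := os.Create("model-layer.estargz")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()
	if _, err := io.Copy(out, blob); err != nil {
		log.Fatal(err)
	}
	log.Printf("TOC digest: %s", blob.TOCDigest())
}
```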

Failing that, there are P2P distribution mechanisms for OCI (Dragonfly, etc.) that can lessen the burden without resorting to DIY on BitTorrent or similar.

Kubernetes added "image volumes" so this will probably become more and more common: https://kubernetes.io/blog/2024/08/16/kubernetes-1-31-image-...

That is exactly the feature we are using. Right now you need to be on a beta release of containerd, but before long it should be pretty widespread. In combination with lazy pull (eStargz) it's a pretty compelling implementation.
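For reference, the manifest side is tiny; sketched here with client-go types (the image names are placeholders, not anything we actually run):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Pod that mounts an OCI image of model weights as a read-only
	// volume via the image volume source (alpha in Kubernetes 1.31).
	pod := corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "model-server"},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:  "server",
				Image: "example.com/inference-server:latest", // hypothetical image
				VolumeMounts: []corev1.VolumeMount{{
					Name:      "weights",
					MountPath: "/models", // weights appear here, read-only
				}},
			}},
			Volumes: []corev1.Volume{{
				Name: "weights",
				VolumeSource: corev1.VolumeSource{
					Image: &corev1.ImageVolumeSource{
						Reference:  "example.com/models/weights:v1", // hypothetical artifact
						PullPolicy: corev1.PullIfNotPresent,
					},
				},
			}},
		},
	}
	fmt.Printf("pod %q mounts image volume %q\n",
		pod.Name, pod.Spec.Volumes[0].Image.Reference)
}
```

The nice part is the weights never get baked into the runtime image; the kubelet pulls the referenced artifact and mounts it alongside the container.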

Damn, that's handy. Now I wonder how much trouble it would be to make a CSI driver that does this, to backport it to 1.2x clusters (since I don't think Kubernetes backports anything).

Not too hard. If you happen to be on CRI-O this has been implemented for a bit, but if you are like us and on containerd then you need the new 2.1 beta release. That does most of the heavy lifting; implementing a CSI driver that mounts these as PVs wouldn't be super hard, I don't think, and you could borrow liberally from the volume source implementation.
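If anyone wants to attempt it, the skeleton is basically a CSI node plugin whose NodePublishVolume mounts an image at the target path. A hand-wavy sketch, nowhere near a production driver (the imageReference volume-context key and the shell-out to ctr are stand-ins for doing it properly with the containerd client):

```go
package main

import (
	"context"
	"log"
	"net"
	"os/exec"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc"
)

// nodeServer is a bare-bones CSI node plugin. The embedded
// UnimplementedNodeServer stubs out the rest of the interface.
type nodeServer struct {
	csi.UnimplementedNodeServer
}

func (s *nodeServer) NodePublishVolume(ctx context.Context, req *csi.NodePublishVolumeRequest) (*csi.NodePublishVolumeResponse, error) {
	ref := req.GetVolumeContext()["imageReference"] // hypothetical parameter
	// Placeholder: a real driver would use the containerd client to pull
	// the image and mount its snapshot, mirroring what the image volume
	// source does in the kubelet/CRI path.
	cmd := exec.CommandContext(ctx, "ctr", "images", "mount", ref, req.GetTargetPath())
	if out, err := cmd.CombinedOutput(); err != nil {
		log.Printf("mount failed: %v: %s", err, out)
		return nil, err
	}
	return &csi.NodePublishVolumeResponse{}, nil
}

func main() {
	lis, err := net.Listen("unix", "/csi/csi.sock")
	if err != nil {
		log.Fatal(err)
	}
	srv := grpc.NewServer()
	csi.RegisterNodeServer(srv, &nodeServer{})
	log.Fatal(srv.Serve(lis))
}
```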

I've been pretty disappointed with eStargz performance, though... Do you have any numbers you can share? All over the internet, people cite numbers from 10 years ago, from workloads that don't seem realistic at all. In my experiments it didn't provide a significant enough speedup.

(I ended up developing an alternative pull mechanism, which is described in https://outerbounds.com/blog/faster-cloud-compute though note that the article is a bit light on the technical details)

In our case some machines would need to access less than 1% of the image, but being able to have the entire model weights in a single image artifact is an important feature in and of itself. In our specific scenario, even if eStargz is slow by filesystem standards, it's competing with network transfer anyway, so if it's in the same order of magnitude as rsync, that will do.

I don't have any perf numbers I can share, but I can say we see ~30% compression with eStargz, which is already a small win at least, heh.