GitLab used to be about as reliable as GitHub (ignoring the security oopses they used to have).

They simply don't have (or didn't have) the skills to scale. They were talking about using Ceph to run things (which gives you an idea of how green their infra team was).

Are you implying they should create more in-house solutions, or that Ceph specifically is not a good solution and there is some other third-party solution that could be used instead?

What's wrong with Ceph?

What's right with it?

It's slow, large, excessively complex, and not that resilient to failure.

You either want a bunch of NFS machines backed by ZFS on NVMe, with a central jumping-off point that allows sharding (this is critical so that one or more NFS servers can fuck up without killing access to everything else).

Or pay the money and use GPFS.
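
A rough sketch of the sharded-NFS idea, purely as an illustration: a central lookup maps each key (say, a repository path) to exactly one NFS server, so a dead server only takes out the keys that live on it. The host names, the `Router` class, and the hash-modulo placement are all assumptions made up for this example, not anything GitLab actually ran.

```python
import hashlib

# Hypothetical NFS shard hosts; the names are invented for illustration.
SHARDS = ["nfs-01", "nfs-02", "nfs-03", "nfs-04"]


def shard_for(key: str, shards=SHARDS) -> str:
    """Map a key (e.g. a repository path) to one shard using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return shards[int.from_bytes(digest[:8], "big") % len(shards)]


class Router:
    """Central lookup point: resolves keys to shards and tracks failed shards."""

    def __init__(self, shards=SHARDS):
        self.shards = list(shards)
        self.down = set()

    def mark_down(self, shard: str) -> None:
        """Record that a shard is unavailable."""
        self.down.add(shard)

    def resolve(self, key: str) -> str:
        """Return the shard for a key, or fail only if that one shard is down."""
        shard = shard_for(key, self.shards)
        if shard in self.down:
            raise ConnectionError(f"{shard} is down; {key!r} is unavailable")
        return shard


router = Router()
router.mark_down("nfs-02")
# Keys that hash to nfs-02 now raise ConnectionError;
# keys on the other three shards still resolve normally.
```

Plain hash-modulo placement reshuffles most keys when the shard list changes, so a real setup would more likely keep an explicit key-to-shard table; the sketch only shows the failure isolation.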

As someone who's in charge of close to an exabyte on Ceph, I couldn't disagree with you more.

Done correctly, Ceph is extremely reliable, resilient, and fast. Once you get over the initial learning curve, it is, dare I say, even a joy to work with.

I concur, even though I have only used it as a hobbyist.