Assuming those 20PB are hot/warm storage, S3 costs roughly $0.015/GB/month (50:50 average of S3 standard/infrequent access). That comes out to roughly $3.6M/year, before taking into account egress/retrieval costs. Does it really cost that much to maintain your own 20PB storage cluster?
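For anyone who wants to check the arithmetic, here's a quick back-of-envelope in Python (the $0.015/GB/month figure is the rough 50:50 Standard/IA blend stated above, not a quoted AWS price; actual list prices vary by region and volume tier):

    # Sanity check of the figures above.
    capacity_gb = 20 * 1000 * 1000        # 20 PB expressed in GB (decimal)
    blended_rate = 0.015                  # $/GB/month, assumed Standard/IA blend
    annual_cost = capacity_gb * blended_rate * 12
    print(f"${annual_cost / 1e6:.1f}M/year")   # -> $3.6M/year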

If those 20PB are deep archive, the S3 Glacier bill comes out to around $235k/year, which also seems ludicrous: it does not cost six figures a year to maintain your own tape archive. That's the equivalent of a full-time sysadmin (~$150k/year) plus $100k in hardware amortization/overhead.
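Same back-of-envelope for the archive case, using the Glacier Deep Archive list price (~$0.00099/GB/month at the time of writing) against the ballpark DIY figures above:

    capacity_gb = 20 * 1000 * 1000                  # 20 PB in GB
    glacier_annual = capacity_gb * 0.00099 * 12     # ~$238k/year
    diy_annual = 150_000 + 100_000                  # sysadmin + hardware amortization
    print(f"Glacier Deep Archive: ~${glacier_annual / 1e3:.0f}k/yr, "
          f"DIY tape: ~${diy_annual / 1e3:.0f}k/yr")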

The real advantage of S3 here is flexibility and ease-of-use. It's trivial to migrate objects between storage classes, and trivial to get efficient access to any S3 object anywhere in the world. Avoiding the headache of rolling this functionality yourself could well be worth $3.6M/year, but if this flexibility is not necessary, I doubt S3 is cheaper in any sense of the word.
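To make "trivial to migrate between storage classes" concrete, here's a minimal boto3 sketch of a lifecycle rule that demotes aging objects server-side (bucket name, prefix, and thresholds are hypothetical, not a recommendation):

    import boto3

    # Transition objects under "logs/" to cheaper tiers as they age,
    # entirely on the S3 side -- no data migration project required.
    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket="example-archive-bucket",          # hypothetical bucket
        LifecycleConfiguration={
            "Rules": [{
                "ID": "demote-cold-data",
                "Status": "Enabled",
                "Filter": {"Prefix": "logs/"},
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
                ],
            }]
        },
    )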

Like most of AWS, it depends on whether you need what it provides. A 20PB tape system will have an initial cost in the low to mid 6 figures for the hardware and initial set of tapes. Do the copies need to be replicated geographically? What about completely offline copies? Reminds me of conversations with archivists: there's preservation, and then there's real preservation.

> Does it really cost that much to maintain your own 20PB storage cluster?

If you think S3 = storage cluster, then the answer is no.

If you think about what S3 actually is: scalable, high throughput, low latency, reliable, durable, low operational overhead, high uptime, encrypted, distributed, replicated storage with multiple tier1 uplinks to the internet, then the answer is yes.

>scalable, high throughput, low latency, reliable, durable, low operational overhead, high uptime, encrypted, distributed, replicated storage with multiple tier1 uplinks to the internet

If you need to tick all of those boxes for every single byte of 20PB worth of data, you are working on something very cool and unique. That's awesome.

That said, most entities who have 20PB of data only need to tick a couple of those boxes, usually encryption/reliability. Most of their 20PB will get accessed at most once a year, from a predictable location (i.e. on-prem), with a good portion never accessed at all. Or if it is regularly accessed (with concomitant low latency/high throughput requirements), it almost certainly doesn't need to be globally distributed with tier1 access. For these entities, a storage cluster and/or tape system is good enough. The problem is that they naïvely default to using S3, mistakenly thinking it will be cheaper than what they could build themselves for the capabilities they actually need.

How the heck does anyone have that much data? I once built myself a compressed plaintext library from one of those data-hoarder sources that had almost every fiction book in existence, and that was like 4TB compressed (it would've been much less if I'd bothered hunting for duplicates and dropped the non-English titles).

I suspect the only way you could have 20PB is if you have metrics you don't aggregate or keep ancient logs (why do you need to know your auth service had a transient timeout a year ago?)

Lots of things can get to that much data, especially in aggregate. Off the top of my head: video/image hosting, scientific applications (genomics, high energy physics, the latter of which can generate PBs of data in a single experiment), finance (granular historic market/order data), etc.
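A quick back-of-envelope shows how fast video alone gets there (the upload volume and bitrate below are made-up but plausible numbers, purely for illustration):

    # Hypothetical: how long a modest video platform takes to reach 20 PB.
    hours_uploaded_per_day = 2_000            # assumed upload volume
    gb_per_hour = 3                           # assumed ~3 GB per hour of encoded video
    daily_gb = hours_uploaded_per_day * gb_per_hour
    years_to_20pb = 20 * 1000 * 1000 / (daily_gb * 365)
    print(f"~{years_to_20pb:.1f} years to reach 20 PB")   # roughly 9 years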

In addition to what others have mentioned, before the "AI bubble", there was a "data science bubble" where every little signal about your users/everything had to be saved so that it could be analyzed later.