I would say compute and storage separation is the way to go, especially for hyperscaler offerings à la Aurora, Cosmos DB, and AlloyDB. More open-source alternatives will catch up later.
Most analytics workloads are bandwidth-bound if you are optimizing them at all. The major issue with disaggregated storage is that storage bandwidth in the cloud is terrible. I can buy a server from Dell with 10x the usable storage bandwidth of the fastest environments in AWS, and that is reflected in workload performance. The lack of usable bandwidth even on huge instance types means most of that compute and memory is not doing much; you are forced to buy compute you don't need just to access mediocre bandwidth, of which there is never enough. The economics are poor as a result.
This is an architectural decision of the cloud providers to some extent. Linux can drive well over 1 Tbps of direct-attached storage bandwidth on a modern server, but that bandwidth is largely beyond the limits of the cheap off-the-shelf networking that disaggregated storage often runs over.
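Back-of-envelope on why this matters for scan-heavy workloads, using the rough figures from the comments above (1 Tbps direct-attached vs ~100 Gbps per-instance networking; the 10 TB dataset size is an illustrative assumption, not a measurement):

```python
# Time to scan a dataset once at direct-attached vs network-attached
# bandwidth. The bandwidth figures are the rough numbers from this
# thread (1 Tbps local NVMe, ~100 Gbps instance networking).

def scan_seconds(dataset_bytes: float, bandwidth_bits_per_sec: float) -> float:
    """Seconds to read the whole dataset once at the given bandwidth."""
    return dataset_bytes * 8 / bandwidth_bits_per_sec

TEN_TB = 10e12  # illustrative 10 TB of columnar data

local = scan_seconds(TEN_TB, 1e12)     # 1 Tbps direct-attached
network = scan_seconds(TEN_TB, 100e9)  # 100 Gbps disaggregated storage

print(f"local scan:   {local:.0f} s")    # 80 s
print(f"network scan: {network:.0f} s")  # 800 s
```

The 10x gap is the point: the compute on the network-attached instance sits idle for most of those 800 seconds, which is the poor economics described above.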
Object storage does scale out to that performance (via replication), but you need to use multiple compute instances, since you only get around 100 Gbps on each, which is low. You can also push some of the filtering into the storage API, which helps too.
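A rough sketch of that scale-out math, assuming the same illustrative numbers (1 Tbps target, 100 Gbps per instance) and treating API-side filtering as a selectivity factor on the bytes that cross the network:

```python
import math

def instances_needed(target_bps: float, per_instance_bps: float,
                     selectivity: float = 1.0) -> int:
    """Instances whose aggregate network bandwidth can carry the
    post-filter bytes of a raw scan at target_bps. `selectivity` is the
    fraction of bytes that survive filtering at the storage layer."""
    return math.ceil(target_bps * selectivity / per_instance_bps)

# No filtering: move every byte over the network.
print(instances_needed(1e12, 100e9))       # 10

# Storage-side filter drops half the bytes before they hit the network.
print(instances_needed(1e12, 100e9, 0.5))  # 5
```

This is why pushing predicates into the storage API helps: the storage layer still scans at full rate, but far fewer bytes have to cross the slow network link.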