From a security perspective, this is a non-starter. If you leave your MongoDB instance open and I steal the telemetry you are collecting, I can reverse-engineer the data into meaningful insights into cluster workloads. So all your potential national security customers or IP-sensitive customers (finance, biotech, etc.) are immediately out.

Any competent enterprise risk team is going to give a hard no to a SaaS application sitting in the critical path for on-prem, business-critical workloads. So there goes the Fortune 100 too.

If you are successful and schedule workloads better, you are just deferring upgrades and expansions. The customer's Dell/HPE/etc. sales rep is going to freak out, some vice presidents are going to go golfing together, and all the remaining high-value customers don't renew.

What you are really left with is the "small and medium business" clusters that are purpose-specific. They are running at 100% on a handful of tasks that can probably be hand-tuned.

This sounds like really cool technology; I just don't see the business. Hopefully you'll consider open sourcing it soon.

Thanks, the security point is valid, so let me be specific about how deployment works for us!

There's no telemetry egress. Deployments are air-gapped and run in the customer's VPC, on their own hardware. We don't ship telemetry out to a SaaS backend where it could be reverse-engineered; the data never leaves their environment, and for on-prem/air-gapped customers there's zero egress and full audit logging. We are doing all this because finance, biotech, and national-scale customers are our design target. We have all worked in the space and understand what security measures need to be in place for this to work.

For example, the "open MongoDB" failure you mentioned isn't something that would concern us, because there's no central store of their data to leak.

On "SaaS being in the critical path": we agree, and that's why we're not in it. We're not a scheduler or a runtime. Our daemon is passive: if it falls over, jobs still submit and run exactly as they do today. We sit alongside as a prediction/recommendation layer, not in the path that has to be up for the cluster to work.
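To make "not in the critical path" concrete, here is a minimal Python sketch of a fail-open advisory call. It is an illustration under assumptions, not the product's actual code: `submit_with_advice`, `fake_submit`, `working_advisor`, and `broken_advisor` are hypothetical names, and the scheduler is a stub.

```python
from typing import Callable, Optional, Tuple


def submit_with_advice(
    job: dict,
    submit: Callable[[dict], str],
    get_hint: Callable[[dict], dict],
) -> Tuple[str, Optional[dict]]:
    """Submit a job unchanged, then ask the advisory layer for a hint (best effort)."""
    # The job goes to the scheduler first and unmodified, so submission
    # never depends on the advisory daemon being up.
    job_id = submit(job)
    try:
        hint = get_hint(job)
    except Exception:
        # Fail open: a dead or misbehaving advisor only costs us the hint.
        hint = None
    return job_id, hint


# Hypothetical stand-ins for a real scheduler and advisory daemon.
def fake_submit(job: dict) -> str:
    return f"job-{job['name']}"


def working_advisor(job: dict) -> dict:
    return {"predicted_gpu_hours": 1.5}


def broken_advisor(job: dict) -> dict:
    raise ConnectionError("advisor daemon is down")


if __name__ == "__main__":
    print(submit_with_advice({"name": "train"}, fake_submit, working_advisor))
    print(submit_with_advice({"name": "train"}, fake_submit, broken_advisor))
```

In both calls the job is submitted; the only difference when the advisor is down is that the hint comes back as `None`.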

On upgrades and expansions: most large-scale compute users are capacity-constrained and growing faster than they can buy GPUs. If anything, higher utilisation delays the expansion rather than killing it. In terms of unit economics, being able to serve more users with tighter user allocations is a net positive for cloud providers and is something they actively try to pursue :)

Probably the most helpful advice I can give you is pointing out that I wrote my comment after reading your homepage and docs. :)

I used to run security for building-sized computers, if you want any feedback. My email is in my profile.