Maybe you're right - I've never tried Dokploy, but from the documentation it sounds like mostly a deployment, monitoring and alerting tool. For me the problem has always been that once you get the alert (or something just stops working), a human needs to react to it and make things work again. With cloud services you're mostly paying for them to provide that human; with self-hosting, you're the human.

I can see though that today's AI models could eventually replace the human in the loop and truly automatically fix every possible situation.

Yeah, I've had more downtime on managed DBs and cloud servers than on my own self-managed VPS. And if something does happen, with a VPS I can normally fix it instantly, compared to waiting 20-60 minutes for a response that just tells you they've started fixing it. And when they do fix it, that doesn't always mean your instance automatically works again.

I must be using the wrong cloud services. Whenever a part of our app goes down someone on the team still needs to respond to it.

You might be right. I've been mostly using serverless / managed cloud services such as AWS Lambda, API Gateway, S3 and DynamoDB for the past 10+ years. When I've needed to respond, it's been because I myself deployed a bad update and needed to roll it back, or because a third-party integration broke. The cloud platform itself has been very stable, and during the couple of bigger incidents that have happened, I've just waited for AWS to fix it and for things to start working again.

you actually need new ops teammates, not new cloud services :)