Despite multiple comments blaming the AI agent, I think it's the backups that are the problem here, right? With backups, almost any destructive action can be rolled back, whether it's from a dumb robot, a mistaken junior, or a sleep-deprived senior. Without, you're sort of running the clock waiting for disaster.

Yes, backups are great but a 'dumb robot' or a 'mistaken junior' shouldn't have access to prod.

And a sleep-deprived senior? Even then. They shouldn't be able to take destructive actions on prod.

Maybe the senior can get broader access in a time-limited scope if senior management temporarily escalates the developer's access to address a pressing production issue, but at that point the person addressing the issue shouldn't be fighting to stay awake, nor lulled into a false sense of security as they might be during day-to-day operations.
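To make the "time-limited scope" idea concrete, here is a minimal sketch of what such a grant could look like. Everything in it (`AccessGrant`, `grant_temporary_access`, the two-hour default) is hypothetical and only illustrates the shape of the rule: someone else approves, and the access expires on its own.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class AccessGrant:
    """Hypothetical record of temporarily escalated production access."""
    user: str
    approved_by: str
    expires_at: datetime

    def is_active(self, now: datetime) -> bool:
        # The grant is only valid strictly before its expiry time.
        return now < self.expires_at


def grant_temporary_access(
    user: str,
    approved_by: str,
    now: datetime,
    duration: timedelta = timedelta(hours=2),  # assumed default, not a recommendation
) -> AccessGrant:
    # Self-approval would defeat the point of escalation, so reject it.
    if user == approved_by:
        raise ValueError("access must be approved by someone else")
    return AccessGrant(user=user, approved_by=approved_by, expires_at=now + duration)
```

In a real setup the expiry would be enforced by the identity provider or cloud IAM rather than by application code; the sketch only shows the policy.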

Otherwise, only the release pipeline should have permission to take destructive actions on production, and those actions should be released as part of a peer-reviewed set of changes through the pipeline.

If a sleep-deprived senior shouldn’t have access to prod, I think we have big problems, frankly.

If you're Google-sized, you have follow-the-sun rotations to avoid that problem. But what about the rest of the class?

But smart robots like Claude should and will have access to production. Something has to be figured out to make sure operations remain smooth. The argument of "don't do that" will not be a viable position to hold long term. Keeping a human in the loop is not necessary.

It is absolutely necessary. Case in point: most DEVs don't have access to PROD either. Specialists do.

Claude, maybe, is a junior DEV.

Not a release engineer.

"Should" and "will" are pretty large assumptions given the post we're commenting on!

> will not be a viable position to hold long term

Why not? We've literally done it without robots, smart or dumb, for years.

> We've literally done it without robots, smart or dumb, for years.

And we've written extremely buggy and insecure C code for decades too. That doesn't mean that we should keep doing that. AI can troubleshoot and resolve production issues much faster than humans. Putting humans in the loop will cause longer downtime and more revenue loss.

> AI can troubleshoot and resolve production issues much faster than humans

Can, yes, with proper guardrails. The problem is that it seems like every team is learning this the hard way. It'd be great to have a magical robot that could magically solve all our problems without the risk of it wrecking everything. But most teams aren't there yet and to suggest that it's THE way to go without the nuances of "btw it could delete your prod db" is irresponsible at best.

It didn't delete the prod db on its own; a human introduced the error, and if there had been backups, the mistake could have been fixed.

There were backups. The AI deleted them.

When people talk about backups they typically mean located somewhere else. If one terraform command can take out the db and the backups then those backups aren't really separate. It's like using RAID as a backup. Sure it may help, but there are cases where you can lose everything.
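One way to phrase the "not really separate" test is as a check on shared blast radius. The sketch below is purely illustrative: `Resource`, its `account` and `managed_by` fields, and `shares_blast_radius` are made-up names, and a real check would look at actual cloud accounts, credentials, and Terraform state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    name: str
    account: str     # hypothetical: cloud account or project that owns the resource
    managed_by: str  # hypothetical: tool or state file that is able to delete it


def shares_blast_radius(db: Resource, backup: Resource) -> bool:
    """True if a single credential or a single command could remove both."""
    return db.account == backup.account or db.managed_by == backup.managed_by


# Example: both live in the same Terraform state, so one command can take out both.
prod_db = Resource("prod-db", account="prod", managed_by="terraform-main")
snapshots = Resource("prod-db-snapshots", account="prod", managed_by="terraform-main")
offsite = Resource("offsite-dump", account="backup-only", managed_by="manual")
```

Here `shares_blast_radius(prod_db, snapshots)` is true and `shares_blast_radius(prod_db, offsite)` is false, which is the difference between RAID-style redundancy and an actual backup.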

Nobody, not even a "smart robot", should have unfettered read-write production access without guardrails. Read-only? Sure - that's a totally different story.

Read-write production access without even the equivalent of "sudo" is just insane and asking for trouble.
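For illustration, a sudo-like gate could be as crude as the following sketch. The keyword list and the `authorize` function are assumptions made up for this example, and prefix matching like this is easy to bypass (multi-statement input, CTEs, stored procedures), so treat it as the idea rather than a real control.

```python
import re

# Statement prefixes treated as destructive in this sketch; a real list would be longer.
DESTRUCTIVE = re.compile(r"^\s*(DROP|DELETE|TRUNCATE|ALTER|UPDATE)\b", re.IGNORECASE)


def is_destructive(statement: str) -> bool:
    # Only inspects the leading keyword of the statement.
    return bool(DESTRUCTIVE.match(statement))


def authorize(statement: str, approved_by_human: bool = False) -> bool:
    """Allow non-destructive statements; require explicit approval for the rest."""
    if not is_destructive(statement):
        return True
    return approved_by_human
```

With this, `authorize("SELECT * FROM users")` passes, while `authorize("DROP TABLE users")` is refused unless `approved_by_human=True` is passed explicitly. A proper setup would enforce this with database roles or IAM policies rather than string matching.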

> Keeping a human in the loop is not necessary.

You don't work in anything considered Safety Critical, do you?

You need to care about your Recovery Time (how long does it take to get back up again?) and your Recovery Point (how long since your backup was taken?), and it gets Much Worse when you start distributing state around your various cloud systems: oh, did that queue already get that message? How do we re-send it? And so on.
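As a rough worked example of those two numbers, assuming made-up timestamps for a single incident (the function names are just for this sketch):

```python
from datetime import datetime, timedelta


def recovery_point(last_backup_at: datetime, failure_at: datetime) -> timedelta:
    # Data written after the last backup is what you stand to lose.
    return failure_at - last_backup_at


def recovery_time(failure_at: datetime, restored_at: datetime) -> timedelta:
    # How long the system was down before service was restored.
    return restored_at - failure_at


# Hypothetical incident: nightly backup at 02:00, failure at 14:30, restored at 17:00.
last_backup = datetime(2024, 1, 1, 2, 0)
failure = datetime(2024, 1, 1, 14, 30)
restored = datetime(2024, 1, 1, 17, 0)

print(recovery_point(last_backup, failure))  # 12:30:00
print(recovery_time(failure, restored))      # 2:30:00
```

In this made-up case you lose twelve and a half hours of writes and are down for two and a half hours, and that is before accounting for state that lives in queues or other systems.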

They are two orthogonal issues. One doesn't make the other irrelevant.

I agree that a second issue doesn't erase the first, but also I've got enough work experience to know that a system which can be brought down by 1 person no matter the tooling they use is a system not destined to last for long.

Zero workmanship was always worth nothing.

It usually takes about 10 months for folks to have a moment of clarity. Or for the true believer they often double down on the obvious mistakes. =3

100% agree. Everyone should always back up their production database somewhere where it's not trivial to delete.