Or… you build a container that runs exactly what you specify. You ship your logs, traces, and metrics home so you can capture the stack traces and error messages, fix the problem, and deploy another container.
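
To make "ship it home" concrete, here is a minimal sketch in std-only Rust: write one structured line per error to stdout, backtrace included, and let whatever log collector the platform runs forward it. The field names and the config path are illustrative, not from any particular stack.

```rust
// Minimal sketch, std-only Rust: one structured line per error to stdout,
// with a captured backtrace, so the platform's log shipper can carry it home.
use std::backtrace::Backtrace;

fn log_error(msg: &str, err: &dyn std::error::Error) {
    // A real service would use a logging/serialization crate; println! keeps
    // the sketch dependency-free. {:?} on the strings yields escaped JSON-ish output.
    println!(
        "{{\"level\":\"error\",\"msg\":{:?},\"error\":{:?},\"backtrace\":{:?}}}",
        msg,
        err.to_string(),
        Backtrace::force_capture().to_string()
    );
}

fn main() {
    // Hypothetical config path, purely for illustration.
    if let Err(err) = std::fs::read_to_string("/etc/app/config.toml") {
        log_error("failed to load config", &err);
        // Exit non-zero; the orchestrator restarts a fresh container, and the
        // backtrace now lives in the log pipeline rather than on the dead node.
        std::process::exit(1);
    }
}
```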

You’ll never attach a debugger in production. Not going to happen. Shell into what? Your container died when it errored out and was restarted in a fresh state. Any “Sherlock Holmes” work would be met with a clean room. We have 10,000 nodes in the cluster - which one are you going to ssh into to find your container, attach a shell to it, and somehow attach a debugger?

> We have 10,000 nodes in the cluster - which one are you going to ssh into to find your container, attach a shell to it, and somehow attach a debugger?

You would connect to any of the nodes having the problem.

I've worked both ways; IMHO, it's a lot faster to reach understanding in systems where you can inspect and change things as they run than in systems where you have to iterate on adding logs and trying to reproduce the problem somewhere else where you can use interactive tools.

My work environment changed from an Erlang system where you can inspect and change almost everything at runtime to a Rust system in containers where I can't change anything and can hardly inspect the system. It's so much harder.

Say you are debugging a memory leak in your own code that only shows up in production. How do you propose to do that without direct access to a production container that is exhibiting the problem, especially if you want to start doing things like strace?

I will say that, with very few exceptions, this is how a lot of $BigCos manage every day. When I run into an issue like this, I will do a few things:

- Roll back and/or investigate the changelog between the current and prior version to see which code paths are relevant

- Use our observability infra, which is equivalent to `perf` but samples ~everything, all the time, again to see which code paths are relevant

- Potentially try to push additional logging or instrumentation (see the first sketch after this list)

- Try to get a better repro in a non-prod/test env where I can use more aggressive forms of investigation (debugger, sanitizer, etc.) but where I'm not running on production data (see the second sketch after this list)
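
For the logging/instrumentation bullet, the first sketch: what gets pushed is usually tiny, an atomic counter and a timing around the suspect path, emitted through the same stdout/log pipeline. Hedged Rust; `suspect_insert` and the 10,000-call sampling interval are made up for illustration.

```rust
// Hedged sketch of "push additional logging or instrumentation": a cheap
// atomic counter plus a sampled structured log line around the suspect path.
// All names here are illustrative, not from any real codebase.
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Instant;

static SUSPECT_CALLS: AtomicU64 = AtomicU64::new(0);

fn suspect_insert(key: &str, value: &[u8]) {
    let start = Instant::now();

    // ... the code under suspicion goes here ...

    let n = SUSPECT_CALLS.fetch_add(1, Ordering::Relaxed) + 1;
    if n % 10_000 == 0 {
        // One structured line every 10k calls: cheap enough to leave running in prod.
        println!(
            "{{\"metric\":\"suspect_insert\",\"calls\":{},\"last_us\":{},\"key_len\":{},\"value_len\":{}}}",
            n,
            start.elapsed().as_micros(),
            key.len(),
            value.len()
        );
    }
}

fn main() {
    // Tiny driver so the sketch runs standalone.
    for i in 0..50_000u32 {
        suspect_insert(&format!("key-{i}"), &[0u8; 64]);
    }
}
```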

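For the last bullet, the second sketch: once there is a repro outside prod, "more aggressive forms of investigation" can be as simple as running the repro test under a sanitizer. This assumes a Rust codebase and the nightly-only `-Zsanitizer` support described in the Rust unstable book; `handle_request` is a hypothetical stand-in for the suspect path.

```rust
// Hedged sketch: reproduce the leak in a test, then run it under LeakSanitizer
// (nightly-only, per the Rust unstable book), e.g.:
//
//   RUSTFLAGS="-Zsanitizer=leak" cargo +nightly test --target x86_64-unknown-linux-gnu
//
// `handle_request` is a made-up stand-in; the Box::leak is deliberate so the
// sanitizer has something to report.
fn handle_request(payload: &[u8]) -> usize {
    let copy = payload.to_vec();
    Box::leak(Box::new(copy)).len() // intentional leak for the demo
}

#[test]
fn repro_leak_outside_prod() {
    for _ in 0..1_000 {
        handle_request(&[0u8; 512]);
    }
}
```
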
I certainly can't strace or run raw CLI commands on a host in production.

Combined with stack traces of the events, this is the way.

If you have a memory leak, wrap the suspect code in more instrumentation. Write unit tests that exercise that suspect code. Load test that suspect code. Fix that suspect code.
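
A hedged sketch of what that wrapping can look like, assuming Rust: a counting global allocator plus a test that hammers the suspect code and checks that live heap bytes come back down. The names and the slack threshold are illustrative.

```rust
// Hedged sketch: count live heap bytes via a wrapping global allocator, then
// exercise the suspect code in a test and assert the memory is returned.
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

static LIVE_BYTES: AtomicUsize = AtomicUsize::new(0);

struct CountingAlloc;

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        LIVE_BYTES.fetch_add(layout.size(), Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        LIVE_BYTES.fetch_sub(layout.size(), Ordering::Relaxed);
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static ALLOC: CountingAlloc = CountingAlloc;

fn suspect_codepath() {
    // Stand-in for the code under suspicion.
    let _buf = vec![0u8; 1024];
}

#[test]
fn suspect_codepath_returns_its_memory() {
    let before = LIVE_BYTES.load(Ordering::Relaxed);
    for _ in 0..10_000 {
        suspect_codepath();
    }
    let after = LIVE_BYTES.load(Ordering::Relaxed);
    // Some slack for allocator noise from the test harness; a real leak grows without bound.
    assert!(after.saturating_sub(before) < 64 * 1024);
}
```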

I’ll also add that while I build clusters and throw away the ssh keys, there are still ways to gain access to a specific container to view the raw logs and execute commands, but like all container environments, it’s ephemeral. There’s spice access.