Say you are debugging a memory leak in your own code that only shows up in production. How do you propose to do that without direct access to a production container that is exhibiting the problem, especially if you want to start doing things like strace?
I will say that, with very few exceptions, this is how a lot of $BigCo manage every day. When I run into an issue like this, I will do a few things:
- Roll back and/or investigate the changelog between the current and prior version to see which code paths are relevant
- Use our observability infra that is equivalent to `perf`, but samples ~everything, all the time, again to see which code paths are relevant
- Potentially try to push additional logging or instrumentation
- Try to better repro in a non-prod/test env where I can do more aggressive forms of investigation (debugger, sanitizer, etc.) but where I'm not running on production data
I certainly can't strace or run raw CLI commands on a host in production.
Combined with stack traces of the events, this is the way.
If you have a memory leak, wrap the suspect code in more instrumentation. Write unit tests that exercise that suspect code. Load test that suspect code. Fix that suspect code.
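To make "wrap the suspect code in more instrumentation" and "load test that suspect code" a bit more concrete, here is a rough Python sketch using the standard library's `tracemalloc`. Everything named here (`track_allocations`, `net_growth_bytes`, `handle_request`, `_cache`) is made up for illustration, and the leak is deliberately planted. It assumes a Python service; other runtimes would need their own allocation tracing.

```python
import tracemalloc
from contextlib import contextmanager


@contextmanager
def track_allocations(label, top=3, report=print):
    """Report the source lines whose allocations grew most inside the block."""
    started_here = not tracemalloc.is_tracing()
    if started_here:
        tracemalloc.start()
    before = tracemalloc.take_snapshot()
    try:
        yield
    finally:
        after = tracemalloc.take_snapshot()
        if started_here:
            tracemalloc.stop()
        # compare_to() returns per-line diffs, biggest change first.
        for stat in after.compare_to(before, "lineno")[:top]:
            report(f"[{label}] {stat}")


# Hypothetical suspect code: a module-level cache that is never evicted.
_cache = []


def handle_request(payload):
    _cache.append(payload * 1000)  # the planted "leak"
    return len(payload)


def net_growth_bytes(func, iterations):
    """Crude load test: call func repeatedly, return net traced bytes retained."""
    tracemalloc.start()
    try:
        base, _peak = tracemalloc.get_traced_memory()
        for _ in range(iterations):
            func()
        current, _peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return current - base
```

In a test you could assert that `net_growth_bytes(lambda: handle_request("x"), 1000)` stays under some budget, and wrap the same path in `with track_allocations("handle_request"):` to see which lines are holding the memory. Tracing adds real overhead, so this is something to run in a test env or behind a flag, not unconditionally in prod.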
I’ll also add that while I build clusters and throw away the SSH keys, there are still ways to gain access to a specific container to view the raw logs and execute commands, but like all container environments, it’s ephemeral. There’s spice access.