If you like this post, I would recommend “BPF Performance Tools” and “Systems Performance: Enterprise and the Cloud” by Brendan Gregg.

I have pulled off a few miracles using these tools (identifying kernel bottlenecks, profiling programs with eBPF), and reading through the books has been well worth the investment.

Agreed, highly recommended reading. A slightly more up-to-date post of his which recommends tools in such situations is: https://www.brendangregg.com/blog/2024-03-24/linux-crisis-to...

I literally worked miracles at my last job with the first book, and that got me my current job, where I used it again to do some impressive work proving which libraries had what performance... Seriously valuable stuff.

Yeah, it is kind of cheating. I was helping someone debug why their workload was soft locking. I ran the profiling tools and found that cgroup accounting for the workload was taking nearly all the CPU time, spent on locks. Searching through the Linux git logs, I found that cgroup accounting in older kernels had global locks. I saw that newer kernels didn’t have this, so we moved to a newer kernel and all the issues went away.

People thought I was a wizard lol.
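For anyone curious what "nearly all the CPU time on locks" looks like once you have profile data, here is a minimal sketch, not what I actually ran. It assumes the samples have already been collapsed into the folded-stack text format that flame graph tooling uses (semicolon-separated frames, then a space and a sample count). The function name and the sample stacks below are made up for illustration.

```python
def share_of_samples(folded_lines, needle):
    """Return the fraction of samples whose stack contains `needle`.

    Each line is assumed to look like "frame1;frame2;frame3 COUNT".
    """
    total = 0
    matched = 0
    for line in folded_lines:
        line = line.strip()
        if not line:
            continue
        # The sample count is the last space-separated field.
        stack, _, count = line.rpartition(" ")
        n = int(count)
        total += n
        if needle in stack:
            matched += n
    return matched / total if total else 0.0


# Hypothetical data: frame names are invented, not real kernel symbols.
samples = [
    "app;do_work;syscall;cgroup_charge;spin_lock 90",
    "app;do_work;compute 8",
    "app;idle 2",
]
print(share_of_samples(samples, "cgroup"))
```

If one subsystem dominates like this, that is the hint to go read the changelogs for that subsystem.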

I am curious, why don't you update regularly? (student here)

Kernel/distro upgrades can cause severe regressions. Our usual approach is to have the service run a canary on the newer kernel for a while to A/B test the upgrade. We rely on the service owner to validate this A/B test, as we don’t want to own making sure services are healthy on the new kernel. This means it is primarily on the service owner to look at the results of the A/B test and decide whether the upgrade is OK.
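To make the A/B idea concrete, here is a rough sketch of the kind of check a service owner might run. It is only an illustration under assumptions: the metric (request latency in milliseconds), the 99th percentile, the 10% tolerance, and all names are hypothetical, and a real validation would look at many more signals.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile; assumes `values` is non-empty."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def canary_regressed(baseline_ms, canary_ms, pct=99, tolerance=0.10):
    """True if the canary's tail latency exceeds baseline by more than `tolerance`."""
    base = percentile(baseline_ms, pct)
    canary = percentile(canary_ms, pct)
    return canary > base * (1 + tolerance)


# Hypothetical latency samples from hosts on the old and new kernel.
baseline = [10, 11, 12, 13, 14]
canary_ok = [10, 11, 12, 13, 15]
canary_bad = [10, 11, 12, 13, 30]

print(canary_regressed(baseline, canary_ok))
print(canary_regressed(baseline, canary_bad))
```

The point is that the comparison is against hosts still on the old kernel over the same period, not against some historical number.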

I build a cloud (similar to AWS) and we have many tenants running on it. Much like AWS will not force AMI upgrades, we will not force tenants to upgrade either.