Yeah, it is kind of cheating. I was helping someone debug why their workload was soft locking. I ran the profiling tools and found that cgroup accounting for the workload was spending nearly all of its CPU time on locks. Searching through the Linux git logs, I found that cgroup accounting in older kernels used global locks. Newer kernels didn’t have this, so we moved to a newer kernel and all the issues went away.

People thought I was a wizard lol.

I am curious, why don't you update regularly? (student here)

Kernel/distro upgrades can cause severe regressions. Our usual approach is to have the service run a canary on the newer kernel for a while to A/B test the upgrade. We rely on the service owner to validate the A/B test, since we don’t want to own making sure services are healthy on the new kernel. So it is primarily on the service owner to look at the results and decide whether the upgrade is OK.
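
For what it's worth, the comparison itself doesn't have to be fancy. Here is a rough sketch, assuming the service exports some lower-is-better metric such as request latency. The function name, the 5% tolerance, and the sample numbers are all made up for illustration; this is not what we actually run.

```python
from statistics import mean


def canary_regressed(control, canary, tolerance=0.05):
    """Return True if the canary's mean metric is more than `tolerance`
    (as a fraction) worse than the control's. Assumes lower is better."""
    baseline = mean(control)
    return mean(canary) > baseline * (1 + tolerance)


# Hypothetical latency samples in milliseconds.
control_ms = [10.0, 11.0, 9.0, 10.0]   # hosts on the old kernel
canary_ms = [10.0, 10.5, 9.5, 10.0]    # hosts on the new kernel

print(canary_regressed(control_ms, canary_ms))
```

A real check would look at more than one metric and at tail latency rather than just the mean, which is part of why the service owner is the right person to make the call.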

I build a cloud (similar to AWS), and we have many tenants running on it. Much like AWS will not force-upgrade AMIs, we will not force tenants to upgrade either.