All valid and important points, but they're missing a painful one that's also rarely represented in threads like this: flaky hardware.
Almost every bare metal success story paints a rosy picture of perfect hardware (which thankfully is often the case), or of basic hard failures that are easily dealt with. Disk replacement or swapping 1U compute nodes is expected, and you probably have spares on hand. But it's a special feeling to debug the more critical parts that likely don't have idle spares just sitting around: the RAID controller that corrupts its memory, reboots, and rolls back to its previous known-good state; the network equipment that locks up with no explanation; critical components that worked flawlessly for months or years, then shit the bed, but reboot cleanly.
Of course everyone built a secure management VLAN and has remote serial consoles hooked up to all such devices, right? Right? Oh good, they captured some garbled symbols. The vendor's first tier of support will surely not be outsourced offshore or reading from a script, and will have a quick answer that explains and fixes everything. Right?
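For anyone who hasn't gotten around to wiring that up: a rough sketch of what I mean by capturing the serial console over the management network, using ipmitool's serial-over-LAN wrapped in a bit of Python so the output lands in a timestamped log instead of scrolling past. The BMC hostname, user, and log path here are placeholders, and it assumes SOL is enabled on the BMC and that the password is provided via the IPMI_PASSWORD environment variable.

    #!/usr/bin/env python3
    """Sketch: tail a server's serial-over-LAN console into a timestamped log,
    so there's something better than "garbled symbols" to hand to the vendor.
    Hostname, user, and log path are placeholders for your own setup."""
    import subprocess
    import datetime

    BMC_HOST = "bmc-r620-01.mgmt.example.net"  # placeholder management-VLAN address
    BMC_USER = "console-ro"                    # placeholder BMC account
    LOG_PATH = "/var/log/sol/bmc-r620-01.log"  # placeholder log destination

    def capture_sol():
        # -E tells ipmitool to read the password from IPMI_PASSWORD
        proc = subprocess.Popen(
            ["ipmitool", "-I", "lanplus", "-H", BMC_HOST, "-U", BMC_USER, "-E",
             "sol", "activate"],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
        )
        with open(LOG_PATH, "ab") as log:
            for line in proc.stdout:
                stamp = datetime.datetime.now().isoformat().encode()
                log.write(stamp + b" " + line)
                log.flush()

    if __name__ == "__main__":
        capture_sol()

Run one of these per device (conserver or similar does the same job properly), and at least the garbage the box emits before it dies is on disk somewhere.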
The cloud isn't always the right choice, but if you can make it work, it sure is nice not to have to deal with entire categories of problems.
Not saying those things don't happen, but having worked with on-prem for 2 years, and having run ancient (currently 13-year-old) servers in my homelab for 5 years, I've never seen them. Bad CPU, bad RAM, yes - and modern servers are extremely good at detecting these and alerting you.
In my homelab, in 5 years of running the aforementioned servers (3x Dell R620 and a few assorted Supermicros) 24/7/365, the only thing I had fail was a power supply. Turns out they're redundant, so I ordered another one, and the remaining supply kept the server up in the meantime. If I were running these for a business, I'd keep hot spares around.
I'm glad it's working for you! It's worked for me in the past as well, but I've also felt the pain. As I mentioned before, things will often work out fine, but you do need a somewhat higher appetite for risk.
I suppose it depends on scale and requirements. A homelab isn't very relevant IMHO, because the sample size is small and the load is negligible. Push the hardware 24/7 and the cracks are more likely to appear.
A nice-to-have service can suffer some downtime, but if you're running a non-trivial or sizable business or have regulatory requirements, downtime can be rough. Keeping spare compute servers is normal, but you'll be hard-pressed to convince finance to spend big money on core services (db, storage, networking) that sit idle as backups.
Say you convinced finance to spend
Agreed that homelab load is generally small compared to a company’s (though an initial Plex cataloging run will happily max out as many cores as you give it for days).
In the professional environment I mentioned, I think we had somewhere close to 500 physical servers across 3 DCs. They were all Dell blades, and nothing was virtualized. I initially thought the latter was silly, but then I saw that no, they'd matched compute to load pretty well. If needs grew, we'd get another blade racked.
We could not tolerate unplanned downtime (or rather, our customers couldn’t), but we did have a weekly 3-hour maintenance window, which was SO NICE. It was only a partial outage for customers, and even then, usually only a subset of them at a time. Man, that makes things easier, though.
They were also hybrid AWS, and while I was there, we spun up an entirely new "DC" in a region where we didn't have a physical one. It was more or less lift-and-shift, except for managed Kafka, and then later EKS.