The problem with live patching is twofold.
First, you might not reload everything in memory, so it will be patched on disk but not in process.
Second, you have not tested that the system can boot to a functional system. Say you have done live patching for 5 years and never rebooted, and then you have a power loss or hardware failure/upgrade that takes the system down. When you try to bring it back up, it doesn't work. Which configuration change in the past 5 years caused that? Which backup do you use?
And, yeah, everything is hot swappable on VAX. Those machines also cost 6+ figures, and often require a service contract that includes a permanent on site tech.
And, yeah, everything is hot swappable on VAX.
Only the last generation or 2 of the highest end VAXen had any significant hot swap (VAX 9000/400 and later, which sold very poorly). The vast majority of VAX machines didn't. Even hot-swapping DSSI disks was at best iffy.
When someone whose been there talks about VAX 'high availability', they're usually talking about VAX/VMS clustering. Very cool and generally effective approach to the problem. That was one big issue with the end-game VAXen: clustering a couple of 6-figure mid-range machine was often considered a better solution than all-in on one 7- to 8-figure VAX 'mainframe'.
often require a service contract that includes a permanent on site tech.
I don't recall that being common with DEC service contracts. Most of the sites I know of that had dedicated DEC techs were either very large installs or had...other...drivers (e.g. tech had to have a TS clearance to work on the machines).
How would you implement no-downtime hot swap with only one item?
Which is moot, because of the system is important enough you'll have an automatic failover to another system running on standby
All this "we must reboot to test" is bullshit excuses by unqualified workers
Had an accidental reboot, and it could not boot. Had redundancy, but the other server had failed silently days prior. Solved it with three way redundancy and extra monitoring. Systems fail in many ways at the same time. If you do not test it, there is a chance it wont work. Controlled failure is preferred over unknowns, like rebooting once in a while just to make sure it works.
Not sure I'm following honestly. Your primary goes down and it fails over to the secondary (which becomes the primary), but if you can't boot how do you then get another secondary ready to fail over to again when the new primary inevitably fails?
Ah, spoken with the confidence of a freshly minted qualified worker :). Anything you don’t test is a wish, not a production system. You either know that your systems work end to end because you tested periodically, or you pray they will.
How do you know the automatic failover works? How do you know the standby system works?
I’ve seen many a “qualified workers” getting sent packing because they never fully tested the prod system because they just knew everything will work, and never tested the backup systems because qualified workers do the job right the first time, no need for backup.
>First, you might not reload everything in memory, so it will be patched on disk but not in process.
You design for this with generational tagged objects or something similar.
You patch it in memory and on disk. What you put on disk is the patch though, so when you restart, the original unpatched version is booted, and then the same live patch is applied. This is how Ksplice worked. It has the advantage that there isn't a config file in /etc to get changed out from under it, so the second problem did not apply.
Yes, some things actually cost money, especially if they aren't easy to implement.