On the other hand, it is (both halting and spec adherence) are checkable under compute and space constraints though? :) I'd say the biggest hurdle are means to describe the spec in way that is easy enough for a human to produce to make it feasible.

Not a DB person either, but things like TLA+ seem very hard to write even with LLMs. Behavioral tests with an enumerable number of random paths to take (aka model checking - eg jepsen) seem more feasible. Although you can't check internal properties of the system (string `pass` or any of it's copies or parts are not held anywhere in memory at any point between lines A and B) unless we can check that two memory dumps are indistinguishable with different pass strings (assuming we abstracted away storage devices in a test environment).. Also not sure if it's "easy enough" to write such tests either.

Maybe the reason is that OS domain objects / primitives are too complex and not "isolatable" enough / lack a clear contract at all? (Hence multi file refactorings that break invariants.)