I think the disagreement here comes from overcomplicating what is actually a very simple claim.
I am not reasoning about infinities, cardinalities of infinite sets, or expectations over randomly sampled programs. None of that is needed. You do not need infinities to see that one set is smaller than another. You only need to show that one set contains everything the other does, plus more.
Forget “all possible programs” and forget randomness entirely. We only need to reason about possible runtime outcomes under identical conditions.
Take a language and hold everything constant except static type checking. Same runtime, same semantics, same memory model, same expressiveness. Now ask a very concrete question: what kinds of failures can occur at runtime?
In the dynamically typed variant, there exist programs that execute and then fail with a runtime type error. In the statically typed variant, those same programs are rejected before execution and therefore never produce that runtime failure. Meanwhile, any program that executes successfully in the statically typed variant also executes successfully in the dynamic one. Nothing new can fail in the static case with respect to type errors.
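A toy sketch of this, using Python as the dynamic variant (the names and the `greet` function are illustrative, not from any particular codebase). The same file run through a static checker such as mypy models the static variant: the checker flags the bad call before execution, so the runtime failure shown below is unreachable.

```python
# Dynamic variant: this program executes and then fails with a runtime
# type error. A static checker would reject the call greet(42) before
# the program ever runs.

def greet(name: str) -> str:
    return "Hello, " + name

try:
    greet(42)  # passes a non-str; concatenation fails at runtime
except TypeError as e:
    print("runtime type error:", e)

# Any program the static variant accepts, such as greet("world"),
# also runs successfully in the dynamic variant.
print(greet("world"))
```

Note that the annotations are inert at runtime; Python ignores them during execution, which is exactly why the two variants share the same semantics except for the pre-execution check.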
That is enough. No infinities are involved. No counting is required. If System A allows a category of runtime failure that System B forbids entirely, then the set of possible runtime failure states in B is strictly smaller than in A. This is simple containment logic, not higher math.
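The containment claim can be stated as a one-line set comparison. The failure categories below are hypothetical placeholders; the point is only the subset relation, not the particular labels.

```python
# Model the possible runtime-failure categories of each variant as sets.
# The static variant removes "type_error" before execution and adds nothing.
dynamic_failures = {"type_error", "logic_error", "io_error"}
static_failures = dynamic_failures - {"type_error"}

# B's failure states are a strict subset of A's: everything B can do,
# A can do, plus one category more.
assert static_failures < dynamic_failures
```

This is the whole argument: strict containment between two finite-by-category sets, with no counting of programs required.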
The “randomly picked program” framing is a red herring. It turns this into an empirical question about distributions, likelihoods, and developer behavior. But the claim is not about what is likely to happen in practice. It is about what can happen at all, given the language definition. The conclusion follows without measuring anything.
Similarly, arguments about time spent satisfying the type checker or opportunity cost shift the discussion to human workflow. Those may matter for productivity, but they are not properties of the language’s runtime behavior. Once you introduce them, you are no longer evaluating reliability under identical technical conditions.
On the definition of reliability: the specific word is not doing the work here. Once everything except typing is held constant, every other dimension is equal by assumption; there is nothing else left to compare. What remains is exactly one difference: whether a class of runtime failures exists at all. At that point, reliability reduces to failure modes not by preference or definition games, but because there is no other remaining axis. Everything is the same except that one variant has runtime type errors and the other does not. Which one would you call more “reliable”? The answer is obvious.
So the claim is not that statically typed languages produce correct programs or better engineers. The claim is much narrower and much stronger: holding everything else fixed, static typing removes a class of runtime failures that dynamic typing allows. That statement does not rely on infinities, randomness, or empirical observation. It follows directly from what static typing is.