The comment in parentheses mentions "they're not derived from a probability space" [1]. I don't know enough about probability spaces or softmax to know which part of a probability space this is missing compared to other probability distributions, or how other probability distributions satisfy the definition of a probability space.
Sounds like they're saying that since the distribution doesn't come from measuring or calculating the probability of something, it has the form of a probability distribution but isn't really one. Like saying 5 feet is a height that a person can have, but since I just made up that number it's not actually a person's height.
The softmax output is the model's estimate of the probability of each possible next token, conditioned on the inputs, as learned from the training data. The author apparently doesn't know that and thinks it was an arbitrary choice.
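To make that concrete, here is a minimal sketch of softmax turning raw scores into a distribution over next tokens, plus the cross-entropy loss that pushes those numbers toward the frequencies in the training data. The 4-token vocabulary and the logit values are made up for illustration; nothing here comes from the essay or from any particular model.

```python
import math

def softmax(logits):
    """Turn arbitrary real-valued scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary and logits (assumptions, for illustration only).
vocab = ["the", "cat", "sat", "mat"]
logits = [2.0, 1.0, 0.1, -1.0]
probs = softmax(logits)

# Cross-entropy loss for one training example whose actual next token is "the".
# Minimizing this over a dataset drives probs toward P(next token | inputs).
target = vocab.index("the")
loss = -math.log(probs[target])

for token, p in zip(vocab, probs):
    print(f"{token}: {p:.3f}")
print(f"loss: {loss:.3f}")
```

The outputs are non-negative and sum to 1, so they at least have the form of a distribution; the training objective is what ties them to the data rather than leaving them arbitrary.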
The author's essay on the sigmoid similarly misses that the sigmoid comes from somewhere and isn't an arbitrary choice.
iirc, you need a bunch of formal machinery to define probability distributions for situations with infinitely many outcomes (e.g. what is the probability that a random real number between 0 and 10 is less than 3?).
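A rough sketch of that example, assuming "random" means uniformly distributed on [0, 10]: probabilities are assigned to intervals by their length rather than to individual points, which is the part the formal machinery has to make rigorous. The helper name `uniform_prob` is my own, not from any library.

```python
import random
from fractions import Fraction

def uniform_prob(a, b, lo, hi):
    """P(a <= X < b) for X uniform on [lo, hi]: overlap length / total length."""
    overlap = max(0, min(b, hi) - max(a, lo))
    return Fraction(overlap, hi - lo)

p_less_than_3 = uniform_prob(0, 3, 0, 10)  # the interval [0, 3)
p_exactly_3 = uniform_prob(3, 3, 0, 10)    # a single point has length 0

# Sanity check by simulation with a fixed seed (an estimate, not a proof).
rng = random.Random(0)
n = 100_000
estimate = sum(rng.uniform(0, 10) < 3 for _ in range(n)) / n

print(p_less_than_3, p_exactly_3, estimate)
```

Every individual number has probability 0, yet the interval below 3 has probability 3/10, so you can't get the answer by adding up the probabilities of single outcomes the way you would with a die.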