Good article, but

"We take the exponential of each input and normalize by the sum of all exponentials. This transforms a vector of arbitrary real numbers into values between 0 and 1 that sum to 1. Technically this is a pseudo-probability distribution (they're not derived from a probability space), but it's close enough to a probability distribution that for practical purposes it works just fine."
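The quoted transformation, as a minimal sketch in plain Python (the subtract-the-max trick is a standard numerical-stability detail, not something the quote mentions):

```python
import math

def softmax(xs):
    # Subtract the max before exponentiating for numerical stability;
    # this does not change the result.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
# Every output lies in (0, 1), the outputs sum to 1, and order is preserved.
```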

Why is this a "pseudo-probability distribution?"

Mathematically, it is literally a probability distribution, because it fits the definition of a measure whose total mass is one, so I think the language is just imprecise. What they may be trying to say is that semantically it doesn't arise in a principled way from an uncertainty model, such as from Bayesian or frequentist statistics.

Hogwash. If you get into deriving maximum entropy distributions via the calculus of variations, the softmax (Gibbs) form is exactly the maximum entropy distribution over a finite set of outcomes, subject to a constraint on the expected score.
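For the record, here is the derivation being alluded to, in one pass of Lagrangian bookkeeping (notation mine):

```latex
\text{Maximize } H(p) = -\sum_i p_i \log p_i
\quad \text{s.t.} \quad \sum_i p_i = 1, \qquad \sum_i p_i x_i = \mu.

\mathcal{L} = -\sum_i p_i \log p_i + \alpha\Big(\sum_i p_i - 1\Big) + \beta\Big(\sum_i p_i x_i - \mu\Big)

\frac{\partial \mathcal{L}}{\partial p_i} = -\log p_i - 1 + \alpha + \beta x_i = 0
\;\Longrightarrow\;
p_i = \frac{e^{\beta x_i}}{\sum_j e^{\beta x_j}}.
```

With β = 1 this is exactly the softmax of the scores x_i; in thermodynamics β is the inverse temperature, which is why the same form shows up there.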

This is exactly the sense in which it comes up for old school LMs and why it appears in thermodynamics.

Of course it is entirely possible that newfangled ML people use it without understanding that it is derived from first principles - see the article in question.

The comment in parentheses mentions "they're not derived from a probability space" [1]. I don't know enough about probability spaces or softmax to say what part of a probability space this construction is missing compared to other probability distributions, or how other probability distributions do satisfy the definition.

[1] https://en.wikipedia.org/wiki/Probability_space

Sounds like they're saying that since the distribution doesn't come from measuring or calculating the probability of something, it has the form of a probability distribution but isn't really one. Like saying 5 feet is a height that a person can have, but since I just made up that number it's not actually a person's height.

The softmax output is the probability of the next token being whatever it is in the training data, conditioned on the inputs. The author apparently doesn't know that and thinks it was an arbitrary choice.
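A toy check of this claim (the data and numbers here are made up for illustration): for a single fixed context, minimizing cross-entropy over the logits drives the softmax output to the empirical next-token frequencies in the training data.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

# Hypothetical training data: after one fixed context, token 0 occurred
# twice, token 1 five times, token 2 three times.
counts = [2, 5, 3]
total = sum(counts)
target = [c / total for c in counts]  # empirical frequencies [0.2, 0.5, 0.3]

# Gradient descent on cross-entropy. The gradient with respect to the
# logits is (softmax(logits) - target), a standard identity.
logits = [0.0, 0.0, 0.0]
lr = 0.5
for _ in range(2000):
    p = softmax(logits)
    logits = [l - lr * (pi - ti) for l, pi, ti in zip(logits, p, target)]

# softmax(logits) now matches the empirical conditional distribution,
# so the "probabilities" are estimates of something real, not arbitrary.
```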

The author's essay on the sigmoid similarly lacks the deep understanding that it comes from somewhere and isn't an arbitrary choice.

iirc, there is a bunch of formal machinery you need to define probability distributions for situations such as infinite outcomes (e.g., what is the probability that a random real number between 0 and 10 is less than 3?)
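That infinite-outcome example does have a clean answer once you pick a measure: if "random" means uniform on [0, 10], then P(X < 3) = 3/10. A Monte Carlo sanity check (the sample size and seed are mine):

```python
import random

random.seed(0)
n = 200_000
# Draw uniformly from [0, 10] and count how often the draw falls below 3.
hits = sum(1 for _ in range(n) if random.uniform(0.0, 10.0) < 3.0)
estimate = hits / n
# estimate converges to the exact measure-theoretic answer, 3/10
```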