"This transforms a vector of arbitrary real numbers into values between 0 and 1 that sum to 1"

Not really; softmax transforms logits (logarithms of probabilities) into probabilities.

Probabilities → logits → back again.

Start with p = [0.6, 0.3, 0.1]. Logits = log(p) = [-0.51, -1.20, -2.30]. Softmax(logits) = original p.
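The round trip above is easy to check numerically; a minimal sketch (the `softmax` helper here is my own, not from any library):

```python
import math

def softmax(xs):
    # subtract the max for numerical stability; this doesn't change the
    # result because softmax is invariant to adding a constant
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

p = [0.6, 0.3, 0.1]
logits = [math.log(x) for x in p]   # ≈ [-0.51, -1.20, -2.30]
recovered = softmax(logits)          # ≈ [0.6, 0.3, 0.1]
```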

NNs prefer to output logits because they are unconstrained, ranging over all of (-inf, +inf), so a final linear layer can produce them directly.

Softmax is defined over an arbitrary vector of raw real numbers. Stating that those inputs are "logits" is applying post-hoc semantics to what the model is learning. One of the key properties of softmax is shift invariance (e.g. softmax([-1, 1, 3, 5]) == softmax([9, 11, 13, 15])), and so it is easiest to just think of it as operating on a vector of unnormalized raw scores, which is the more colloquial definition of logit.
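That shift invariance is quick to verify (sketch with a hand-rolled `softmax`, since nothing here assumes a particular library):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

a = softmax([-1, 1, 3, 5])
b = softmax([9, 11, 13, 15])  # same vector shifted by +10
# a and b are equal: adding a constant c multiplies every exp(x) by
# exp(c), which cancels in the normalization
```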

(also, log(p) is not the formal definition of a logit)

It's still true that softmax transforms arbitrary vectors into probability vectors.

In your example you'll also get the original `p` with just `exp(logits)`. Softmax normalizes the output to sum to one, so it can output a probability vector even if the input is _not_ simply `log(p)`.

Logit is log odds, not log probability: logit(p) = log(p / (1 - p))