Hacker News

rhdunn 3 hours ago

The comment in parentheses mentions "they're not derived from a probability space" [1]. I don't know enough about probability spaces or softmax to know what part of a probability space this is missing compared to other probability distributions, or how other probability distributions satisfy the definition of a probability space.

[1] https://en.wikipedia.org/wiki/Probability_space

ghkbrew an hour ago

Sounds like they're saying that since the distribution doesn't come from measuring or calculating the probability of something, it has the form of a probability distribution but isn't really one. Like saying 5 feet is a height that a person can have, but since I just made up that number it's not actually a person's height.

jmalicki an hour ago

The softmax output is the probability of each candidate next token, as seen in the training data, conditioned on the inputs. The author apparently just doesn't know that and thinks it was an arbitrary choice.
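
A minimal sketch of what that looks like, assuming plain Python and made-up logits (the token names and scores below are hypothetical, not from any real model). Softmax turns the scores into non-negative numbers that sum to 1, and training with cross-entropy against the observed next token is what pushes those numbers toward the conditional frequencies in the training data:

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability; the result is unchanged.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate next tokens given some context.
logits = {"cat": 2.0, "dog": 1.0, "fish": 0.1}
probs = dict(zip(logits, softmax(list(logits.values()))))

# Cross-entropy loss for one training example whose observed next token was "cat".
# Minimizing this over the data set is maximum-likelihood fitting of P(token | context).
loss = -math.log(probs["cat"])

print(probs, loss)
```

The values are non-negative and sum to 1 by construction, so they are a valid distribution over the finite set of tokens.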

The author's essay on the sigmoid similarly lacks the deep understanding that it comes from somewhere and isn't an arbitrary choice.
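
One way to see where the sigmoid comes from, as a rough sketch under my own assumptions (the helper names `softmax2` and `log_odds` are made up for illustration): it is the two-class softmax with the second score fixed at 0, and it is the inverse of the log-odds function.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax2(a, b):
    # Probability of the first class under a two-class softmax.
    m = max(a, b)
    ea, eb = math.exp(a - m), math.exp(b - m)
    return ea / (ea + eb)

def log_odds(p):
    # log(p / (1 - p)), the quantity a logistic model treats as linear in the inputs.
    return math.log(p / (1.0 - p))

z = 1.7  # arbitrary score, chosen only for illustration

same_as_softmax = abs(sigmoid(z) - softmax2(z, 0.0)) < 1e-12
inverts_log_odds = abs(log_odds(sigmoid(z)) - z) < 1e-9
print(same_as_softmax, inverts_log_odds)
```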

mojosmojo an hour ago

iirc, there is a bunch of formal machinery you need to define probability distributions for situations with infinitely many outcomes (e.g., what is the probability that a random real number between 0 and 10 is less than 3?).
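
For that specific example, here is a small sketch, assuming "random" means uniform on [0, 10] (the function name `uniform_prob` is my own). Probability is assigned to intervals by length rather than to individual points, which is why any single exact value gets probability 0:

```python
import random

def uniform_prob(a, b, lo=0.0, hi=10.0):
    """P(a <= X < b) for X uniform on [lo, hi]: interval length over total length."""
    a, b = max(a, lo), min(b, hi)
    return max(b - a, 0.0) / (hi - lo)

p_less_than_3 = uniform_prob(0.0, 3.0)  # length 3 out of 10
p_exactly_3 = uniform_prob(3.0, 3.0)    # a single point has length 0

# Monte Carlo sanity check with a fixed seed.
rng = random.Random(0)
n = 100_000
estimate = sum(rng.uniform(0.0, 10.0) < 3.0 for _ in range(n)) / n

print(p_less_than_3, p_exactly_3, estimate)
```

Softmax over a finite vocabulary doesn't need any of this machinery, since each outcome simply gets its own number.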