Even more importantly, it's not even a simple probability of death, or a fraction of a cause, or any simple one-dimensional aspect. Even if you can simplify things down to an "arrow", the label isn't a scalar number. At a bare minimum, it's a vector, just like embeddings in LLMs are!

Just as importantly, the endpoints of each such causative arrow are also complex, fuzzy things, and are best represented as vectors. That is, diseases aren't simple labels like "Influenza"; there are thousands of ever-changing variants of just the flu out there!

A proper representation of a "disease" would be a vector also, which would likely have interesting correlations with the specific genome of the causative agent. [1]

The next thing you want to consider is the "vector product" between the disease and the thing it infects, to account for susceptibility, previous immunity, etc...
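To make that concrete, here's a toy numpy sketch of the idea. Everything in it is made up for illustration: the embedding dimension, the random "disease" and "host" vectors, and the sigmoid squashing into a 0-to-1 susceptibility score.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical learned embeddings: a disease variant and a host,
    # each a point in a shared high-dimensional space.
    dim = 64
    disease_vec = rng.normal(size=dim)   # could correlate with strain/genome features
    host_vec = rng.normal(size=dim)      # could encode prior immunity, genetics, etc.

    def susceptibility(disease, host):
        # The "vector product": a scaled dot product, squashed into (0, 1).
        score = disease @ host / np.sqrt(len(disease))
        return 1.0 / (1.0 + np.exp(-score))

    print(susceptibility(disease_vec, host_vec))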

A hop, a skip, and a small step and you have... Transformers, as seen in large language models. This is why they work so well: they encode the complex nuances of reality in a high-dimensional probabilistic causal framework that they can use to process information, answer questions, etc...
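For reference, the core operation inside a transformer really is this kind of pairwise vector scoring: scaled dot-product attention. A minimal numpy sketch, with toy shapes and none of the real details (no batching, masking, multiple heads, or learned projections):

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def attention(Q, K, V):
        # Every query scores every key with a scaled dot product,
        # then mixes the values according to those scores.
        scores = Q @ K.T / np.sqrt(Q.shape[-1])
        return softmax(scores, axis=-1) @ V

    rng = np.random.default_rng(0)
    Q, K, V = (rng.normal(size=(5, 16)) for _ in range(3))
    print(attention(Q, K, V).shape)   # (5, 16)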

Trying to manually encode a modern LLM's embeddings and weights (about a terabyte!) is futile beyond belief. But that's what it would take to make a useful "classical logic" model that could have practical applications.
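For a sense of scale: at two bytes per parameter (fp16), a terabyte of weights works out to roughly 10^12 / 2 ≈ 500 billion individual numbers, every one of which would have to be hand-chosen and kept consistent with all the others.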

Notably, expert systems, which use this kind of approach, were worked on for decades and were almost total failures in the wider market because they were mostly useless.

[1] Not all diseases are caused by biological agents! That's a whole other rabbit hole to go down.

That was very well said.

One quibble, and I really mean only one:

> a high-dimensional probabilistic causal framework

Deep learning models, a.k.a. neural-network-type models, are not probabilistic frameworks. While we can measure, from the outside, a probability of correct answers across the whole training set, or any data set, there is no probabilistic model inside.

Like a Pachinko game, you can measure statistics about it, but the game itself is topological. As you point out very clearly, these models perform topological transforms, not probabilistic estimations.

This becomes clear when you test them with different subsets of data. It quickly becomes apparent that the probabilities measured on the training set are just that: probabilities of that exact training set, and nothing more. There is no probabilistic carry-over to any subset, or generalization to any new values.

They are estimators, approximators, function/relationship fitters, etc., in contrast to symbolic, hard numerical, or logical models. But they are not probabilistic models.

Even when trained to minimize a probabilistic performance function, their internal need to represent things topologically creates a profoundly "opinionated" form of solution, as opposed to being unbiased with respect to the probability measure. The measure never gets internalized.

What’s the relationship between what you’re saying and the concepts of “temperature” and “stochasticity”? The model won’t give me the same answer every time.

You are just adding random behavior to the system to create variation in response.
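Concretely, "temperature" is just a deterministic rescaling of the model's output scores before a random draw; all of the randomness lives in the draw itself. A rough sketch with made-up logits (real decoders add top-k/top-p and other details):

    import numpy as np

    rng = np.random.default_rng()

    logits = np.array([2.0, 1.0, 0.1])   # deterministic output of the model
    temperature = 0.8

    # Temperature only reshapes the distribution, deterministically.
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    # The variation comes entirely from this sampling step.
    next_token = rng.choice(len(probs), p=probs)

    # With greedy decoding there is no randomness at all.
    greedy_token = int(np.argmax(logits))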

Random behavior in inputs, or in operations, results in random behavior in the outputs. But there is no statistical expression or characterization that can predict the distribution of one from the other.

You can't say, "I want this much distribution in the outputs, so I will add this much distribution to the inputs, weights, or other operational details."

Even if you create an exhaustive profile of "temperature" and output distributions across the training set, it will only be true for exactly that training set, on exactly that model, for exactly those random conditions. It will vary significantly and unpredictably across subsets of that data, or new data, and different random numbers injected (even with the same random distribution!).

Statistics are a very specific way to represent a very narrow kind of variation, or for a system to produce variation. But lots of systems with variation, such as complex chaotic systems, or complex nonlinear systems (as in neural models!) can defy robust or meaningful statistical representations or analysis.
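A tiny deterministic example of that last point, using the logistic map (a classic chaotic system, chosen here just for illustration): two nearly identical inputs, no randomness anywhere, and yet no useful way to predict how the small input difference shows up at the output.

    # Logistic map at r = 4: fully deterministic, no probabilistic model,
    # yet extremely sensitive to tiny input differences.
    def logistic(x, r=4.0):
        return r * x * (1.0 - x)

    x, y = 0.2, 0.2 + 1e-10   # two almost identical starting points
    for _ in range(50):
        x, y = logistic(x), logistic(y)
    print(x, y)   # by now the two trajectories bear no resemblance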

(Another way to put this: you can measure logical properties of any system, such as whether an output is greater than some threshold, or whether two outputs are equal. The logical measurements can be useful, but that doesn't mean it is a logical system.

Any system with any kind of variation can have potentially useful statistical type measurements done on it. Any deterministic system can have randomness injected to create randomly varying output. But neither of those situations and measurements makes the system a statistically based system.)

The probability distribution that the model outputs is deterministic. The decoding method that uses that distribution to decide what next token to emit may or may not be deterministic. If we decide to define the decoding method as part of "the model", then I guess the model is probabilistic.

It's also worth noting that the parameters (weights and biases) of the model are random variables, technically speaking (they're estimators, functions of randomly sampled training data), and in that sense this can be considered probabilistic in nature. The parameter estimates themselves are not random variables, to state the obvious; the estimates are simply numbers.

[deleted]

You're losing interpretability and scrutability, but gaining detail and expressiveness. You have no way to estimate the vectors in a causal framework; all known methods are correlational. You have no clean way to map the vectors to human concepts. And vectors are themselves extremely compressed representations; there is no clear threshold beyond which a representation becomes "proper".