Fascinating, and thank you for the great explanation. I was actually going to follow up and ask about AI, but your response covered that as well :)
I just expanded more in a reply to my own comment if you're interested!
To technically expand a bit on AI:
Any regularity in an environment which an embedded system can detect but fails to exploit represents an amount of excess free energy in the organism, distributed over itself, its group, its species, etc. depending on what types of systems and scales you choose to model.
There are parallels in information theory: any recognizable pattern or relationship left within a compressed message represents redundancy, i.e. bits spent beyond the entropy of the source (the average uncertainty of future states), since that regularity was not exploited during compression and remained in the compressed structure. This means that a perfectly compressed message is functionally indistinguishable from random noise.
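To make the "compressed output looks like noise" point concrete, here is a rough sketch using Python's standard zlib module. It compares the byte-level Shannon entropy of some regular text with that of its compressed form. The word list, the seed and the byte_entropy helper are made up for illustration, and a byte histogram is only a crude proxy for "indistinguishable from noise": it catches uneven byte frequencies, not every possible pattern.

```python
import math
import random
import zlib
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte histogram, in bits per byte (maximum 8.0)."""
    counts = Counter(data)
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Illustrative input (an assumption, not real prose): 20000 words drawn from a
# tiny vocabulary, so the text is full of exploitable regularity.
words = ["entropy", "pattern", "signal", "noise", "model", "energy", "system", "state"]
rng = random.Random(0)
text = " ".join(rng.choice(words) for _ in range(20000)).encode("ascii")

packed = zlib.compress(text, 9)

print(f"raw:    {len(text)} bytes, {byte_entropy(text):.2f} bits/byte")
print(f"packed: {len(packed)} bytes, {byte_entropy(packed):.2f} bits/byte")
```

The compressed stream should come out both shorter and much closer to 8 bits per byte than the raw text. zlib is far from a perfect compressor, so some regularity remains that a stronger model could still exploit.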
You can view weights in an AI model through the same lens: The weights represent "knowledge" of the environment the model has been exposed to. The model is designed to correctly predict future states, and thus "learning" is effectively the compression of a full model of the environment, which is more efficient to traverse than the uncompressed model. A perfectly learned environment minimizes uncertainty and should translate to weights that have no discernible patterns and thus are also functionally indistinguishable from random noise, void of any regularity.
Some level of "compression" of the local environment is required for any stable embedded system. Without it, the system would have to become a perfect copy of the very environment it is embedded within, and keeping it stable would take as much energy as the entire universe contains. This is obviously thermodynamically prohibitive.
Hopefully this helps make the relationship between structure, knowledge, information and uncertainty a lot more intuitive.
As a bonus, consider Fabrice Bellard's ts_zip, a great showcase of how knowledge and compression are related.
https://bellard.org/ts_zip/
ts_zip compresses text at record efficiency (at the cost of orders of magnitude more memory and compute; nothing is free).
Previous attempts at text compression all purely relied on character-level patterns and semantics, syntactical structure, etc., maybe with some heuristic tweaking here and there.
That got us far, but LLMs do something never achieved before, which is to incorporate relationships beyond the surface: not just placement of characters, n-grams or words, but the actual meaning behind them, and large-scale correlations with other words or tokens across vast context windows.
The LLM actually becomes a world model with enough size and training, and thus we are able to use every fact we know about everything to compress text. If we're speaking about biology, for example, that constrains the probabilities of what the most likely word might be after a given prefix. The same holds if the context is constrained to a specific historical period.
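The link between prediction and compression can be sketched in a few lines. An ideal entropy coder spends about -log2(p) bits on a symbol that the model predicted with probability p, so a model that uses more context to sharpen its predictions yields a shorter message. The toy below is not an LLM and not how ts_zip works internally; it is a character n-gram model with add-one smoothing, and the word list, the seed and the bits_per_char helper are assumptions made up for illustration.

```python
import math
import random
from collections import Counter, defaultdict


def bits_per_char(train: str, test: str, order: int) -> float:
    """Average ideal code length (-log2 p) per character of `test`, under a
    character model that conditions on the previous `order` characters.
    The model is fitted on `train` and uses add-one smoothing."""
    alphabet = set(train) | set(test)
    counts = defaultdict(Counter)
    for i in range(order, len(train)):
        counts[train[i - order:i]][train[i]] += 1

    total_bits = 0.0
    n = 0
    for i in range(order, len(test)):
        ctx = counts[test[i - order:i]]
        p = (ctx[test[i]] + 1) / (sum(ctx.values()) + len(alphabet))
        total_bits -= math.log2(p)
        n += 1
    return total_bits / n


# Illustrative corpus (an assumption): random words from a tiny vocabulary.
words = ["entropy", "pattern", "signal", "noise", "model", "energy", "system", "state"]
rng = random.Random(0)
train = " ".join(rng.choice(words) for _ in range(15000))
test = " ".join(rng.choice(words) for _ in range(5000))

no_context = bits_per_char(train, test, order=0)    # letter frequencies only
with_context = bits_per_char(train, test, order=3)  # previous 3 characters

print(f"order 0: {no_context:.2f} bits/char")
print(f"order 3: {with_context:.2f} bits/char")
```

The order-3 model should need far fewer bits per character, because inside a word the next letter is almost fully determined by the previous three. An LLM pushes the same idea much further by conditioning on meaning and long-range context instead of a few preceding characters.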
All of these regularities can be leveraged, at the cost of a lot of energy, in order to create compressed text that gets arbitrarily close to looking like completely random noise (actually verifying this is not possible in general though, since Kolmogorov complexity is uncomputable).
The catch is, such systems are specialized and depend on the regularity of cheap, widely-available energy networks and consumer access to cheap compute. Take that away, and such a system becomes ill-suited compared to just using bzip. I mean, even now, bzip is a better choice when considering energy tradeoffs. And ts_zip in particular is specialized to the point of only working with text and not arbitrary byte streams.