Isn't this like Cyc? There have been a couple of interesting articles about that on HN:

https://news.ycombinator.com/item?id=43625474 "Obituary for Cyc"

https://news.ycombinator.com/item?id=40069298 "Cyc: History's Forgotten AI Project"

Seems like a subset of CYC - attempting to gather causal data rather than declarative data in general.

It's a bit odd that their paper doesn't even mention CYC once.

Cyc is hardly [1] mentioned in modern work under the knowledge representation and reasoning umbrella, because most [2] of it was/is unavailable or unknown to most researchers. It's hard to build on something that's primarily marketing material.

[1] I could be wrong, but even those that mention Cyc use it only as a historical example of early work in KRR / symbolic AI. [2] OpenCyc being the small subset which is available, tho I haven't met anyone who worked with it.

Everything old is new again

The sample set contains:

    {
        "causal_relation": {
            "cause": {
                "concept": "boom"
            },
            "effect": {
                "concept": "bust"
            }
        }
    }
It's practically a hedge-fund-in-a-box.

Plus, regardless of what you might think of how valid that connection is, what they're actually collecting, absent any kind of mechanism, is a set of all apparent correlations...

And as most of us know, correlation doesn't necessarily equal causation.

I read it as "casual" rather than "causal", and got very disappointed while reading the article!

An inventory of casual knowledge would be really fun, although it's hard to think what it would consist of now that I think about it...

There is this concept of "hidden knowledge": all the things you know at work that no one really thinks of as knowledge, which makes it hard to pass them on to newcomers.

But that does sound different than "casual knowledge", and so does "trivia".

Oh well!

I did, too! And it reminded me of a project idea I had a while ago:

A time traveller's wiki that collects casual knowledge for different times (and different places).

Such as: "Buying a train ticket in Paris in 1972".

But it was a shower thought and it's pretty hard to imagine how this knowledge should be collected and especially presented.

In a way, wikipedia is already doing this by keeping records of articles as they change over the years :)

The article about train tickets wasn't so good as an example but "computer monitor" from 2004 is kind of fun to read :)

Unfortunately, "casual knowledge" is often omitted when writing informative articles. In this example, there is no mention that power buttons are often located somewhere on the back of the monitor, which was good to know in 2004. Also, some monitors draw power from the computer, so they won't power up before the computer does. And speaking of that: you may want to turn off your computer after shutdown!

Edit: This would probably be useful for novelists and filmmakers (in addition to the casual time traveller)

I find casual knowledge particularly interesting, because it's the exact kind of thing that's most related to my day to day experience, but is simultaneously the exact thing that our encyclopedias and AIs omit.

Could you be referring to what is known as tacit knowledge?

it's as simple as precisely describing "common sense"

Yes, that very simple task :D

But I also wonder if that is actually an equivalent.

I think of "common sense" as things you should know to function well in your current society.

I really don't know what "casual knowledge" would mean. In my head it's some kind of low stakes knowledge for everyday life (but more 'useful' than trivia).

Maybe the order is "common sense", "casual knowledge" and trivia?

This makes little sense to me. Ontologies and all that have been tried and have always been found to be too brittle. Take the examples from the front page (which I expect to be among the best in their set): human_activity => climate_change. Those are such broad concepts that it's practically useless. Or disease => death. There's no nuance at all. There isn't even a definition of what "disease" is, let alone a way to express that myxomatosis is lethal only for European rabbits, not humans, nor goldfish.

Given that we've tried to develop such ontologies for thousands of years now, what do you think the cause of such hopeless optimism might be? If only we had a database of causal relationships to consult...

Even more importantly, it's not even a simple probability of death, or a fraction of a cause, or any simple one-dimensional aspect. Even if you can simplify things down to an "arrow", the label isn't a scalar number. At a bare minimum, it's a vector, just like embeddings in LLMs are!

Even more importantly, the endpoints of each such causative arrow are also complex, fuzzy things, and are best represented as vectors. I.e.: diseases aren't just simple labels like "Influenza". There's thousands of ever-changing variants of just the Flu out there!

A proper representation of a "disease" would be a vector also, which would likely have interesting correlations with the specific genome of the causative agent. [1]

Next thing is that you want to consider the "vector product" between the disease and the thing it infected to cater for susceptibility, previous immunity, etc...
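
A toy sketch of that kind of interaction, with a made-up 3-dimensional space and made-up numbers, purely for illustration:

    import numpy as np

    # Hypothetical "embedding" of a disease variant and of a host's immune state.
    # The dimensions and values are invented for illustration only.
    variant = np.array([0.9, 0.1, 0.4])
    host    = np.array([0.8, 0.0, 0.3])

    # A dot product (or a learned bilinear form) scores how strongly this
    # particular variant interacts with this particular host.
    susceptibility = float(variant @ host)
    print(susceptibility)  # 0.84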

A hop, skip, and a small step and you have... Transformers, as seen in large language models. This is why they work so well, because they encode the complex nuances of reality in a high-dimensional probabilistic causal framework that they can use to process information, answer questions, etc...

Trying to manually encode a modern LLM's embeddings and weights (about a terabyte!) is futile beyond belief. But that's what it would take to make a useful "classical logic" model that could have practical applications.

Notably, expert systems, which use this kind of approach, were worked on for decades and were almost total failures in the wider market because they were mostly useless.

[1] Not all diseases are caused by biological agents! That's a whole other rabbit hole to go down.

You're losing interpretability and scrutability, but gaining detail and expressiveness. You have no way to estimate the vectors in a causal framework, all known methods are correlational. You have no clean way to map the vectors to human concepts. Vectors are themselves extremely compressed representations, there is no clear threshold beyond which a representation becomes "proper".

That was very well said.

One quibble, and I really mean only one:

> a high-dimensional probabilistic causal framework

Deep learning models, aka neural network type models, are not probabilistic frameworks. While we can measure on the outside a probability of correct answers across the whole training set, or any data set, there is no probabilistic model.

Like a Pachinko game, you can measure statistics about it, but the game itself is topological. As you point out very clearly, these models perform topological transforms, not probabilistic estimations.

This becomes clear when you test them with different subsets of data. It quickly becomes apparent that the probabilities of the training set are only that. Probabilities of the exact training set only. There is no probabilistic carry over to any subset, or for generalization to any new values.

They are estimators, approximators, function/relationship fitters, etc. In contrast to symbolic, hard numerical or logical models. But they are not probabilistic models.

Even when trained to minimize a probabilistic performance function, their internal need to represent things topologically creates a profoundly "opinionated" form of solution, as opposed to being unbiased with respect to the probability measure. The measure never gets internalized.

What’s the relationship between what you’re saying and the concepts of “temperature” and “stochasticity”? The model won’t give me the same answer every time.

You are just adding random behavior to the system to create variation in response.

Random behavior in inputs, or in operations, results in random behavior in the outputs. But there is no statistical expression or characterization that can predict the distribution of one from the other.

You can't say, I want this much distribution in the outputs, so I will add this much distribution to the inputs, weights or other operational details.

Even if you create an exhaustive profile of "temperature" and output distributions across the training set, it will only be true for exactly that training set, on exactly that model, for exactly those random conditions. And will vary significantly and unpredictably across subsets of that data, or new data, and different random numbers injected (even with the same random distribution!).

Statistics are a very specific way to represent a very narrow kind of variation, or for a system to produce variation. But lots of systems with variation, such as complex chaotic systems, or complex nonlinear systems (as in neural models!) can defy robust or meaningful statistical representations or analysis.

(Another way to put this, is you can measure logical properties about any system. Such as if an output is greater than some threshold, or if two outputs are equal. The logical measurements can be useful, but that doesn't mean it is a logical system.

Any system with any kind of variation can have potentially useful statistical type measurements done on it. Any deterministic system can have randomness injected to create randomly varying output. But neither of those situations and measurements makes the system a statistically based system.)

The probability distribution that the model outputs is deterministic. The decoding method that uses that distribution to decide what next token to emit may or may not be deterministic. If we decide to define the decoding method as part of "the model", then I guess the model is probabilistic.
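
A minimal sketch of that split (the logits and temperature below are made up): the distribution is a pure function of the logits, and randomness enters only at the sampling step.

    import numpy as np

    def softmax(logits, temperature=1.0):
        z = np.asarray(logits, dtype=float) / temperature
        z -= z.max()                 # for numerical stability
        p = np.exp(z)
        return p / p.sum()

    logits = [2.0, 1.0, 0.1]         # deterministic model output for some prefix
    probs = softmax(logits, temperature=0.7)

    greedy  = int(np.argmax(probs))                        # deterministic decoding
    sampled = int(np.random.choice(len(probs), p=probs))   # stochastic decoding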

It's also worth noting that the parameters (weights and biases) of the model are random variables, technically speaking, and this can be considered probabilistic in nature. The parameter estimates themselves are not random variables, to state the obvious. The estimates are simply numbers.

Democritus (b 460BCE) said, “I would rather discover one cause than gain the kingdom of Persia,” which suggests that finding true causes is rather difficult.

"According to the Greek historian Herodotus, Xerxes's first attempt to bridge the Hellespont ended in failure when a storm destroyed the flax and papyrus cables of the bridges. In retaliation, Xerxes ordered the Hellespont (the strait itself) whipped three hundred times, and had fetters thrown into the water."

Not so sure one should take stories about who said something in ancient times at face value ;)

[1] https://en.wikipedia.org/wiki/Xerxes_I

If you think you can use logic to determine human behavior in the past, well, it doesn't even work for modern behavior, lol. You'd be surprised what kinds of beliefs about the world led to what kinds of actions in history.

Felix, qui potuit rerum cognoscere causas. [0]

Virgil.

[0] https://en.m.wikipedia.org/wiki/Felix,_qui_potuit_rerum_cogn...

Virgili, hoc postremo dico: mihi nomen non est Felix.

Or is less of a hassle.

Perhaps in a similar way that it is impossible to directly "observe" a wavefunction without collapsing it into an observable "effect".

I totally agree that, in the past, hammering out an ontology for a particular area just results in a common understanding between those who wrote the ontology and a large gulf between them and the people they want to use it (everyone else).

What's perhaps different is that the machine, via LLMs, can also have an 'opinion' on meaning or correctness.

Going full circle, I wonder what would happen if you got LLMs to define the ontology...

>what would happen if you got LLMs to define the ontology.

https://deepsense.ai/resource/ontology-driven-knowledge-grap...

>hammering out an ontology for a particular area just results in a common understanding between those who wrote the ontology and a large gulf between them and the people they want to use it

This is the other side of the bitter lesson, which is just the empirical observation of a phenomenon that was to be expected from first principles (algorithmic information theory): a program of minimal length must get longer if the reality it models becomes more complex.

For ontologists, the complexity of the task increases as the generality is maintained while model precision is increased (top down approach), or conversely, when precision is maintained the "glue" one must add to build up a bigger and bigger whole while keeping it coherent becomes more and more complex (bottom up approach).

Koller and Friedman write in "Probabilistic Graphical Models" about the "clarity test": state variables should be unambiguous to an all-seeing observer.

States like "human_activity" are not objectively measurable.

Fair enough, PGMs and causal models are not the same, but this way of thinking about state variables is an incredibly good filter.

> States like "human_activity" are not objectively measurable.

Well, or at least they would need a heavy dose of operationalisation.

But “disease => death” + AI => surely at least a few billion in VC funding.

The best thing about this statement is that it can be read as 'the fact that disease causes death, plus the application of AI, will surely lead to a few billion in VC funding', but it can also be read as 'disease is to death as AI is to a few billion in VC funding'. :D

Exactly. In some cases disease causes death. In others it causes immunity which in turn causes “good health” and postpones death.

Contradictory cause-effect examples, each backed up with data, are a reliable indicator of a class of situations that need a higher chain-effect resolution.

Which is directly usable knowledge if you are building out a causal graph.

In the meantime, a cause and effect representation isn't limited to only listing one possible effect. A list of alternate disjoint effects, linked to a cause, is also directly usable.

Just as an effect may be linked to different causes. Which if you only know the effect, in a given situation, and are trying to identify cause, is the same problem in reverse time.

It is my opinion that if we examine any factor closely, it will have multiple disjoint effects. As in, nothing is absolutely unilateral in its effects. Some of those effects will depend on certain conditions. If it is possible to specify conditions, annotations, and other nuances such as levels of confidence or the source of the opinion, such a database might be pretty useful.
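
For illustration, one possible shape of such an entry (every field name and number below is invented, not an existing schema):

    # Hypothetical record: one cause, several disjoint effects, each qualified
    # by a condition and a confidence, plus provenance for the claim.
    entry = {
        "cause": "disease",
        "effects": [
            {"effect": "death",     "condition": "severe, untreated course", "confidence": 0.2},
            {"effect": "immunity",  "condition": "recovery",                 "confidence": 0.7},
            {"effect": "no_change", "condition": "asymptomatic case",        "confidence": 0.1},
        ],
        "source": "opinion of author X",
    }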

Ontology, not ontologies, has been tried.

We have quite a good understanding that a system cannot be both sound and complete; regardless, people went straight in to make a single model of the world.

> a system cannot be both sound and complete

Huh, what do you mean by this? There are many sound and complete systems – propositional logic, first-order logic, Presburger arithmetic, the list goes on. These are the basic properties you want from a logical or typing system. (Though, of course, you may compromise if you have other priorities.)

None of these systems are both sound and complete.

First-order logic is sound, but not complete (i.e., I can express a set of strings you cannot recognize in first-order logic).

My take is that the GP was implicitly referring to Gödel’s Incompleteness Theorems, with the implication being that a system that reasons completely about all human topics and itself is not possible. Therefore, you’d need multiple such systems (plural) working in concert.

That doesn't make much sense.

If you take multiple systems and make them work in concert, you just get a bigger system.

> If you take multiple systems and make them work in concert, you just get a bigger system.

The conclusion may be wrong, but a "bigger system" can be larger than the sum of its constituents. So a system can have functions, and give rise to complexity, that none of its subsystems feature. An example would be the thinking brain, which is made out of neurons/cells incapable of thought, which are made out of molecules incapable of reproduction, which are made from atoms incapable of catalyzing certain chemical reactions, and so on.

This is just emergence, though? How is emergence related to completeness?

This happens over and over with the relatively new popularization of a theory: the theory is proposed to be the solution to every missing thing in the same rough conceptual vector.

It takes a lot more than just pointing in the general direction of complexity to propose the creation of a complete system, something which with present systems of understanding appears to be impossible.

> How is emergence related to completeness?

I didn't make that argument. I think, the original conclusion above isn't reasonable. However, "a concert" isn't "just" a bigger system either, which is my point.

It just depends on your definition of system, doesn’t it?

Sort of, the guardrail here IMO is you have an ontology processor that basically routes to a submodule, and if there isn't a submodule present it errors out. It is one large system, but it's bounded by an understanding of its own knowledge.

Concerts - again plural. And naturally you only bring in appropriate instruments.

Turtles all the way down?

A collection of systems is itself a system. The theorem would not recognize the distinction.

I believe that neither the expansion of Gödel's theorems to "everything", including non-formalized systems, nor the conclusion that a resolution lies in harnessing multiple systems in concert, is sound reasoning. I think it's a fallacious reductionism.

What is a non-formalized system?

I am very curious about this. In particular, if you are able to split systems into formalized and non-formalized, then I think there is quite some praise and a central spot in all future history books for you!

I am not a native speaker, so please don't get hung up on particular expressions.

I meant that the colloquial philosophies and general ontology are not the subject of Gödel's work. I think the aforementioned expansion is similar to finding evidence for telepathy in the pop-sci descriptions of quantum entanglement. Gödel's theorems cover axiomatic, formal systems in mathematics. To apply them to whatever, you first have to formalize whatever. Otherwise, it's an intuition/speculation, not sound reasoning. At least, that's my understanding.

Further reading: https://en.wikipedia.org/wiki/G%C3%B6del's_incompleteness_th...

Yep - when you use a multitude of systems, then some systems can be regarded as complete while other systems are sound.

This is in contrast to just one system that attempts to be sound and complete.

Could you define sound and complete in this context? IIRC Rust's borrow checker is sound (it will not mark something dysfunctional as functional) but not complete: some programs would take too long to verify, the checker times out, and compilation fails even though the program is potentially correct.

The meaning of the word person is ~sound (i.e. well defined) when two lawyers speak.

But when a doctor tells the lawyer that they operated on a person, the lawyer can reasonably say "huh" - the concept of a person has shifted with the context.

Agreed. About the strongest we can hope for are causal mechanisms, and most of those will be at most hypotheses and/or partial explanations that only apply under certain conditions.

Honestly, I don’t understand how these ontologies have persisted. Who is investing in this space, and why?

It's pretty easy to outline a high-level ontology and let LLMs annotate/link it into something pretty useful; you can even have a benchmark suite using that ontology, with an LLM as a judge, to progressively optimize it.

What is an ontology exactly? I see Palantir talking about it all the time and it just sounds like vague marketing.

These days anyone can spin up a developer account and check it out. Near as I could tell, you can create abstract 'objects' and link them to datasets/columns in the environment. And then you can link objects together. It's basically just an ER modeling tool, but they have great sales and seemed to have convinced people that they are constructing ontologies.

It comes from "the knowledge of being," and has been used to describe real-world knowledge representation, in particular hierarchical(-ish) semantic networks in AI since its early days.

When I see Palantir talk about it in a press release, is that something real or just fluffy marketing?

Could be real. Such knowledge representation has been used in many systems. In limited domains, it can be useful.

As I understand it, this is a dataset of claimed causation. It should contain vaccines->autism, not because it's true, but because someone, in public, claimed that it was.

So, by design, it's pretty useless for finding new, true causes. But maybe it's useful for something else, such as teaching a model what a causal claim is in a deeper sense? Or mapping out causal claims which are related somehow? Or conflicting? Either way, it's about humans, not about ontological truth.

Also, it seems to mistake some definitions as causes.

A coronavirus isn't "claimed" to cause SARS. Rather, SARS is the name given to the disease caused by a certain coronavirus. Or alternatively, SARS-CoV-1 is the name given to the virus which causes SARS. Whichever way you want to see it.

For a more obvious example, saying "influenza virus causes influenza" is a tautology, not a causal relationship. If influenza virus doesn't cause influenza disease, then there is no such thing as an influenza virus.

Yes, I agree there are a lot of definitions or descriptions masquerading as explanations, especially in medicine and psychology. I think maybe insurance has a lot to do with that. If you just describe a lot of symptoms, insurance won't know whether to cover it or not. But if you authoritatively name that symptom set as "BWZK syndrome" or something, and suddenly switch to assuming "BWZK syndrome" is a thing, the unknown cause of the symptoms, then insurance has something it can deal with.

But this description->explanation thing, whatever the reason, is just another error people make. It's not that different from errors like "vaccines cause autism". Any dataset collecting causal claims people make is going to contain a lot of nonsense.

I have some faith in this process. With enough facts, you get contradictions. Weighing contradicting vectors is a way of making decisions. So overall collecting a bunch of weakly connected facts might actually be useful. I'd like to see that in action.

I'm actively working with ontologies (disclaimer: as a researcher), and yours is the top comment, so I'll try to make some counterclaims here. No relation to this work tho.

> Ontologies and all that have been tried and have always been found to be too brittle.

I'd invite you to look at ontologies as nothing more than representations of things we know in some text-based format. If you've ever written an if statement, used OOP, trained a decision tree, or sketched an ER diagram, you've also represented known things in a particular text-based format.

We probably can agree that all these things are ubiquitous and provide value. It's just that those representations are not serialized as OWL/RDF, claim less about being accurate models of real-world things, and are often coupled with other things (i.e., functions).

This may seem reductionist in the sense of "we're all made of atoms", but I think it's important to understand why ontologies as a concept stick: they provide atomic components for expressing any knowledge in a dedicated place, and reasoning about it. Maybe the serializations, engines, results or creators suck, or maybe codebase + database is enough for most needs, but it's hard to not see the value of having some deterministic knowledge about a domain.

If you take _ontology_ to mean OWL/RDF, this paper wouldn't qualify, so I'm assuming you took the broader meaning (i.e., _semantic triples_).

> Take the examples from the front page (which I expect to be among the best in their set)

Most scientific work will be in-progress, not WordNet-level (which also needs a lot of funding to get there). You ideally want to show a very simple example, and then provide representative examples that signal the level of quality that other contributors/scientists can expect.

Here, they're explicit about creating triples of whatever causal statements they found on Wikipedia. I wouldn't expect it to be immediately useful to me, unless I dedicate time to prune and iron out things of interest.

> human_activity => climate change. Those are such broad concepts that it's practically useless.

Disagree. If you had one metric that aggregated different measurements of climate change-inducing human activity, and one metric that did the same for climate change, you could create some predictions about N-order effects from climate change. Statistical analysis anyway requires you to make an assumption about the causal relationship behind what you're investigating.

So, if this is the level of detail you need, this helps you potentially find new hypotheses just based on Nth-order causal relations in Wikipedia text. It's also valuable to show where there is not enough detail.

> Or disease => death. There's no nuance at all.

Aside from my point above - haven't looked at the source data, but I doubt it stops at that level. But even if it does, it's 11 million things with provenance you can play with or add detail to.

Or you can also show that your method or choice of source data gets more conceptual/causal detail out of Wikipedia, or that their approach isn't replicable, or that they did a bad job, etc. These are all very useful contributions.

I'm not sure trying to tease out high-integrity information from Wikipedia is a useful contribution at all. Our criterion of proof is whatever a private clique of wiki editors, or worse, their security-complex handlers, say? I feel like LLMs have already achieved this, and the results are about what you would expect.

> I'd invite you to look at ontologies as nothing more than representations of things we know in some text-based format.

That's because we know how to interpret the concepts used in these representations, in relation to each other. It's just a syntactic change.

You might have a point if it's used as a kind of search engine: "show me wikipedia articles where X causes Y?" (although there is at least one source besides wikipedia, but you get my drift).

> Aside from my point above - haven't looked at the source data, but I doubt it stops at that level.

It does. It isn't even a triple, it's a pair: (cause, effect). There's no other relation than "causes". And if I skimmed the article correctly, they just take noun phrases and slap an underscore between the words and call it a concept. There's no meaning attached to the labels.
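
Roughly what that labelling seems to amount to, as far as I can tell (my reconstruction for illustration, not their actual code):

    def to_concept(noun_phrase: str) -> str:
        # Lowercase the phrase and join its tokens with underscores.
        return "_".join(noun_phrase.lower().split())

    to_concept("increase in operating income")  # 'increase_in_operating_income'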

But the higher-order causations you mention are going to be pretty useless if there's no way to interpret them. It'll only work for highly specialized, unambiguous concepts, like myxomatosis (which is akin to encoding knowledge in the labels themselves), and the broad nature of many of the concepts will lead to quickly decaying usefulness as the length of the path increases. Here are some random examples (lengths 4 and 8, no posterior selection) from their "precision" set (197k pairs):

    ['mistake', 'deaths', 'riots', 'violence']
    ['higher_operating_income', 'increase_in_operating_income', 'increase_in_net_income', 'increase']
    ['mail_delivery', 'delays', 'decline_in_revenue', 'decrease']
    ['wastewater', 'environmental_problems', 'problems', 'treatment']
    ['sensor', 'alarm', 'alarm', 'alarm']
    ['thatch', 'problems', 'cost_overruns', 'project_delays']
    ['smoking_pot', 'lung_cancer', 'shortness_of_breath', 'conditions']
    ['older_medications', 'side_effects', 'physical_damage', 'loss']
    ['less_fat', 'weight_loss', 'death', 'uncertainties']
    ['diesel_particles', 'cancer', 'damages', 'injuries']
    ['malfunction_in_the_heating_unit', 'fire', 'fire_damage', 'claims']
    ['drug-resistant_malaria', 'deaths', 'violence', 'extreme_poverty']
    ['fairness_in_circumstances', 'stress', 'backache', 'aching_muscles']
    ['curved_spine', 'back_pain', 'difficulties', 'stress', 'difficulties', 'delay', 'problem', 'serious_complications']
    ['obama', 'high_gas_prices', 'recession', 'hardship', 'happiness', 'success', 'promotions', 'bonuses']
    ['financial_devastation', 'bankruptcy', 'stigma', 'homelessness', 'health_problems', 'deaths', 'pain', 'quality_of_life']
    ['methylmercury', 'neurological_damage', 'seizures', 'changes', 'crisis', 'growth', 'problems', 'birth_defects']
The latter is probably correct, but the chain of reasoning is false...

This one is cherry-picked, but I found it too funny to omit:

    ['agnosticism', 'despair', 'feelings', 'aggression', 'action', 'riot', 'arrest', 'embarrassment', 'problems', 'black_holes']

This might be of at least some value to augment training LLMs? I spent a lot of time in the 1980s and early 1990s using symbolic AI techniques: conceptual dependency, NLP, expert systems, etc. While two large and well-funded expert system projects I worked on (paid for by DARPA and PacBell) worked well, mostly symbolic AI was brittle and required what seemed like an infinite amount of human labor.

LLMs are such a huge improvement that the only real use I see for projects like CauseNet, the defunct OpenCyc project, etc., might be as a little extra training data.

Might as well go ahead and add https://tylervigen.com/spurious-correlations?page=135 from the looks of it.

This reminds me of an article I read that was posted on HN only a few days ago: Uncertain<T>[1]. I think that a causality graph like this necessarily needs a concept of uncertainty to preserve nuance. I don't know whether this would be practical in terms of compute, but I'd think combining traditional NLP techniques with LLM analysis may make it so?

[1] https://github.com/mattt/Uncertain
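
A minimal sketch of one way an edge in such a graph could carry uncertainty (the class and numbers below are invented for illustration; this is not the Uncertain<T> API):

    from dataclasses import dataclass

    @dataclass
    class CausalClaim:
        cause: str
        effect: str
        supporting: int      # sources asserting the claim
        contradicting: int   # sources disputing it

        def belief(self) -> float:
            # Posterior mean of a Beta(1 + supporting, 1 + contradicting) model.
            return (1 + self.supporting) / (2 + self.supporting + self.contradicting)

    claim = CausalClaim("smoking", "lung_cancer", supporting=120, contradicting=3)
    print(round(claim.belief(), 3))  # 0.968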

I get some vibes of fuzzy logic from this project.

Currently a lot of research goes in the direction of distinguishing "data uncertainty" and "measurement uncertainty", or "aleatoric/epistemic" uncertainty.

I found this tutorial (for computer vision) to be very intuitive; it gives a good understanding of how to use those concepts in other fields: https://arxiv.org/abs/1703.04977

Right. The first example on the site shows disease as a cause, and death as an effect. This is wrong on several levels: there is no such thing as healthy or sick; you're always fighting off something, it just becomes obvious sometimes. Also, a disease doesn't necessarily lead to death, obviously.

Since you're always going to die, the problem is solved - the implication is true by the right side always being true, and the left side doesn't matter.

Then it’s correlation instead of causation and the entire premise of a causation graph is moot.

> CauseNet aims at creating a causal knowledge base that comprises all human causal knowledge and to separate it from mere causal beliefs

Pretty bold to use a picture of philosophers as your splash page and then make a casual claim like this. To say the least, this is an impossible task!

The tech looks cool and I'm excited to see how I might be able to work it into my stuff and/or contribute. But I'd encourage the authors to rein in the rhetoric...

Indeed. I can't take an epistemology project seriously if it has no humility.

Building a perfectly accurate model of the world isn't possible. We need to create tools that make it easier for regular people to build more accurate models, not delude ourselves with dreams of perfection.

Well of course because no such model of the world can or does exist

This made me think of a much more interesting project. A compendium of information automatically extracted from research articles.

Essentially one totalizing meta analysis.

E.g. If it reads an article about the relationship between height and various life outcomes in Indonesian men, then first, it would store the average height of Indonesian men, the relationship between the average height of Indonesian men and each life outcome in Indonesian men, the type of relationship (e.g. Pearson's correlation), the relationship values (r value), etc. It would store the entity, the relationship, the relationship values, and the doi source.

Something like a quantitative Wikipedia.
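
A rough sketch of what one extracted record could look like (field names and values are mine, made up for illustration, not an existing schema):

    from dataclasses import dataclass

    @dataclass
    class ExtractedFinding:
        population: str      # e.g. "Indonesian men"
        variable_a: str      # e.g. "height"
        variable_b: str      # e.g. "income"
        relation_type: str   # e.g. "pearson_correlation"
        value: float         # e.g. r = 0.21 (made-up number)
        sample_size: int
        doi: str             # source article

    finding = ExtractedFinding("Indonesian men", "height", "income",
                               "pearson_correlation", 0.21, 1250, "10.1234/example")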

Why not use PROLOG then? It is the essence of cause and effect in programming, and it can also expound syllogisms.

The conditional relation represented in prolog, and in any deductive system, is material implication (~PvQ), not causation. You can encode causal relationships with material implication but you’re still going to need to discover those causal relationships in the world somehow.

Conditional statements don't really work because "if A, then B" means that A is sufficient for B, but "A causes B" doesn't imply that A is sufficient for B. E.g. in "Smoking causes cancer", where smoking is a partial cause for cancer, or cancer partially an effect of smoking.

"A causes B" usually implies that A and B are positively correlated, i.e. P(A and B) > P(A)×P(B), but even that isn't always the case, namely when there is some common cause which counteracts this correlation.

Thinking about this, it seems that if A causes B, the correlation between A and B is at least stronger than it would have been otherwise.

This counterfactual difference in correlation strength is plausibly the "causal strength" between A and B. Though it doesn't indicate the causal direction, as correlation is symmetric.
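
A toy simulation of that correlation point, where A raises the probability of B without being sufficient or necessary for it (all probabilities are made up):

    import random

    random.seed(0)
    n = 100_000
    a_count = b_count = ab_count = 0
    for _ in range(n):
        a = random.random() < 0.3                   # A happens 30% of the time
        b = random.random() < (0.5 if a else 0.1)   # A raises the chance of B
        a_count += a
        b_count += b
        ab_count += a and b

    p_a, p_b, p_ab = a_count / n, b_count / n, ab_count / n
    # The inequality is symmetric in A and B, so it says nothing about direction.
    print(p_ab > p_a * p_b)   # True: P(A and B) > P(A) * P(B)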

I didn't say one does not need to discover the causal relationships, but once discovered, such relationships can be explored, followed, and _inferred_ on in a very syllogistic manner. My comment was really about the proposal in the article.

On the other hand, what we seem to have with LLM models, and the transformer approach in particular, is a sort of probable statistical correlation, calculated by brute force and approximation (gradient descent). So this is not true causation either; it becomes causation only after a human observes it and agrees it follows certain causality.

/Not sure whether I can state that it is also material in another, non-logical sense (perhaps that would sound nonsensical), but the apparent logical structure in the LLM's output emerges from training patterns, not from explicit logical operations./

There's nothing wrong with having a graphical structure which models causality, and of course, this needs to be discovered first. But then we have LZW/Sequitur using a very brute-force way to find the minimal grammar for compressing certain data losslessly, thus discovering some logical structure (and correlation), but this is not yet causation. Indeed, finding patterns != finding causal relationships.

My gut feeling is we want something that would result in a correct PROLOG-like set of inference rules, but based on actual causality, not conflated correlation. And then do this for a larger corpus - the world's knowledge - but we don't have the means (yet) to figure out the correlation, even though approaches exist for smaller corpora.

It is perhaps the gradient descent, and the fact that this composition of tensor algebra is differentiable, that is the ingenious thing about the ML we deal with now, but everyone is dreaming of some magic algorithm which would allow finding the causation so that it results in a non-probabilistic graphical model, or at least a model whose stochastic branching we can follow in an observable manner.

It is indeed ingenious to fold multi-dimensional spaces, multiple times, in order to disambiguate the curvature of a bunny's ear from that of a bear's ear. But it just does not feel right to do logic and causation by means of differential calculus and stochastic structures.

"The map is not the territory" ensures that bias and mistakes are inextricable from the entire AI project. I don't want to get all Jaron Lanier about it, but they're fundamental terms in the vocabulary of simulated intelligence.

Reminds me of the early attempts at hand-categorising knowledge for AI.

The associated paper references Judea Pearl's theories on causality, but curiously doesn't mention the DoWhy implementation [0], which seems to have some recognition in the causal inference space.

[0] https://github.com/py-why/dowhy

I wonder how they will quantize causality. Sometimes a particular cause has different, and even opposite, effects.

Alcohol causes anxiety. At the same time it causes relaxation. These effects depend on time frame, and many individual circumstances.

This is a single example but the world is full of them. Codifying causality will involve a certain amount of bias and belief. That does not lead to a better world.

I find the simple expression of "a causes b", as in this database, without qualification, not very helpful. At the least, we need causal graphs / causal diagram loops to describe these causal relationships better.

[0] https://en.wikipedia.org/wiki/Causal_graph
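
As a minimal sketch of the kind of qualified graph I mean (using networkx; the edge attributes and the alcohol example from above are filled in purely for illustration):

    import networkx as nx

    g = nx.DiGraph()
    # Edges carry qualifications instead of a bare "causes" label.
    g.add_edge("alcohol", "relaxation", condition="short term", strength="moderate")
    g.add_edge("alcohol", "anxiety", condition="long term / withdrawal", strength="moderate")
    g.add_edge("smoking", "lung_cancer", mechanism="carcinogen exposure", strength="strong")

    for _, effect, attrs in g.out_edges("alcohol", data=True):
        print("alcohol ->", effect, attrs)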

Harvard has a free course about it: https://www.edx.org/learn/data-analysis/harvard-university-c...

It's nice to see more semantic web experiments. I always wanted to do more reasoning with ontologies, etc., and it's such an amazing idea to reference objects/persons/locations/concepts from the real world with URIs and just add labeled arrows between them.

This is such a cool schemaless approach and has so much potential for open data linking, classical reasoning, LLM reasoning. But open data (together with RSS) has been dead for a while as all big companies have become just data hoarders. And frankly, while the concept and the possibilities are so cool, the graph databases are just not that fast and also not fun to program.

The semantic web / OWL was always way too heavy to imagine humans using; you could imagine AI doing the heavy lifting here, though...

Reminds me of the Cyc project. https://en.wikipedia.org/wiki/Cyc

Organizing all knowledge requires a flexible system of organization (starting with how the categories are organized and accessed, not the data).

Random thoughts about organizing knowledge:

- Categories need fractal structures.

- Categories need to be available as subcategories to other categories as a pattern.

- Words need to be broken down into base concepts and used as patterns.

- Social information and context alter the meaning of words in many cases, so any semantic web without a control system has limited use as an organization tool.

I don't know if it's inadvertent, but it's headed toward just becoming an engine for overfitted generalizations. Each causal pair will just emerge based on frequency, which will reinforce itself in preemptively and prematurely classifying all future information.

Unfortunately, frequency is the primary way AI works, but it will never be accurate for causality because causality always has the dynamic that things can happen just “because”. It’s hacked into LLMs via deliberate randomness in next-token prediction.

This is difficult, but then I just had someone earnestly inform me that the covid virus doesn't cause covid, so I think there's a need here, if only to have an automated way of identifying idiots.

As a trader, I’m always trying to piece together all the causal factors that move markets. It’s like building a mental map of cause and effect so I can make sharper, faster decisions

The fact that they are using Wikipedia for a primary data source exempts them from any further serious consideration.

A cool idea, in desperate need of an example use case.

I was hoping this would be actual normalized time series data and correlation ratios. Such a dataset would be interesting for forecasting.

I know it's a reductive take to point to a single mistake and act like the whole project might be a bit futile (maybe it's a rarity) but this example in their sample is really quite awful if the idea is to give AI better epistemics:

    {
        "causal_relation": {
            "cause": {
                "concept": "vaccines"
            },
            "effect": {
                "concept": "autism"
            }
        }
    },
... seriously? Then again, they do say these are just "causal beliefs" expressed on the internet, but it seems like some stronger filtering of which beliefs to adopt ought to be exercised for any downstream use case.

In the precision dataset, you can find the sentences that led to this; some are:

>> "Even though the article was fraudulent and was retracted, 1 in 4 parents still believe vaccines can cause autism."

>> On 28 February 1998 Horton published a controversial paper by Dr. Andrew Wakefield and 12 co-authors with the title "Ileal-lymphoid-nodular hyperplasia, non-specific colitis, and pervasive developmental disorder in children" suggesting that vaccines could cause autism.

>> He was opposed by vaccine critics, many of whom believe vaccines cause autism, a belief that has been rejected by major medical journals and professional societies.

None of the ones I've seen actually say that vaccines cause autism.

Oh, ouch, yeah. We already know that misinformation tends to get amplified, the last thing we need is a starting point full of harmful misinformation. There are lots of "causal beliefs" on the internet that should have no place in any kind of general dataset.

It's even worse than that, because the way they extract the causal link is just a regex, so

"vaccines > autism"

because

"Even though the article was fraudulent and was retracted, 1 in 4 parents still believe vaccines can cause autism."

I think this could be solved much better by using even a modestly powerful LLM to do the causal extraction... The website claims "an estimated extraction precision of 83%", but I doubt this is an even remotely sensible estimate.
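
To make the failure mode concrete, here is a deliberately naive stand-in pattern (not CauseNet's actual pattern set):

    import re

    # A naive "X cause(s) Y" pattern, for illustration only.
    pattern = re.compile(r"(\w+)\s+(?:can\s+)?causes?\s+(\w+)", re.IGNORECASE)

    sentence = ("Even though the article was fraudulent and was retracted, "
                "1 in 4 parents still believe vaccines can cause autism.")

    for cause, effect in pattern.findall(sentence):
        print(cause, "->", effect)   # vaccines -> autism; the belief-report context is lost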

Can't an LLM extract this type of information with reasonably high accuracy?

The Cyc of this current AI winter.

Causality is literally impossible to deduce...

Wittgenstein is calling

I wonder what this is for.

this will be super cool if it can be done!

I think this is many years old
