I'm actively working with ontologies (disclaimer: as a researcher), and yours is the top comment, so I'll try to make some counterclaims here. No relation to this work, though.
> Ontologies and all that have been tried and have always been found to be too brittle.
I'd invite you to look at ontologies as nothing more than representations of things we know in some text-based format. If you've ever written an if statement, used OOP, trained a decision tree, or sketched an ER diagram, you've also represented known things in a particular text-based format.
We can probably agree that all these things are ubiquitous and provide value. It's just that those representations are not serialized as OWL/RDF, make weaker claims about being accurate models of real-world things, and are often coupled with other things (e.g., functions).
This may seem reductionist in the sense of "we're all made of atoms", but I think it's important to understand why ontologies as a concept stick: they provide atomic components for expressing any knowledge in a dedicated place, and for reasoning about it. Maybe the serializations, engines, results, or creators suck, or maybe codebase + database is enough for most needs, but it's hard not to see the value of having some deterministic knowledge about a domain.
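To make that concrete, here's a toy sketch (entirely my own, nothing to do with this paper) of the same fact baked into code versus held as triples in a dedicated place, with a trivial reasoner on top:

```python
# A toy sketch of knowledge-as-triples; all names are illustrative.

# The same fact, baked into code...
def can_fly(animal: str) -> bool:
    return animal in {"sparrow", "eagle"}

# ...and held as triples in a dedicated place:
triples = {
    ("sparrow", "is_a", "bird"),
    ("eagle", "is_a", "bird"),
    ("bird", "capable_of", "flight"),
}

def capabilities_of(entity: str) -> set[str]:
    """Derive capabilities by walking the is_a hierarchy."""
    caps = {o for (s, p, o) in triples if s == entity and p == "capable_of"}
    for (s, p, o) in triples:
        if s == entity and p == "is_a":
            caps |= capabilities_of(o)  # inherit from parents
    return caps

print(capabilities_of("sparrow"))  # {'flight'}
```

The point isn't the toy reasoner; it's that the knowledge lives in one inspectable place instead of being smeared across function bodies.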
If you take _ontology_ to mean OWL/RDF, this paper wouldn't qualify, so I'm assuming you took the broader meaning (i.e., _semantic triples_).
> Take the examples from the front page (which I expect to be among the best in their set)
Most scientific work will be in progress, not WordNet-level (which also needed a lot of funding to get there). You ideally want to show a very simple example, and then provide representative examples that signal the level of quality other contributors/scientists can expect.
Here, they're explicit about creating triples from whatever causal statements they found on Wikipedia. I wouldn't expect it to be immediately useful to me unless I dedicated time to prune and iron out the things of interest.
> human_activity => climate change. Those are such broad concepts that it's practically useless.
Disagree. If you had one metric that aggregated different measurements of climate-change-inducing human activity, and another metric that did the same for climate change, you could make predictions about Nth-order effects of climate change. Statistical analysis requires you to make assumptions about the causal relationships behind what you're investigating anyway.
So, if this is the level of detail you need, this helps you potentially find new hypotheses based purely on Nth-order causal relations in Wikipedia text. It's also valuable to show where there is not enough detail.
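As a sketch of what I mean, Nth-order effects are just a traversal over the extracted pairs. All pairs below are hypothetical stand-ins in the style of the dataset, not taken from it:

```python
from collections import defaultdict

# Hypothetical (cause, effect) pairs in the style of the dataset;
# none of these are taken from the actual data.
pairs = [
    ("human_activity", "climate_change"),
    ("climate_change", "sea_level_rise"),
    ("climate_change", "drought"),
    ("drought", "crop_failure"),
    ("crop_failure", "migration"),
]

graph = defaultdict(list)
for cause, effect in pairs:
    graph[cause].append(effect)

def nth_order_effects(cause: str, n: int) -> set[str]:
    """Effects reachable in exactly n causal steps from `cause`."""
    frontier = {cause}
    for _ in range(n):
        frontier = {e for c in frontier for e in graph[c]}
    return frontier

print(nth_order_effects("human_activity", 3))  # {'crop_failure'}
```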
> Or disease => death. There's no nuance at all.
Aside from my point above - I haven't looked at the source data, but I doubt it stops at that level. But even if it does, it's 11 million things with provenance that you can play with or add detail to.
You could also show that your method or choice of source data gets more conceptual/causal detail out of Wikipedia, or that their approach isn't replicable, or that they did a bad job, etc. These are all very useful contributions.
I’m not sure trying to tease out high-integrity information from Wikipedia is a useful contribution at all. Our criterion of proof is whatever a private clique of wiki editors, or worse, their security-complex handlers say? I feel like LLMs have already achieved this, and the results are about what you would expect.
> I'd invite you to look at ontologies as nothing more than representations of things we know in some text-based format.
That's because we already know how to interpret the concepts used in those representations, in relation to each other. Moving between the formats is just a syntactic change.
You might have a point if it's used as a kind of search engine: "show me Wikipedia articles where X causes Y" (although there is at least one source besides Wikipedia, but you get my drift).
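Which, to be fair, would be cheap to support. Something like this, assuming (hypothetically) that each pair keeps its source article as provenance:

```python
# A sketch of the search-engine use; toy code with a hypothetical
# record format of (cause, effect, source_article).
records = [
    ("smoking", "lung_cancer", "Smoking"),
    ("smoking", "lung_cancer", "Lung cancer"),
    ("deforestation", "soil_erosion", "Deforestation"),
]

def articles_where(cause: str, effect: str) -> list[str]:
    """Answer 'show me Wikipedia articles where X causes Y'."""
    return [article for c, e, article in records
            if c == cause and e == effect]

print(articles_where("smoking", "lung_cancer"))
# ['Smoking', 'Lung cancer']
```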
> Aside from my point above - haven't looked at the source data, but I doubt it stops at that level.
It does. It isn't even a triple, it's a pair: (cause, effect). There's no relation other than "causes". And if I skimmed the article correctly, they just take noun phrases, slap an underscore between the words, and call it a concept. There's no meaning attached to the labels.
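So, as far as I can tell, the whole representation amounts to something like this (my reconstruction from skimming, not their code):

```python
# My reconstruction of the representation from skimming the
# article; not their actual extraction code.

def to_concept(noun_phrase: str) -> str:
    """A 'concept' is just the noun phrase with underscores."""
    return "_".join(noun_phrase.lower().split())

# The unit of data is a bare pair; the only relation is "causes",
# and nothing else is attached to the labels.
pair = (to_concept("human activity"), to_concept("climate change"))
print(pair)  # ('human_activity', 'climate_change')
```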
But the higher-order causations you mention are going to be pretty useless if there's no way to interpret them. It'll only work for highly specialized, unambiguous concepts like myxomatosis (which is akin to encoding knowledge in the labels themselves), and the broad nature of many of the concepts will lead to quickly decaying usefulness as the path length increases. Here are some random examples (length 4 and 8, no posterior selection) from their "precision" set (197k pairs):
The latter is probably correct, but the chain of reasoning is false... This one is cherry-picked, but I found it too funny to omit: