I know it's a reductive take to point to a single mistake and act as if the whole project is futile (maybe this example is a rarity), but this one from their sample is really quite awful if the idea is to give AI better epistemics:

    {
        "causal_relation": {
            "cause": {
                "concept": "vaccines"
            },
            "effect": {
                "concept": "autism"
            }
        }
    },
... seriously? Then again, they do say these are just "causal beliefs" expressed on the internet, but it seems like stronger filtering of which beliefs to adopt ought to be exercised for any downstream use case.

The precision dataset includes the sentences that led to this; some are:

>> "Even though the article was fraudulent and was retracted, 1 in 4 parents still believe vaccines can cause autism."

>> On 28 February 1998 Horton published a controversial paper by Dr. Andrew Wakefield and 12 co-authors with the title "Ileal-lymphoid-nodular hyperplasia, non-specific colitis, and pervasive developmental disorder in children" suggesting that vaccines could cause autism.

>> He was opposed by vaccine critics, many of whom believe vaccines cause autism, a belief that has been rejected by major medical journals and professional societies.

None of the sentences I've seen actually asserts that vaccines cause autism.

Oh, ouch, yeah. We already know that misinformation tends to get amplified; the last thing we need is a starting point full of harmful misinformation. There are lots of "causal beliefs" on the internet that should have no place in any kind of general dataset.

It's even worse than that: the causal link is extracted with just a regex, so you get

"vaccines > autism"

from

"Even though the article was fraudulent and was retracted, 1 in 4 parents still believe vaccines can cause autism."
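To see how a pattern-based extractor falls into this trap, here is a minimal sketch. The regex below is my own guess at what a naive "X cause(s) Y" extractor might look like, not the project's actual pattern, but it shows the failure mode: the causal phrase matches even though the sentence only reports a belief.

```python
import re

# Naive "X cause(s) Y" pattern (illustrative, not the project's real regex).
PATTERN = re.compile(r"(\w+)\s+(?:can\s+)?causes?\s+(\w+)")

sentence = ("Even though the article was fraudulent and was retracted, "
            "1 in 4 parents still believe vaccines can cause autism.")

# The surrounding "parents still believe" context is invisible to the regex.
match = PATTERN.search(sentence)
print(match.groups())  # → ('vaccines', 'autism')
```

The extractor has no way to distinguish "vaccines cause autism" from "parents believe vaccines cause autism": the match window never includes the attribution.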

I think this could be solved much better by using even a modestly powerful LLM to do the causal extraction... The website claims "an estimated extraction precision of 83%", but I doubt that's an even remotely sensible estimate.