From the article:

> Three things explain this. First, while models beat humans on benchmarks, the standardized tests designed to measure AI performance, they struggle to replicate this performance in hospital conditions. Most tools can only diagnose abnormalities that are common in training data, and models often don’t work as well outside of their test conditions. Second, attempts to give models more tasks have run into legal hurdles: regulators and medical insurers so far are reluctant to approve or cover fully autonomous radiology models. Third, even when they do diagnose accurately, models replace only a small share of a radiologist’s job. Human radiologists spend a minority of their time on diagnostics and the majority on other activities, like talking to patients and fellow clinicians.
Another key extract from the article:
> The performance of a tool can drop as much as 20 percentage points when it is tested out of sample, on data from other hospitals. In one study, a pneumonia detection model trained on chest X-rays from a single hospital performed substantially worse when tested at a different hospital.
That screams of overfitting to the training data.
Because that is literally what is happening. I did a bit of work developing radiological models, and the sample ratio of healthy to malignant is usually about 4 to 1. You then modify the loss function to weight malignant cases more heavily (you are quite often working with datasets as small as 500 images, so an 80/20 training/validation split leaves you with roughly 80 malignant examples to train on). That means that as soon as you evaluate on a realistic population, where a specific condition might appear in 1 in 100 or 1 in 1,000 cases, the false positives make the model practically useless.
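To make the two effects concrete, here is a minimal sketch in PyTorch (the 4:1 `pos_weight` and the 90% sensitivity/specificity figures are illustrative assumptions, not numbers from the comment above): up-weighting the rare malignant class in the loss, and the base-rate arithmetic that wipes out positive predictive value once prevalence drops to 1 in 1,000.

```python
import torch
import torch.nn as nn

# --- 1) Up-weighting the rare (malignant) class in the loss ---
# With roughly 4 healthy images per malignant one, a common trick is to scale
# the positive class in binary cross-entropy so each malignant example counts
# about 4x as much in the gradient.
pos_weight = torch.tensor([4.0])                   # healthy-to-malignant ratio (assumed)
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

logits = torch.randn(16, 1)                        # dummy model outputs
labels = torch.randint(0, 2, (16, 1)).float()      # dummy ground truth
loss = criterion(logits, labels)

# --- 2) Why false positives dominate at realistic prevalence ---
# Even a classifier with 90% sensitivity and 90% specificity (optimistic for a
# model trained on ~80 malignant examples) has a dismal positive predictive
# value once the condition appears in only 1 of 1,000 scans.
def positive_predictive_value(sensitivity, specificity, prevalence):
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

for prev in (0.2, 0.01, 0.001):                    # curated test set vs. realistic screening
    ppv = positive_predictive_value(0.9, 0.9, prev)
    print(f"prevalence {prev:>6}: PPV = {ppv:.1%}")
# prevalence    0.2: PPV = 69.2%
# prevalence   0.01: PPV = 8.3%
# prevalence  0.001: PPV = 0.9%
```

A model that looks strong on a balanced test set flags mostly false alarms at screening prevalence, which is exactly the "practically useless" effect described above.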
Of course, SOTA models are much better, but medical data is difficult and expensive to obtain, so there is not much of it.