>"they struggle to replicate this performance in hospital conditions"
Are there systematic reasons why radiologists in hospitals are inaccurately assessing the AI's output? If the AI models are better than humans on novel test data, then the thing that has changed in a hospital setting compared to the AI-human testing environment is not the AI; it is the human, under less controlled conditions, additional pressures, workloads, etc. Perhaps the AIs aren't performing as poorly as thought. Perhaps this is why they performed better to begin with. Otherwise, production ML systems are generally not as highly regarded as these models are when they perform this far below their test-set results in production. Some drop-off is expected, but "struggle to replicate" implies more.
>"Most tools can only diagnose abnormalities that are common in training data"
Well yes, training on novel examples is one thing. Handling something categorically different is another thing altogether. There are also thresholds of detection: detecting nothing, detecting with lower confidence, flagging an unknown anomaly, a false positive, etc. How much of the "inaccuracy" isn't actually wrong, but simply something that gets amended or expanded upon when reviewed? Some details here would be useful.
I'm highly skeptical when generalized statements exclude directly relevant information about the very thing they're referring to. The few sources provided don't cover model accuracy at all, and the primary factor cited as problematic with AI review (lack of diversity in study composition across women, ethnic variation, and children) links to a meta-study that was not at all related to the composition of models and their training data sets.
The article begins as what appears to be a criticism of AI accuracy, with the thinness outlined above, but then quickly moves on to "but that's not what radiologists do anyway" and provides a categorical % breakdown of time spent, where Personal/Meetings/Meals plus some mixture of the other categories combine to form at least a third that could be labeled "time where the human isn't necessary if the images are being interpreted by models."
I'm not saying there aren't points here, but overall it reads like the hand-wavy meandering of someone trying to gatekeep a profession whose services could be far more widely utilized with more automation. Sure, perhaps at even higher quality with more radiologists to boot, but perfect is the enemy of the good on that score, and in the meantime we carry enormous costs and delays in service.
Poorer performance in real hospital settings has more to do with the introduction of new/unexpected/poor-quality data (i.e. real-world data) that the model was not trained on or optimized for. They still do very well generally, but often do not match the performance reported in FDA submissions or marketing materials. This does not mean they aren’t useful.
Clinical AI also has to balance accuracy with workflow efficiency. It may be technically most accurate for a model to report every potential abnormality with an associated level of certainty, but this may inundate the radiologist with spurious findings that must be reviewed and rejected, slowing her down without adding clinical value. More data is not always better.
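To put rough numbers on that trade-off, here's a toy sketch (entirely made-up score distributions and prevalence, nothing from a real product) of how the operating threshold alone changes how many spurious flags land on the radiologist's desk:

    # Toy illustration: lowering the threshold catches more true findings
    # but piles up low-value flags the radiologist still has to review.
    import numpy as np

    rng = np.random.default_rng(0)
    n_studies = 1000
    has_finding = rng.random(n_studies) < 0.05               # assumed 5% prevalence
    scores = np.where(has_finding,                           # assumed score distributions
                      rng.normal(0.8, 0.15, n_studies),
                      rng.normal(0.3, 0.20, n_studies))

    for threshold in (0.3, 0.5, 0.7):
        flagged = scores >= threshold
        sensitivity = (flagged & has_finding).sum() / has_finding.sum()
        false_flags = int((flagged & ~has_finding).sum())
        print(f"threshold {threshold}: sensitivity {sensitivity:.2f}, "
              f"false flags to review {false_flags}")

The exact numbers are meaningless; the point is just that the knob exists and someone has to choose where to set it.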
In order for the model to have high enough certainty to get the right balance of sensitivity and specificity to be useful, many, many examples are needed for training, and with some rarer entities that is difficult. It also inherently reduces the value of the model if it is only expected to identify its target disease 3 times/year.
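And the base-rate arithmetic is brutal for rare targets. A back-of-the-envelope positive predictive value calculation (all numbers assumed purely for illustration):

    # Bayes / PPV with assumed numbers: even a very specific model yields
    # mostly false alarms when the target condition is rare.
    sensitivity = 0.95
    specificity = 0.99
    prevalence = 0.0005      # assumed: a handful of cases per year at one site

    ppv = (sensitivity * prevalence) / (
        sensitivity * prevalence + (1 - specificity) * (1 - prevalence))
    print(f"PPV = {ppv:.3f}")  # ~0.045, i.e. roughly 1 true hit per 22 alerts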
That’s not to say advances in AI won’t overcome these problems, just that they haven’t, yet.
For anomaly systems like this, is it effective to invert the problem by not including the ailment/problem in the training data at all, and then looking for a "confused" signal rather than an "x% probability of ailment" type signal?
On that, I'm not sure. My area of ML & data science practice is, thankfully, not so high-stakes. There's a method of anomaly detection called a one-class SVM (Support Vector Machine) that is pretty much this: train on normal, flag on "wtf is this you never training me on this 01010##" <-- Not actual ISO standard ML model output or medical jargon. But I'm not sure if that's what's most effective here. My gut instinct in first approaching the task would be to throw a bunch of models at it, mixed-methods, with a one-class SVM as a fallback. But I'm also way out of my depth on medical diagnostics ML, so that's just a generalist's guess.
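For what it's worth, here's roughly what I mean, using scikit-learn's OneClassSVM on toy feature vectors (the data and hyperparameters are placeholders, not a claim about how any actual diagnostic product works):

    # One-class SVM sketch: fit only on "normal" examples, then flag
    # anything the model considers out of distribution.
    import numpy as np
    from sklearn.svm import OneClassSVM

    rng = np.random.default_rng(42)
    normal_train = rng.normal(0, 1, size=(500, 16))   # stand-in for features of normal scans
    mixed_test = np.vstack([
        rng.normal(0, 1, size=(20, 16)),              # more normals
        rng.normal(4, 1, size=(5, 16)),               # the "wtf is this" cases
    ])

    ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale")
    ocsvm.fit(normal_train)

    pred = ocsvm.predict(mixed_test)                  # +1 = looks normal, -1 = anomaly
    print("flagged as anomalous:", np.where(pred == -1)[0])

In practice you'd run this on learned image features rather than raw pixels, and tune nu to your tolerance for false alarms, but that's the basic shape of the "confused signal" idea.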