Poorer performance in real hospital settings has more to do with the introduction of new, unexpected, or poor-quality data (i.e. real-world data) that the model was not trained on or optimized for. Models still do well generally, but often do not match the performance reported in FDA submissions or marketing materials. This does not mean they aren’t useful.
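
A minimal synthetic sketch of that effect (everything here is an illustrative assumption, not any real product: the data, the noise model standing in for scanner/protocol variation, and the classifier):

```python
# Toy illustration of distribution shift: a classifier fit on curated
# development data loses discrimination on noisier "real-world" data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def make_data(n, noise_scale):
    # Two informative features; noise_scale mimics acquisition variability
    # (different scanners, protocols, patient populations).
    y = rng.integers(0, 2, n)
    X = rng.normal(loc=y[:, None] * 1.5, scale=1.0, size=(n, 2))
    X += rng.normal(scale=noise_scale, size=X.shape)
    return X, y

X_train, y_train = make_data(5000, noise_scale=0.0)  # curated trial data
X_clean, y_clean = make_data(2000, noise_scale=0.0)  # held out, same distribution
X_shift, y_shift = make_data(2000, noise_scale=2.0)  # messier hospital data

model = LogisticRegression().fit(X_train, y_train)
print("AUC, in-distribution:", roc_auc_score(y_clean, model.predict_proba(X_clean)[:, 1]))
print("AUC, shifted data:   ", roc_auc_score(y_shift, model.predict_proba(X_shift)[:, 1]))
```

The same fitted model scores around 0.93 AUC on data like its training set and around 0.75 on the shifted set, without a single line of the model changing.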

Clinical AI also has to balance accuracy against workflow efficiency. It may be technically most accurate for a model to report every potential abnormality with an associated level of certainty, but this can inundate the radiologist with spurious findings that must be reviewed and rejected, slowing her down without adding clinical value. More output is not always better.
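
A hedged toy sketch of why the operating point matters (synthetic scores and made-up numbers, purely to show the trade-off):

```python
# Choosing a decision threshold: "report everything" vs. a workflow-aware
# operating point that caps the false positives a radiologist must dismiss.
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, 10000)                              # 1 = abnormal
scores = np.clip(y_true * 0.3 + rng.normal(0.35, 0.2, 10000), 0, 1)

# "Report everything": near-perfect sensitivity, floods of spurious findings.
t_all = 0.05
flags = scores >= t_all
print(f"threshold {t_all}: {flags.sum()} of {len(scores)} studies flagged, "
      f"{(flags & (y_true == 0)).sum()} false positives to review and reject")

# Workflow-aware: cap the false-positive rate at 5% and accept the
# corresponding sensitivity.
fpr, tpr, thresholds = roc_curve(y_true, scores)
i = np.searchsorted(fpr, 0.05)
print(f"threshold {thresholds[i]:.2f}: sensitivity={tpr[i]:.2f}, FPR={fpr[i]:.2f}")
```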

For the model to have high enough certainty to strike a useful balance of sensitivity and specificity, a great many training examples are needed, and with some rarer entities that is difficult. Rarity also inherently reduces the value of the model if it is only expected to identify its target disease 3 times/year.
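
Basic Bayes arithmetic shows why rarity hurts. As a back-of-the-envelope illustration (the 95%/95% operating point is an assumption, not any particular product’s performance): at very low prevalence, even a strong model’s positive calls are mostly false alarms.

```python
# Positive predictive value (PPV) as prevalence falls, at fixed
# sensitivity and specificity.
def ppv(sensitivity, specificity, prevalence):
    tp = sensitivity * prevalence                  # true positive rate in population
    fp = (1 - specificity) * (1 - prevalence)      # false positive rate in population
    return tp / (tp + fp)

for prev in (0.10, 0.01, 0.0001):  # common finding vs. "3 cases/year" rarity
    print(f"prevalence={prev:.4f}: PPV={ppv(0.95, 0.95, prev):.3f}")
# prevalence=0.1000: PPV=0.679
# prevalence=0.0100: PPV=0.161
# prevalence=0.0001: PPV=0.002
```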

That’s not to say advances in AI won’t overcome these problems, just that they haven’t, yet.