If the current state is anything to go by, an automated test would not only flag your out-of-distribution results but also try to gaslight everyone reading its output, adding false indicators to map you into an area that's in distribution. Statistical models cannot accept the existence of extremely rare edge cases.

Modern LLMs routinely beat human doctors at diagnosing "extremely rare edge cases".

They have unmatched breadth of knowledge by default, and can maintain attention across entire medical histories.

Citation very much needed.

https://www.reddit.com/r/ChatGPT/comments/1iz4iwm/chatgpt_is...

or

https://www.reddit.com/r/ChatGPT/comments/1oesnix/chatgpt_di...

or if you prefer from this site,

https://news.ycombinator.com/item?id=43171639

and

https://news.ycombinator.com/item?id=42999632

If you were looking for a published paper or something more official though, I don't have one.

Maybe something that isn't completely censored anecdata? At best these fall into "well-known diseases with obvious symptoms that overworked, incompetent, or simply sexist human doctors missed", not actual rare cases.

> Modern LLMs routinely beat human doctors at diagnosing "extremely rare edge cases".

There is a selection bias here. Not saying it wouldn’t work, but right now you only hear about the exceptional successes, not the times the LLM wants to amputate for a wart.

We all work with LLMs, right? It hasn’t been long at all since an LLM gaslit me while I was trying to recover an unbootable laptop. It should have recommended a few simple steps to try; instead, it was unable to ignore the irrelevant details and led me on an hours-long chase. To me that suggests an LLM will also struggle to ignore irrelevant medical information.