I don't understand this reasoning. Randomizing people to AI vs standard of care is expensive and risky. Checking whether the AI can pass hypothetical scenarios seems like a perfectly reasonable approach to researching the safety of these models before running a clinical trial.
The issue is that those hypothetical scenarios do not have to look like how patients actually interact with the tool.
Real-life use is full of ill-posed questions, open-ended statements, inaccurate assessments of symptoms, and conclusory remarks sprinkled in between. Real use of chatbots for health by non-clinicians looks very different from scenario-based evaluation.
You could give those same hypothetical scenarios to doctors too, and then have the results analyzed by doctors blinded to whether each response came from an AI or a physician.
From the paper:
> Three physicians independently assigned gold-standard triage levels based on cited clinical guidelines and clinical expertise, with high inter-rater agreement
You can start by comparing "doctor" care vs "doctor who also uses AI" care