When an LLM must run the interview itself: an exam-inspired benchmark shows interactive diagnostic reasoning degrades performance (Zhan & Gan 2026, arXiv)
Chen Zhan, Xihe Qiu, Xiaoyu Tan, Xibing Zhuang, Gengchen Ma, Yue Zhang, Shuo Li, Peifeng Liu, Xiaoxiao Ge, Liang Liu and Lu Gan (with supervision, funding and revision attributed to Xihe Qiu, Xiaoxiao Ge, Liang Liu and Lu Gan) posted an "OSCE-inspired" benchmark on arXiv on 21 May 2026. It pairs a standardized patient simulator with fifteen large language models (LLMs: models trained to predict text, used here as clinical-reasoning assistants) that must, like a medical student in a clinical exam, run the interview themselves before making a diagnosis. Across 468 cases, this interactive mode, in which the model asks the questions itself turn by turn, lowers diagnostic accuracy by 12.75% and supporting-evidence quality by 24.36% relative to the setting where all information is supplied upfront, with errors driven mainly by premature diagnostic closure and inefficient questioning. The lesson is sober and useful: rankings on static medical multiple-choice exams likely overstate what these models can do in a real consultation. Three caveats accompany the finding: a patient simulator that is itself algorithmic, a case provenance not specified in the accessible version, and figures reported as relative values without a human reference.
The context
For four years, LLM performance in medicine has been measured mostly on written exams: USMLE-style multiple-choice questions (the US medical licensing exam), datasets such as MedQA, closed clinical vignettes. On these tests the best models now exceed the human passing threshold, fueling a wave of "AI doctor" announcements. But these tests share one trait: all the useful information is laid out upfront in the prompt. The model receives age, history, symptoms, lab results, then picks an answer. A real consultation does not work this way: the clinician starts from a vague complaint, must decide which questions to ask, which tests to order, when to stop — sequential reasoning under uncertainty, where the art lies as much in seeking information as in processing it.
This is the gap the paper targets. It belongs to a recent line of work on interactive evaluation of clinical models, simulating a patient-doctor dialogue rather than a multiple-choice question. The claimed novelty is an OSCE-inspired framework: the OSCE (Objective Structured Clinical Examination) is the test in which a medical student faces a "standardized patient" — an actor trained to play a case — and is graded on history-taking, examination and reasoning. By transposing this format to LLMs, the authors aim to measure not what the model knows when handed everything, but what it manages to uncover when it must ask the right questions.
The method
The preprint (arXiv:2605.22047, 10.48550/arXiv.2605.22047), posted on 21 May 2026 under a CC BY 4.0 license (reuse and adaptation allowed with attribution — a favorable point we return to), builds two components. First a standardized patient simulator: an agent that plays the patient, answers the tested model's questions, and reveals information only as it is requested. Second an active diagnostic inquiry protocol, controlled and reproducible, in which the LLM runs a multi-turn dialogue and then states a diagnosis. The authors' precise affiliations, the exact engine behind the simulator, and the named list of the fifteen models do not appear in the accessible abstract; we will not invent them and flag these as items to verify in the full manuscript.
The benchmark comprises 468 cases and fifteen models, proprietary and open-source. For each case, two settings are compared. In the full-context setting, the whole record is given to the model upfront, as in a classic multiple-choice exam — the idealized upper bound. In the active setting, the model initially sees only a presenting complaint and must query the simulator, turn by turn, to reconstruct the information before concluding. Two quantities are measured: diagnostic accuracy (is the final diagnosis correct?) and supporting-evidence quality (are the elements cited in support of the diagnosis relevant and sufficient?). An error analysis then categorizes the failures.
This dual measure is more demanding than a single score: a model may hit the right diagnosis for the wrong reasons, or by leaning on evidence it did not actually gather. Separating accuracy from reasoning quality is precisely what distinguishes a serious clinical evaluation from an answer-matching contest.
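One way to picture "evidence it did not actually gather" is to compare the findings a model cites with those it elicited during the dialogue. The sketch below is an illustrative assumption, not the paper's scoring method, which the abstract does not specify: the function evidence_report, its three inputs and the precision and recall definitions are all hypothetical.

```python
def evidence_report(cited, gathered, reference):
    """Compare cited evidence with what was gathered and what the case requires.

    cited:     findings the model lists in support of its diagnosis
    gathered:  findings actually revealed during the dialogue
    reference: findings a grader considers necessary for the diagnosis
    """
    cited, gathered, reference = set(cited), set(gathered), set(reference)
    grounded = cited & gathered      # cited and actually elicited
    useful = grounded & reference    # grounded and clinically relevant
    return {
        "ungrounded": sorted(cited - gathered),
        "precision": len(useful) / len(cited) if cited else 0.0,
        "recall": len(useful) / len(reference) if reference else 0.0,
    }


report = evidence_report(
    cited=["fever", "neck_stiffness", "photophobia"],
    gathered=["fever"],
    reference=["fever", "neck_stiffness"],
)
print(report)
```

Here the model cites three findings but only asked about one, so two of its supporting items are flagged as ungrounded even if the final diagnosis happens to be correct.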
The results
The central result is a clear gap between the two settings. Moving from full context to active inquiry, diagnostic accuracy falls by 12.75% and supporting-evidence quality falls by 24.36% (values reported relative to the full-context setting). In other words, the drop hits reasoning even more than the verdict: not only do the models get the diagnosis wrong more often, but above all they justify their proposed diagnosis far less well. The error analysis attributes these drops to two behaviors: premature diagnostic closure, where the model settles on a hypothesis too early, before gathering enough to confirm or rule it out, and inefficient questioning, where it asks low-information questions or fails to ask decisive ones. Notably, both failure patterns are well described in novice human clinicians; the LLMs reproduce them.
Clinical translation. Since this is a benchmark and not a patient trial, the translation is about interpretation rather than a count of lives. The idea to keep: across a set of consultations where the model must gather the history itself, roughly one correct answer in eight (in relative terms) is lost compared with the ideal case where it is handed the full record, and nearly a quarter of the quality of the justifying reasoning evaporates. For a tool meant to assist a physician in a real exchange, this is not a detail: the performance shown on written exams describes the upper bound of a well-fed model, not its behavior when it must conduct the interview. These figures remain relative averages, however: without the absolute values, the spread across models, or confidence intervals in the abstract, they signal a trend whose robustness cannot yet be judged, not a risk measure transposable as-is to a given patient.
What works well
The evaluation targets the right problem. The main weakness of current rankings is that they test knowledge delivered ready-made, not the ability to investigate. By adopting an OSCE format — taking the history from a standardized patient before concluding — the paper measures a skill that truly matters in the clinic and that multiple-choice exams ignore. This is exactly the kind of methodological guardrail missing from the "AI passes the medical exam" literature.
The dual metric separates verdict from reasoning. Measuring both diagnostic accuracy and supporting-evidence quality, then categorizing errors (premature closure, inefficient questioning), yields a diagnosis of the models, not just a grade. That evidence quality falls more (−24.36%) than accuracy (−12.75%) is a valuable observation: it suggests some "correct" diagnoses in active mode are reached without solid reasoning, which a plain success rate would have hidden.
Scale, reproducibility and an open license. Fifteen models, proprietary and open, across 468 cases, in a protocol described as controlled and reproducible: broad enough that the trend does not hinge on one model or a handful of cases. And release under CC BY 4.0 — which permits reuse and adaptation with attribution — makes it easy for other teams to take up the benchmark, unlike the non-commercial, no-derivatives licenses that lock up part of the literature.
What works less well
The patient is simulated, and the simulator may itself be a model. The realism of the test depends entirely on the quality of the standardized patient. If it is driven by an LLM, the evaluation becomes partly circular: one model interrogates another, and the two may share the same blind spots (same training data, same phrasings). This is a variant of the population bias failure mode applied to evaluation: a simulated patient is not a real one, with messy narratives, omissions, comorbidities and ambiguous wording. External validity (would the performance transfer to real interviews?) therefore remains to be established, and the abstract announces no validation on authentic clinical dialogues.
The provenance of the 468 cases is unspecified, hence a contamination risk. If these cases derive from public collections (vignettes, case banks, open medical datasets), the fifteen models may have seen them during training. This is the data leakage failure mode transposed to LLMs, known as data contamination: the "full-context" upper bound would then be artificially inflated by memorization, mechanically exaggerating the gap with active mode. Until the origin of the cases and the contamination controls are documented in the full text, the 12.75% figure should be read as a difference between two settings, not a pure measure of how hard it is to investigate.
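A common first-pass contamination check is to measure verbatim n-gram overlap between each benchmark case and a public corpus the models could plausibly have seen. The sketch below shows the idea only. The function names, the whitespace tokenization and the default window of eight tokens are illustrative assumptions, and nothing indicates the authors ran this or any similar control. It also cannot see into the undisclosed training data of proprietary models.

```python
def ngrams(text, n=8):
    """Return the set of lowercase word n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(case_text, corpus_text, n=8):
    """Share of the case's n-grams that also appear verbatim in the corpus."""
    case_grams = ngrams(case_text, n)
    if not case_grams:
        return 0.0  # case shorter than n tokens: nothing to compare
    return len(case_grams & ngrams(corpus_text, n)) / len(case_grams)


# Toy strings standing in for a benchmark case and two public sources.
case_text = "a 34 year old woman presents with fever and neck stiffness"
leaked = "Vignette 12: A 34 year old woman presents with fever and neck stiffness today"
unrelated = "a 70 year old man reports chest pain on exertion for two weeks"

print(overlap_ratio(case_text, leaked, n=5))
print(overlap_ratio(case_text, unrelated, n=5))
```

A high ratio flags a case for manual review. A low ratio does not prove the case is unseen, since paraphrased or translated copies escape exact matching.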
Relative percentages, with no human comparator or absolute values. The abstract gives relative drops (−12.75%, −24.36%) without the baseline absolute accuracy, the spread across models, or confidence intervals. This is a cousin of the misleading metric: an impressive relative drop can hide very different realities depending on the baseline level. Above all, a human comparator under the same protocol is missing: how many correct answers does a physician also lose between a complete record and an interview to be conducted? Without that reference, we know LLMs degrade in interactive mode, but not whether they degrade more or less than a clinician — and it is that comparison that would decide their usefulness as an assistant.
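A short calculation shows why the missing baseline matters. The baselines below are invented for illustration and do not come from the paper. The sketch also assumes that "relative" means a proportional drop from the full-context score, which is how the abstract's wording is read here.

```python
def active_score(baseline, relative_drop):
    """Score in active mode, given a full-context baseline and a relative drop."""
    return baseline * (1 - relative_drop)


RELATIVE_DROP = 0.1275  # accuracy drop reported in the abstract

# Hypothetical full-context accuracies, chosen only to show the spread.
for baseline in (0.90, 0.60, 0.40):
    active = active_score(baseline, RELATIVE_DROP)
    lost_points = (baseline - active) * 100
    print(f"baseline {baseline:.0%} -> active {active:.1%} "
          f"({lost_points:.1f} percentage points lost)")
```

The same 12.75% relative drop costs more than eleven percentage points from a 90% baseline but about five from a 40% baseline, and the resulting active-mode accuracies differ widely. That is why the relative figure alone says little about how usable a given model is.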
What this changes
For the research community, the message is a call to change the unit of measurement. As long as clinical models are ranked on static multiple-choice exams, the progress on display risks overstating real aptitude. This kind of interactive benchmark — and, better, its open release under CC BY 4.0 — provides a complement other teams can take up, extend to real dialogues, and harden against contamination. The natural next step is a version with real patients or authentic transcripts, and a human comparison arm.
For clinicians, it is a useful confirmation of bedside intuition: a tool that answers a complete vignette brilliantly is not thereby a good interview partner. The premature diagnostic closure and inefficient questioning the models display are exactly the traps residents are taught to avoid. Concretely, none of these systems is today approved as a medical device (no CE marking, no FDA clearance, no favorable opinion from France's Haute Autorité de Santé) to conduct a history independently, and this paper explains why caution remains warranted.
For patients and the public, the lesson is direct: a conversational agent that seems to "know medicine" when you describe everything at once may err more when it must, like a real caregiver, ask the right questions at the right time. Consumer "symptom checker" tools built on LLMs inherit this limit. They can inform and orient, but do not replace the clinical interview — and the diagnostic decision remains a professional's responsibility.
Further reading
The preprint is openly available on arXiv: arxiv.org/abs/2605.22047 (DOI 10.48550/arXiv.2605.22047), under a CC BY 4.0 license. On the limits of LLMs in clinical safety, see our analysis of the Auger 2026 study on an LLM's clinical safety frontier in multiple sclerosis. On how the format of an LLM's imaging answer can fool evaluation, see our analysis of Spitzer 2026 on the effect of explanation format in radiology.