Medical AI

GPT-4 in radiology: why the format of an LLM's explanation changes physicians' diagnostic accuracy

Published on May 22, 2026 · 8 min read

On 23 April 2026, Philipp Spitzer and colleagues published in npj Digital Medicine a randomized trial comparing how three GPT-4 output formats affect the diagnostic accuracy of 101 US radiologists, each reviewing 20 cases (2,020 assessments). Chain-of-thought explanations improve accuracy by 12.2 percentage points versus control (p = 0.001), while the differential diagnosis format, intuitively the most medical, adds nothing significant and induces marked automation bias when the model is wrong. It is an important read because it shifts the question: it is no longer just "is the LLM good?" but "how do we let it share what it knows without imposing its mistakes?"

The context

Large language models (LLMs, generative language models trained on massive text corpora) reached high diagnostic performance on radiology cases in 2024-2025. GPT-4 (OpenAI's model released in March 2023, with image input rolled out later that year), Med-PaLM 2, Claude and their successors now routinely reach 70-80% accuracy or more on public benchmarks such as the NEJM Image Challenge or MedQA. The research question has shifted: it is no longer "do these models work?" but "how do we embed them in a workflow where they truly complement the radiologist rather than replace or mislead them?"

Several studies in 2024-2025 have begun documenting a counter-intuitive phenomenon: a standalone LLM can be more accurate than a radiologist + LLM pair, because the physician over-trusts the model when it is wrong (automation bias) or dismisses its good suggestions without understanding them. Medical decision support is not a new topic: expert systems of the 1970s-80s such as MYCIN or INTERNIST hit this same wall. A system that delivers an answer without saying why leaves the clinician unable to judge when to trust it. LLMs bring a major technical novelty here: they can generate a natural-language explanation alongside their prediction, in several different formats. But which format? No large-scale randomized trial had compared these formats until now.

The method

The study is led by Philipp Spitzer and Daniel Hendriks (co-first authors), in collaboration with a clinical team from the LMU Munich radiology department (Jan Rudolph, Sarah Schlaeger, Jens Ricke, Boj Friedrich Hoppe) and Stefan Feuerriegel (LMU). It was publicly pre-registered on AsPredicted (reference 4tgb-sr3z) and has ethics approval (LMU EK-MIS-2024-320).

The design is a parallel-group, between-subjects randomized trial. 101 board-certified US radiologists, with a mean of 13.6 years of experience (SD 8.0), are randomly assigned to one of four arms. Each radiologist then evaluates 20 radiology cases drawn from the NEJM Image Challenge, each presented as an image plus a short clinical vignette. Diagnoses are captured as free text (no multiple choice), then manually coded to account for typos. Total: 2,020 assessments.

The four arms are as follows. Control (n = 24): no LLM support, internet search allowed but no LLM use. Standard output (n = 24): GPT-4 provides a diagnosis without explanation ("the most likely diagnosis is X"), mean length 62.7 words. Differential diagnosis (n = 30): GPT-4 provides the top five hypotheses, ranked, with a short justification for each, mean length 208.6 words. Chain-of-thought (n = 23): GPT-4 provides its step-by-step reasoning before the final diagnosis, mean length 188.6 words.
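As a quick consistency check on the design, the arm sizes quoted above can be summed and multiplied by the 20 cases per reader. This is only a sketch: the numbers come from the article, while the dictionary layout and variable names are illustrative.

```python
# Arm sizes as reported in the article; the names and layout are illustrative.
ARMS = {
    "control": 24,
    "standard_output": 24,
    "differential_diagnosis": 30,
    "chain_of_thought": 23,
}
CASES_PER_RADIOLOGIST = 20

total_radiologists = sum(ARMS.values())
total_assessments = total_radiologists * CASES_PER_RADIOLOGIST
# Assessments contributed by each arm (readers in the arm x 20 cases).
assessments_per_arm = {arm: n * CASES_PER_RADIOLOGIST for arm, n in ARMS.items()}

print(total_radiologists)  # 101
print(total_assessments)   # 2020
print(assessments_per_arm)
```

The smallest arm (chain-of-thought) contributes 460 assessments, all produced by only 23 readers, which matters for the limitations discussed later.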

The term chain-of-thought (CoT) refers to a prompting technique in which the model is explicitly asked to break down its reasoning into steps before answering. Documented since 2022 in general-purpose LLMs (Wei et al.), it improves performance on reasoning tasks and — the central point of this paper — the readability of the reasoning to a human user.
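To make the difference between the three output formats concrete, here is a minimal sketch of how one case could be turned into three different prompts. The instruction wording, the `INSTRUCTIONS` table and the `build_prompt` helper are all hypothetical: the article does not reproduce the study's actual prompts.

```python
# Hypothetical prompt templates: this wording is NOT taken from the study.
INSTRUCTIONS = {
    "standard_output": "State the single most likely diagnosis. Do not explain.",
    "differential_diagnosis": (
        "List the five most likely diagnoses, ranked from most to least likely, "
        "with a one-sentence justification for each."
    ),
    "chain_of_thought": (
        "Reason step by step: describe the key image findings, relate them to "
        "the clinical vignette, then state the final diagnosis."
    ),
}


def build_prompt(output_format: str, vignette: str) -> str:
    """Combine a case vignette with the instruction for one output format."""
    if output_format not in INSTRUCTIONS:
        raise ValueError(f"unknown format: {output_format!r}")
    return f"Clinical vignette: {vignette}\n\n{INSTRUCTIONS[output_format]}"


# Invented vignette, for illustration only.
print(build_prompt("chain_of_thought", "54-year-old with chronic cough."))
```

The point of the sketch is that the case and the model stay the same across arms; only the requested output format changes.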

The model used is GPT-4 in its multimodal version (capable of processing image plus text). GPT-4 alone scores 75% on these 20 cases in standard output, 80% in chain-of-thought, and 65% top-1 / 80% top-5 in differential diagnosis.

The results

The main finding is that the effect of LLM assistance varies markedly with the explanation format.

The chain-of-thought format significantly improves radiologists' accuracy: +12.2 percentage points versus control (95% CI: 5.3 to 19.2; p = 0.001). It is the strongest effect observed in the study.

The standard output and differential diagnosis formats deliver nothing statistically significant versus control: respectively +5.0 pp (95% CI: -1.8 to 11.8; p = 0.150) and +2.5 pp (95% CI: -4.0 to 9.0; p = 0.446). Counter-intuitive: the differential diagnosis, despite being close to traditional medical reasoning, is the least useful.

Compared directly with the other formats, chain-of-thought remains on top: +7.2 pp vs standard output (p = 0.040) and +9.7 pp vs differential diagnosis (p = 0.004). GPT-4 alone outperforms all groups of radiologists, including those assisted by GPT-4 under any format. This result must be read carefully (see limitations), but it is consistent with a growing body of 2024-2025 literature.
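The reported intervals and p-values can be sanity-checked against each other. The sketch below recovers a standard error from each 95% CI and derives a two-sided p-value under a plain normal approximation. That approximation is an assumption made here, not the study's actual model, so small deviations from the reported p-values are expected.

```python
import math

# (effect in pp, CI lower, CI upper, reported p) as quoted in the article.
REPORTED = {
    "chain_of_thought": (12.2, 5.3, 19.2, 0.001),
    "standard_output": (5.0, -1.8, 11.8, 0.150),
    "differential_diagnosis": (2.5, -4.0, 9.0, 0.446),
}


def approx_p_from_ci(effect, lower, upper, z_crit=1.96):
    """Two-sided p-value under a normal approximation (an assumption)."""
    se = (upper - lower) / (2 * z_crit)  # half-width of the CI divided by 1.96
    z = effect / se
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))


approx_p = {
    arm: approx_p_from_ci(effect, lower, upper)
    for arm, (effect, lower, upper, _) in REPORTED.items()
}

# The direct comparisons are the differences between the point estimates.
cot_effect = REPORTED["chain_of_thought"][0]
gap_vs_standard = round(cot_effect - REPORTED["standard_output"][0], 1)
gap_vs_differential = round(cot_effect - REPORTED["differential_diagnosis"][0], 1)

for arm, p in approx_p.items():
    print(f"{arm}: approx p = {p:.4f}, reported p = {REPORTED[arm][3]}")
print(gap_vs_standard, gap_vs_differential)  # 7.2 9.7
```

Under this approximation the three p-values land close to the reported ones, and the 7.2 and 9.7 pp gaps follow directly from the effects versus control.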

Adherence to the LLM's suggestions is revealing. When GPT-4 is wrong, radiologists in the differential diagnosis arm still adopt its diagnosis 80% of the time; those in the standard output arm, 30.6%; those in the chain-of-thought arm, 30.4%. This gap suggests a plausible mechanism: a structured five-hypothesis differential carries an appearance of methodological exhaustiveness that disarms the radiologist's critical judgment. This is the classic failure mode of automation bias (the documented human tendency to over-trust automated systems, especially when those systems look rigorous).

Clinical translation. On 1,000 radiology cases of comparable difficulty, an unassisted radiologist would correctly solve about 600. The same radiologist assisted by GPT-4 with chain-of-thought would solve about 722, while standard output or differential diagnosis would bring the total to only about 650 or 625 respectively, gains that are not statistically distinguishable from no assistance. But when the LLM is wrong (about 25% of the time on this benchmark), radiologists in the differential diagnosis arm adopt the error 80% of the time, roughly 2.6 times the rate seen with the other two formats (about 30%).
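The per-1,000-case arithmetic can be reproduced from the figures reported in the results. In this sketch the baseline of 600 correct cases is the article's approximate figure, and applying a single 25% LLM error rate to all three formats is a simplifying assumption (GPT-4's accuracy differs slightly by format).

```python
CASES = 1000
BASELINE_CORRECT = 600  # article's approximate unassisted figure
EFFECT_PP = {  # gain versus control, in percentage points
    "chain_of_thought": 12.2,
    "standard_output": 5.0,
    "differential_diagnosis": 2.5,
}
LLM_ERROR_RATE = 0.25  # "about 25%", applied to every format (assumption)
ADOPTION_WHEN_WRONG = {  # share of wrong LLM answers adopted by readers
    "differential_diagnosis": 0.80,
    "standard_output": 0.306,
    "chain_of_thought": 0.304,
}

# Expected number of correctly solved cases per 1,000.
solved = {
    arm: round(BASELINE_CORRECT + CASES * pp / 100)
    for arm, pp in EFFECT_PP.items()
}
# Expected number of LLM errors adopted by the reader per 1,000 cases.
adopted_errors = {
    arm: round(CASES * LLM_ERROR_RATE * rate, 1)
    for arm, rate in ADOPTION_WHEN_WRONG.items()
}
ratio = (
    ADOPTION_WHEN_WRONG["differential_diagnosis"]
    / ADOPTION_WHEN_WRONG["standard_output"]
)

print(solved)
print(adopted_errors)
print(round(ratio, 2))
```

With these inputs, the differential diagnosis format yields about 200 adopted errors per 1,000 cases, against roughly 76 for each of the other two formats.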

What is good

Three specific strengths.

The pre-registered randomized design. The study was publicly pre-registered on AsPredicted before data collection, which sharply limits the room for p-hacking and post-hoc selection of favorable analyses. This methodological requirement is still far from systematic in the clinical-LLM literature and deserves recognition: most clinical model evaluations remain retrospective and pick their metrics after seeing the data.

The comparator is fair. The control group is not deprived of everything: they have access to the internet, PubMed, any documentation that is not an LLM. This is the right comparator — the 2026 radiologist in real practice. LLM-versus-nothing-at-all comparisons, common in prior literature, systematically overstated LLM contribution by stripping physicians of their usual resources.

The sample size is credible. 101 board-certified radiologists with a mean of 13.6 years of experience and 2,020 assessments (20 per radiologist, so clustered rather than fully independent) make up a sample comparable to large radiology decision-support trials. Statistical power to detect an effect of about 12 pp appears adequate, as the confidence interval for chain-of-thought excludes zero. It is also one of the few studies in the field to recruit senior radiologists rather than residents.

What is less good

Three precise limitations to keep in mind.

This is a vignette study, not a real clinical workflow. Radiologists respond to 20 isolated cases, with minimal context, no full patient chart, no sequence of comparable cases on the same day, no realistic time pressure. Ecological validity is limited — a radiologist reading 80 scans on overnight call does not look like a radiologist answering 20 vignettes at their own pace from their desk. The authors acknowledge this and call for real-world studies. Any extrapolation to patient outcomes (mortality, morbidity, avoided exams) is still to be done.

Probable GPT-4 contamination. The cases come from the NEJM Image Challenge, which is public and dates back years. GPT-4 has very likely seen these cases and their solutions during training. The authors report a memorization test and conclude that similarity scores are low, but the concern remains: this is the LLM version of data leakage (test items present in the training data), which no simple similarity test can fully rule out. GPT-4's absolute score (75-80%) must therefore be read with this caveat in mind, and performance in the clinic on truly unseen cases will likely be lower.

The between-subjects design weakens inter-arm comparisons. Since each radiologist sees only one format, observed differences between arms can partly reflect differences between radiologists rather than between formats, especially at 23-30 per arm. A within-subjects design (each radiologist tests every format on comparable cases) would be much more powerful and is explicitly suggested by the authors as a follow-up. With 23 subjects in the chain-of-thought arm, a single particularly skilled radiologist shifts the arm's mean non-trivially. Randomization guards against systematic bias, but with arms this small it cannot rule out chance imbalance in reader skill.
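The sensitivity of a small arm to one reader can be put in numbers. The sketch below computes how far an arm's mean accuracy moves when one average reader is replaced by an unusually strong one. The 72% arm mean and the 95% outlier are hypothetical values chosen for illustration; only the arm sizes come from the study.

```python
def mean_shift_pp(arm_size: int, arm_mean: float, outlier_accuracy: float) -> float:
    """Shift of the arm mean, in percentage points, when one reader at the
    arm mean is replaced by a reader with `outlier_accuracy`."""
    return 100 * (outlier_accuracy - arm_mean) / arm_size


# Hypothetical accuracies: 72% arm mean, one reader at 95%.
shifts = {n: round(mean_shift_pp(n, 0.72, 0.95), 2) for n in (23, 30)}
print(shifts)
```

Under these assumptions, one strong reader moves the mean by about 1 pp in a 23-reader arm and about 0.77 pp in a 30-reader arm. That is small next to the 12.2 pp headline effect, but not negligible next to the 7.2 pp gap between chain-of-thought and standard output.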

Additional notes: the study has a single measurement time point and no longitudinal follow-up, and funding and competing-interest information was not accessible in the pre-publication version consulted.

What it changes

For the AI-health research community, the signal is clear: explanation format is not a UX detail, it is a major determinant of human-AI pair performance. The clinical-LLM literature has overwhelmingly focused on the model's raw score ("does the AI beat the physician?") while neglecting that in practice the physician will remain in charge and that their accuracy will depend on how the model expresses itself. Future evaluations should systematically compare multiple explanation formats, the way clinical trials compare drug doses. This is a new evaluation dimension to integrate into emerging guidelines such as TRIPOD-LLM or CLAIM.

For clinicians, the message is both encouraging and worrying. Encouraging: a well-chosen explanation format can add 12 percentage points of diagnostic accuracy, which is clinically substantial in a domain where every point counts. Worrying: the intuitively "medical" format (differential diagnosis) is precisely the one inducing the most dangerous over-confidence when the model is wrong. Any deployment of a clinical LLM will need to be validated in real conditions for its specific format, not only for its raw performance. Chain-of-thought is not a universal recipe: it worked here, in this context, with this model.

For patients and the public, the takeaway is more subtle. AI in radiology is neither the magical revolution of the press releases nor the placebo feared by skeptics. It is a technology that can help, that can harm, and whose real impact depends on interface choices that most commercial vendors do not document. Asking your hospital which model is used, in which format, and with which local validation becomes a legitimate question.