Medical AI

When text eats the image: what the Restrepo 2026 study reveals about the contextual fragility of clinical VLMs on MIMIC-CXR

Published on May 25, 2026 · 9 min read

David Restrepo (CentraleSupélec-Université Paris-Saclay and IHU PRISM, Gustave Roussy) and his team posted on arXiv on 17 May 2026 an evaluation of eight clinical vision-language models on 1,000 chest X-rays drawn from MIMIC-CXR. The result is uncomfortable: when the clinical text given to the model contradicts the image (a healthy patient's report attached to a pathological radiograph, or vice versa), between 31% and 66% of the initially correct decisions flip into errors. By contrast, swapping the image for that of another patient changes almost nothing. Image-only input barely beats chance (0.50–0.68 accuracy), while text-only input matches multimodal performance. The central conclusion is blunt: these VLMs, including the frontier models GPT-5 and Gemini 3 Pro and the medically adapted MedGemma variants, function essentially as report classifiers, with the image as backdrop. The paper matters for two reasons: it argues against deploying these models as autonomous readers, and it offers a reusable stress-test methodology.

The context

Vision-language models (VLMs) are generative models that take an image input and a text input and produce a text output. In the simplest version, one shows them a radiograph and asks "does this image show a pathology?". In the clinical version, the input is enriched with elements of the patient record (reason for exam, history, prior reports), which comes closer to a radiologist's reading conditions. The marketing promise since 2024, backed by the announcements of GPT-4V, MedGemma and Med-PaLM-M, is that a well-trained VLM can integrate both sources and reason clinically like a human.

Several recent works (Sim et al. ACL 2025, Deng et al. CVPR 2025 "Words or vision: Do VLMs have blind faith in text?") have already suggested that generalist VLMs give excessive weight to text in multimodal reasoning. But those studies stayed on non-clinical tasks. This paper carries the critique into chest radiology, and adds two dimensions absent from standard evaluations: robustness to prior reports unrelated to the question, and stability under semantically equivalent reformulations of the prompt. These are precisely the two variables that a RAG system (retrieval-augmented generation, which automatically injects retrieved documents into the prompt) or a clinical agent (an LLM orchestrating a cascade of tools) will vary in practice without a clinician being able to control them.

The method

The study is led by David Restrepo (MICS team, CentraleSupélec-Université Paris-Saclay, and the Cancer Data Science Unit of IHU PRISM at Gustave Roussy), with Ira Ktena (Ellison Institute of Technology, Oxford), Maria Vakalopoulou and Stergios Christodoulidis (CentraleSupélec), and Enzo Ferrante (CONICET, Buenos Aires). arXiv preprint 2605.17436 posted on 17 May 2026, DOI 10.48550/arXiv.2605.17436, under CC BY 4.0. Code and evaluation scripts on GitHub. Public funding: EU Marie Skłodowska-Curie COFUND programme (DeMythif.AI, n° 101127936) and France 2030 / ANR IA Cluster DATAIA (ANR-23-IACL-0003). Compute on Jean Zay (IDRIS-CNRS) and Ruche (Mesocentre Paris-Saclay). No commercial conflict of interest declared.

The dataset is a balanced subset of MIMIC-CXR-JPG (PhysioNet): 1,000 frontal chest radiographs, 500 normal (label No Finding) and 500 with a single pathology among five CheXpert targets (pleural effusion 30.2%, atelectasis 25.6%, cardiomegaly 21.8%, edema 18.8%, consolidation 3.6%). Cases with multiple co-occurring pathologies are excluded to avoid label ambiguity.

Eight models are tested: four generalist open-weights VLMs (Qwen2-VL-7B-Instruct, LLaVA-v1.5-7B, Janus-Pro-7B, Llama-3.2-11B-Vision-Instruct), two open medically adapted models (MedGemma-4B and MedGemma-1.5-4B), and two frontier proprietary models (GPT-5 snapshot of 7 August 2025 and Gemini 3 Pro). Deterministic inference (temperature 0) for the open models, binary "Yes/No" output forced by system prompt.

Three perturbation protocols.

First protocol: Selective Modality Shifting (SMS). One modality of each image-text pair is kept as is and the other is replaced by the corresponding input from an opposite-class patient. Three multimodal conditions are compared: No Shift (image + coherent text, the baseline), Text Shift (normal image + pathological patient's text, or vice versa) and Image Shift (coherent text, opposite-class image), alongside two unimodal baselines (Text-Only and Image-Only). The key metric is the Negative Flip Rate (NFR), the proportion of initially correct predictions that flip into errors after perturbation.
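The two shifts and the flip metric can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the `Case` structure and the function names are hypothetical, and it assumes the ground-truth label stays with the original case in both shifts. Since negative flips can be normalised either by the initially correct predictions (the wording used above) or by all samples, the sketch returns both so the convention is explicit.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Case:
    image_id: str  # hypothetical handle to the radiograph
    report: str    # clinical text paired with that image
    label: int     # 1 = abnormal, 0 = normal


def text_shift(case: Case, donor: Case) -> tuple[str, str, int]:
    """Keep the image, borrow the text of an opposite-class donor."""
    if donor.label == case.label:
        raise ValueError("donor must belong to the opposite class")
    return case.image_id, donor.report, case.label


def image_shift(case: Case, donor: Case) -> tuple[str, str, int]:
    """Keep the text, borrow the image of an opposite-class donor.

    Assumption: the label still follows the original case (i.e. the text).
    """
    if donor.label == case.label:
        raise ValueError("donor must belong to the opposite class")
    return donor.image_id, case.report, case.label


def negative_flip_rates(
    labels: Sequence[int], before: Sequence[int], after: Sequence[int]
) -> dict[str, float]:
    """Count predictions that were right before the shift and wrong after."""
    if not labels or not (len(labels) == len(before) == len(after)):
        raise ValueError("inputs must be non-empty and of equal length")
    correct_before = sum(b == y for y, b in zip(labels, before))
    flips = sum(b == y and a != y for y, b, a in zip(labels, before, after))
    return {
        "per_correct": flips / correct_before if correct_before else 0.0,
        "per_sample": flips / len(labels),
    }
```

In use, `before` would hold the No Shift predictions and `after` the predictions on the shifted inputs, both aligned with `labels`.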

Second protocol: irrelevant history injection. Up to five clinically plausible but thematically unrelated prior reports (brain MRI, abdomino-pelvic CT, knee X-ray, wrist ultrasound) are inserted at the head of the prompt, with an adversarial constraint: if the current chest X-ray is pathological, the distractor reports are normal. The reports are generated by GPT-5 with synthetic dates 3 to 12 months in the past.
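A minimal sketch of how such a prompt could be assembled is given below. Everything in it is an assumption made for illustration: the `build_prompt` name, the bracketed section markers, the dictionary layout of the distractors and the 30-day month approximation are not taken from the paper. Only the constraints described above are encoded: at most five prior reports, placed at the head of the prompt, dated 3 to 12 months back, and restricted to normal reports when the current exam is pathological. For a normal current exam the pool is left unfiltered, which is also an assumption.

```python
from __future__ import annotations

import random
from datetime import date, timedelta


def build_prompt(
    question: str,
    current_report: str,
    current_is_abnormal: bool,
    distractors: list[dict],  # hypothetical items: {"text": str, "is_normal": bool}
    n_history: int,
    today: date,
    seed: int = 0,
) -> str:
    """Prepend up to five unrelated prior reports to the current exam."""
    if not 0 <= n_history <= 5:
        raise ValueError("the protocol described uses at most five reports")
    # Adversarial constraint: a pathological exam only gets normal distractors.
    if current_is_abnormal:
        pool = [d for d in distractors if d["is_normal"]]
    else:
        pool = list(distractors)
    if n_history > len(pool):
        raise ValueError("not enough eligible distractor reports")

    rng = random.Random(seed)
    blocks = []
    for report in rng.sample(pool, n_history):
        months_back = rng.randint(3, 12)  # synthetic date, 3 to 12 months ago
        when = today - timedelta(days=30 * months_back)
        blocks.append(f"[Prior report, {when.isoformat()}] {report['text']}")
    blocks.append(f"[Current exam] {current_report}")
    blocks.append(question)
    return "\n\n".join(blocks)
```

Varying `n_history` from 0 to 5 while holding the image and the current report fixed isolates the effect of the added text.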

Third protocol: prompt sensitivity. Four semantically equivalent formulations — standard QA, role-play ("you are a clinical assistant"), formal consult request (RADIOLOGY CHECK REQUEST) and checklist — are tested in parallel, and the agreement between predictions is measured by Fleiss' κ statistic. All 95% confidence intervals are obtained by non-parametric bootstrap (100 iterations, 50% subsampling).
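For readers who want to reproduce these two statistics, here is a plain-Python sketch. The Fleiss' κ function follows the standard textbook formula, treating each prompt formulation as a "rater". The bootstrap helper is an assumption about what "100 iterations, 50% subsampling" means: it draws half-size samples with replacement and reports a percentile interval, which may differ from the authors' exact procedure. Both function names are hypothetical.

```python
from __future__ import annotations

import math
import random
from collections import Counter
from typing import Callable, Sequence


def fleiss_kappa(ratings: Sequence[Sequence[str]]) -> float:
    """ratings[i] holds the answers given to case i by each prompt formulation."""
    if not ratings:
        raise ValueError("at least one case is required")
    n_cases, n_raters = len(ratings), len(ratings[0])
    if n_raters < 2 or any(len(row) != n_raters for row in ratings):
        raise ValueError("every case needs the same number (>= 2) of ratings")
    totals: Counter = Counter()
    mean_agreement = 0.0
    for row in ratings:
        counts = Counter(row)
        totals.update(counts)
        pairs = sum(c * c for c in counts.values()) - n_raters
        mean_agreement += pairs / (n_raters * (n_raters - 1))
    mean_agreement /= n_cases
    chance = sum((c / (n_cases * n_raters)) ** 2 for c in totals.values())
    if math.isclose(chance, 1.0):
        return float("nan")  # only one category ever used: kappa is undefined
    return (mean_agreement - chance) / (1.0 - chance)


def bootstrap_ci(
    items: Sequence,
    statistic: Callable[[list], float],
    n_iter: int = 100,
    fraction: float = 0.5,
    seed: int = 0,
) -> tuple[float, float]:
    """Approximate 95% percentile interval from half-size resamples (assumed scheme)."""
    rng = random.Random(seed)
    size = max(1, int(len(items) * fraction))
    values = []
    for _ in range(n_iter):
        sample = [rng.choice(items) for _ in range(size)]
        value = statistic(sample)
        if not math.isnan(value):  # skip resamples where the statistic is undefined
            values.append(value)
    if not values:
        raise ValueError("statistic was undefined on every resample")
    values.sort()
    low = values[int(0.025 * len(values))]
    high = values[min(len(values) - 1, int(0.975 * len(values)))]
    return low, high
```

Calling `bootstrap_ci(ratings, fleiss_kappa)` would give an interval for κ itself; passing a list of 0/1 correctness flags with a mean function gives one for accuracy.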

The results

The baseline (image + coherent text) sits between 0.66 (Janus-Pro) and 0.83 (GPT-5, Gemini 3 Pro). All models "work" on the clean benchmark.

Under Text Shift, performance collapses. GPT-5 drops from 0.83 to 0.18, Gemini 3 Pro from 0.83 to 0.17, Qwen2-VL from 0.81 to 0.20 and MedGemma-1.5 from 0.79 to 0.26, all well below chance (0.50). The Negative Flip Rate under Text Shift ranges from 31.3% (Janus-Pro) to 66.0% (Gemini 3 Pro): between one third and two thirds of the initially correct decisions flip into errors when an opposite-class text is inserted.
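A back-of-the-envelope consistency check, which is not taken from the paper, helps in reading these figures. Accuracy after a perturbation equals accuracy before, minus the share of cases that flip from right to wrong, plus the share that flip from wrong to right. The hypothetical helper below solves that identity for the wrong-to-right share under either normalisation of the NFR. With Gemini 3 Pro's rounded figures (0.83, 0.17, 66.0%), the per-sample reading gives a share near zero, while the per-correct reading gives a negative share, which cannot happen. The reported values therefore appear to fit a per-sample normalisation, though rounding and the exact definition used by the authors should be checked against the paper.

```python
def implied_positive_flip_share(
    acc_before: float, acc_after: float, nfr: float, nfr_is_per_correct: bool
) -> float:
    """Share of all cases that must flip wrong -> right to reconcile both accuracies.

    Identity used: acc_after = acc_before - negative_share + positive_share,
    where both shares are fractions of the full sample.
    """
    # Convert the NFR to a share of all cases if it was normalised per correct case.
    negative_share = nfr * acc_before if nfr_is_per_correct else nfr
    return acc_after - (acc_before - negative_share)


# Gemini 3 Pro under Text Shift, using the rounded figures quoted above.
per_sample = implied_positive_flip_share(0.83, 0.17, 0.66, nfr_is_per_correct=False)
per_correct = implied_positive_flip_share(0.83, 0.17, 0.66, nfr_is_per_correct=True)
```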

Under Image Shift, by contrast, performance barely budges. GPT-5 0.83 → 0.82; Qwen2-VL 0.81 → 0.80; MedGemma 0.76 → 0.72. NFR under Image Shift stays between 2.0% and 15.5%. The models barely register that the image no longer matches the text. This asymmetry is the paper's pivotal result.

Unimodal baselines confirm it. Text-only accuracy reaches 0.78–0.83 for most models, on par with the multimodal setting. Image-only accuracy ranges from 0.50 to 0.68: GPT-5 and Gemini 3 Pro achieve 0.67–0.68, modestly above chance, while Qwen2-VL and LLaVA sit exactly at 0.50. The authors summarise: "VLM decisions are dominated by the text modality, even when visual evidence is available." Asking the model, via a role-play prompt, to prioritise the image produced no significant effect.

Injecting irrelevant prior reports also degrades performance. LLaVA-1.5 drops from 0.79 to 0.66 with five distractor reports, Janus-Pro from 0.70 to 0.53, MedGemma-1.5 from 0.85 to 0.71. NFR reaches 21.1% for Janus-Pro and 18.8% for MedGemma-1.5 — nearly a fifth of the correct predictions flip. Frontier GPT-5 and Gemini 3 Pro hold up better (NFR < 3%) but are not immune. The failure mode to flag here is distraction by irrelevant information, in this case within the text modality itself.

Prompt sensitivity varies considerably across models. In the modality-shifting setting, Qwen2-VL preserves strong agreement across formulations (Fleiss' κ = 0.802), followed by Gemini 3 Pro (0.762) and GPT-5 (0.753), but Janus-Pro collapses to 0.046 (agreement essentially at chance level) and LLaVA-1.5 reaches only 0.391. A reformulation that does not change the clinical meaning can therefore reverse the prediction.

Clinical translation. If a radiology service used one of the VLMs tested here to pre-triage 1,000 chest X-rays with a mistaken reason-for-exam (a banal situation on call, where the order ticket may be copied from the previous exam), one would observe between 313 and 660 decisions flipping from correct to wrong out of 1,000, depending on the model. If a RAG system defaulted to injecting the patient's last five reports (common practice for clinical agents), from under 3% (GPT-5, Gemini 3 Pro) to 21% (Janus-Pro) of correct predictions would flip into errors, with no human in the loop able to identify the cause: the error comes neither from the image nor from the diagnosis of that image, but from an off-topic text added to the context.

What's good

Three specific strengths.

The stress-test protocol is reproducible and portable to other modalities. The code is on GitHub under a permissive licence and Selective Modality Shifting is fully described. Any lab can re-run the same protocol on its own data or on a new model. This is a methodological contribution at least as important as the raw numbers — the community needed a standard grid to test what clean benchmarks do not.

The model panel is broad and balanced: four generalist open-weights VLMs, two medically adapted models and two proprietary frontier models. The finding that MedGemma, specifically trained on medical image+text data, suffers the same failures as the non-adapted models is the point that commercial teams would have contested until now. The authors conclude: "Domain adaptation alone is insufficient to ensure genuine visual grounding." A strong claim, now supported.

The chosen metrics are the right ones. NFR (Negative Flip Rate, Yan et al. CVPR 2021) captures exactly what matters clinically: not average performance, but the risk that a correct decision flips under perturbation. Fleiss' κ over four prompts captures decisional stability. The non-parametric bootstrap confidence intervals are methodologically sound.

What's less good

Three specific limitations.

The dataset is small and from a single centre. The 1,000 radiographs are drawn from MIMIC-CXR, a corpus from Beth Israel Deaconess Medical Center in Boston with known biases (mostly adult population, specific acquisition equipment, local reporting conventions). The authors evaluate neither generalisation to another PACS or another reporting language, nor robustness on another modality (CT, MRI). This is the classic population bias. The limitation is acknowledged explicitly in the Limitations section, but that does not erase it.

The task is binary and case selection excludes real complexity. The target is a binary phenotype (normal vs abnormal) on cases carrying a single CheXpert pathology, whereas chest radiology in practice is multi-label, ambiguous and graded by severity. The resulting metrics could mislead in either direction: performance under Text Shift on harder cases might be even worse, or the protocol might underestimate situations where text would legitimately help the model disambiguate an equivocal image.

The distractor reports are synthetic, generated by GPT-5. A real prior report carries stylistic markers, author bias and chronological references that an LLM generator does not exactly reproduce. The degree to which these synthetic distractors over- or under-represent the real textual noise of a hospital record remains open. The authors acknowledge this in their limitations.

What it changes

For the medical-imaging AI research community, the paper raises the bar: a clinical VLM can no longer claim a clean AUC on a test cohort if its predictions collapse under Text Shift. Modality grounding must be demonstrated, not assumed. Three concrete consequences: future benchmarks (CheXpert, MIMIC, RSNA) should integrate an SMS protocol by default; evaluation comparators should include an honest text-only baseline (not only image-only, as is often the case, which flatters the multimodal model); peer-reviewed venues should require a prompt-sensitivity test for any clinical VLM published.

For clinicians and biomedical teams evaluating these tools for deployment, the message is operational: until this text dependency is solved, a clinical VLM can only be used as a second reader after a human who has read the image, never as an autonomous first-reader that would steer management from the image+order pair. Clinical agents that automatically stack the last report, the lab work and the prior imaging into the context window are particularly at risk: they accumulate off-topic text and switch off whatever real image reading remained.

For patients and the public, the takeaway is indirect but important. The marketing claim of clinical VLMs ("our model sees the radiograph like a doctor") does not survive this evaluation. That does not mean these models are worthless; it means benchmark performance has been confused with the ability to reason from the image, and that one more generation of adversarial evaluations is needed to know where these systems are ready to intervene.