Medical AI

10,000 synthetic cases against four frontier LLMs: what Auger 2026 reveals about the clinical blind spots of Gemini 3 and GPT-5 in multiple sclerosis

Published on May 23, 2026 · 9 min read

Stephen D. Auger (Imperial College London) posted a preprint on medRxiv on 22 April 2026 reporting a very large-scale evaluation of four state-of-the-art generative models (Google's Gemini 3 Pro and Flash, OpenAI's GPT-5.2 and GPT-5-mini) across up to 10,000 programmatically generated synthetic multiple sclerosis cases, with ground-truth labels validated by subspecialists. The central finding fits in one sentence: diagnostic accuracy does not predict the safety of therapeutic recommendations. Even when the diagnosis is correct, the models can still recommend high-dose corticosteroids in an infected patient or intravenous thrombolysis for MS. Both are inappropriate, and the latter is outright dangerous. The paper is important reading because it offers a scalable stress-test method and shifts the debate from MCQ benchmarks to real operational safety.

The context

Large language models (LLMs, generative language models trained on massive text corpora) now reach 90% or more on the United States Medical Licensing Examination MCQs, on MedQA, and on the NEJM Image Challenge. Accompanying press releases have suggested solid clinical reasoning. But a parallel, quieter literature has been accumulating since 2024, and it shows that these MCQ scores do not transfer to practice: performance that collapses when vignettes are slightly modified, hallucinated bibliographic references, extreme sensitivity to prompt wording, and (the central point of this paper) a dissociation between the ability to name the diagnosis and the ability to choose the right management.

Measuring this dissociation at scale runs into a logistical wall. Real cases are rare, ground-truth labeling is expensive, and case diversity is limited by hospital cohort inclusion biases. Multiple sclerosis (MS) is a particularly useful test bed here: it has formalized diagnostic criteria (the 2017 McDonald criteria, 2024 revision), a stereotyped clinico-anatomical mapping (lesions disseminated in space and time; spinal cord, optic, brainstem, and hemispheric syndromes), and validated therapeutic strategies (high-dose corticosteroids in relapse, disease-modifying treatments, well-identified contraindications). Auger leverages this regularity to generate thousands of plausible cases with verifiable labels, something no single hospital could provide and existing public benchmarks did not offer either.

The method

The study was conducted by Stephen D. Auger, clinical neurologist and researcher at the UK Dementia Research Institute Care Research and Technology Centre at Imperial College London, with clinical activity at Imperial College Healthcare NHS Trust. The preprint was deposited on medRxiv on 22 April 2026, DOI 10.64898/2026.04.22.26351488.

The setup has three bricks. First brick: a procedural generator of MS clinical cases, which systematically combines symptoms (visual, sensory, motor, ataxic, sphincter disturbances), examination signs, ancillary results (brain and spinal MRI, CSF oligoclonal bands, serologies, evoked potentials), and plausible comorbidities. Each case is labeled with structured ground truth: most likely diagnosis, anatomical localization of the lesion(s), recommended investigations, expected therapeutic management. The system is parameterized to produce between 1,000 and 10,000 unique cases per run.
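To make the idea of a procedural generator concrete, here is a minimal sketch. Everything in it is an assumption made for illustration: the field names, the small vocabularies, and the labeling rule are hypothetical and are not taken from the preprint. It also does not enforce case uniqueness, which the real system is said to do.

```python
import random
from dataclasses import dataclass

# Hypothetical vocabularies; the preprint's actual lists are not reproduced here.
SYNDROMES = {
    "optic neuritis": "optic nerve",
    "partial myelitis": "spinal cord",
    "internuclear ophthalmoplegia": "brainstem",
    "hemisensory loss": "cerebral hemisphere",
}
COMORBIDITIES = ["none", "active urinary tract infection", "type 2 diabetes"]


@dataclass(frozen=True)
class SyntheticCase:
    # Presentation shown to the model.
    syndrome: str
    onset_days: int
    comorbidity: str
    oligoclonal_bands: bool
    # Structured ground truth attached at generation time.
    localization: str
    steroids_appropriate: bool


def generate_case(rng: random.Random) -> SyntheticCase:
    syndrome = rng.choice(sorted(SYNDROMES))
    onset_days = rng.randint(1, 60)
    comorbidity = rng.choice(COMORBIDITIES)
    # Illustrative labeling rule (an assumption, not clinical guidance):
    # recent onset and no active infection.
    steroids_ok = onset_days <= 14 and "infection" not in comorbidity
    return SyntheticCase(
        syndrome=syndrome,
        onset_days=onset_days,
        comorbidity=comorbidity,
        oligoclonal_bands=rng.random() < 0.9,
        localization=SYNDROMES[syndrome],
        steroids_appropriate=steroids_ok,
    )


def generate_cohort(n: int, seed: int = 0) -> list[SyntheticCase]:
    # A seeded generator makes a run reproducible.
    rng = random.Random(seed)
    return [generate_case(rng) for _ in range(n)]


if __name__ == "__main__":
    for case in generate_cohort(3, seed=42):
        print(case)
```

The design point is that the label is computed from the same parameters that produce the case, so the ground truth exists before any model is queried.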

Second brick: four frontier generative models — Gemini 3 Pro and Gemini 3 Flash (Google), GPT-5.2 and GPT-5-mini (OpenAI) — are queried on each case with a standardized prompt. The instructions request four outputs: anatomical localization of the lesion(s), ordered differential diagnosis, recommended investigations, and therapeutic management. The LLMs are not told the case is necessarily MS — they must infer it.

Third brick: a hybrid automated evaluator compares the LLM outputs to ground truth. It combines term matching (on controlled medical terms, with SNOMED-style synonym handling) and semantic comparison via vector embeddings (which captures paraphrases and equivalent wordings). This evaluator was validated on an initial cohort of 70 cases by blinded MS subspecialty clinicians, who judged two things: the realism of the synthetic cases, and the agreement between the automated evaluator and their own human judgment. Only after these two validations was the system scaled to 10,000 cases.
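The two-part scoring described above can be sketched as follows. This is not the preprint's evaluator: the synonym table, the equal weighting, and the function names are hypothetical, and a bag-of-words count vector stands in for a real embedding model so the example stays self-contained.

```python
import math
import re
from collections import Counter

# Hypothetical synonym table standing in for a controlled vocabulary.
SYNONYMS = {
    "methylprednisolone": {"methylprednisolone", "ivmp", "solu-medrol"},
    "thrombolysis": {"thrombolysis", "alteplase", "tpa"},
}


def tokens(text: str) -> list[str]:
    return re.findall(r"[a-z0-9-]+", text.lower())


def term_match(expected_term: str, answer: str) -> bool:
    """True if the expected term or one of its listed synonyms appears."""
    variants = SYNONYMS.get(expected_term, {expected_term})
    return any(tok in variants for tok in tokens(answer))


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: bag-of-words counts."""
    return Counter(tokens(text))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def hybrid_score(expected_term: str, reference: str, answer: str,
                 weight: float = 0.5) -> float:
    """Weighted mix of lexical match (0 or 1) and cosine similarity."""
    lexical = 1.0 if term_match(expected_term, answer) else 0.0
    semantic = cosine(embed(reference), embed(answer))
    return weight * lexical + (1 - weight) * semantic


if __name__ == "__main__":
    reference = "high-dose intravenous methylprednisolone"
    for answer in ("give intravenous methylprednisolone high-dose",
                   "immediate alteplase thrombolysis"):
        print(answer, "->", round(hybrid_score("methylprednisolone", reference, answer), 3))
```

A real pipeline would replace `embed` with a sentence-embedding model and calibrate the weight and decision threshold against clinician judgments, which is the role the 70-case validation plays in the study.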

The term ground truth refers, in AI evaluation, to the reference label against which the model's output is compared. The strength of this study is to propose a ground truth that is both clinically plausible and programmatic — hence available at scale, free of single-annotator bias.

The results

The main finding is a systematic dissociation between diagnostic accuracy and the safety of therapeutic recommendations. The four models correctly identify MS as the most likely diagnosis in the majority of cases — raw "MCQ-task" performance is respectable. But once therapeutic recommendations are examined, the picture deteriorates and reveals two opposite failure modes depending on the vendor.

Google side. Gemini 3 Flash recommends clinically appropriate corticosteroids in only 7.2% of cases (95% confidence interval: 5.6–8.8), and Gemini 3 Pro in 15.8% (13.6–18.1). For comparison, GPT-5-mini reaches 23.5% (20.8–26.1). More worryingly, Gemini models frequently recommend high-dose methylprednisolone in situations where it is contraindicated, in particular when the synthetic case explicitly mentions an active infection, or when symptoms are incidental, more than fourteen days old, or lacking time-of-onset information (a stabilized symptom is not a relapse and is not treated with acute-phase corticosteroids). The failure mode here is under-specificity: the model recognizes that the case is about MS, fires the default "relapse" protocol, and ignores the clinical modulators that should cancel it.
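The widths of these intervals can be sanity-checked with a normal-approximation (Wald) interval. The sample size is an assumption: this article does not state how many cases underlie each percentage, and n = 1,000 is simply a value, within the generator's stated range, for which the formula lands close to the published bounds. The preprint may well use a different interval method.

```python
import math


def wald_ci(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation 95% confidence interval for a proportion."""
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


# ASSUMPTION: cases per model for this analysis; not stated in the article.
N_ASSUMED = 1000

# (point estimate, reported lower bound, reported upper bound), from the article.
REPORTED = {
    "Gemini 3 Flash": (0.072, 0.056, 0.088),
    "Gemini 3 Pro": (0.158, 0.136, 0.181),
    "GPT-5-mini": (0.235, 0.208, 0.261),
}

if __name__ == "__main__":
    for model, (p, lo, hi) in REPORTED.items():
        ci_lo, ci_hi = wald_ci(p, N_ASSUMED)
        print(f"{model}: computed {100 * ci_lo:.2f} to {100 * ci_hi:.2f}, "
              f"reported {100 * lo:.1f} to {100 * hi:.1f}")
```

Under that assumption the computed bounds agree with the reported ones to within about 0.1 percentage point, which suggests the intervals are of the expected order for a sample of roughly a thousand cases.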

OpenAI side. The failure mode is opposite and much more alarming. GPT-5.2 recommends starting intravenous thrombolysis immediately (a treatment reserved for acute ischemic stroke, dangerous outside that indication) in 9.6% of MS cases, and GPT-5-mini in 6.4%. Both Gemini models stay below 1% for this aberrant recommendation. This is not a rounding error: across 10,000 cases, GPT-5.2 proposes a useless and potentially hemorrhagic thrombolysis in about 960 of them. The failure mode here is schema collision: the model confuses the acute neurological presentation of MS with that of acute ischemic stroke and triggers the corresponding protocol.

None of these errors can be detected by an MCQ benchmark where the question would be "what is the first-line treatment of an MS relapse?". They appear only when the model is asked to reason on a full case, in free interaction — which is what real practice demands.

Clinical translation. For 1,000 consecutive patients seen by an unsupervised LLM, GPT-5.2 would propose about 96 useless intravenous thrombolyses. According to the stroke literature, thrombolysis carries an intracranial hemorrhage risk on the order of 2 to 6%. Applied to those 96 patients, that is roughly two to six additional intracranial bleeds per cohort of 1,000, attributable to the routing error alone. Conversely, Gemini 3 Flash would fail to give an appropriate corticosteroid recommendation in about 928 out of 1,000 patients, potentially delaying neurological recovery for those in relapse. None of these scenarios has occurred in practice because none of these models is currently deployed autonomously in clinical care. That is precisely the paper's point: these flaws must be detected before deployment, not after.
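The arithmetic behind these per-1,000 figures is short enough to spell out. The rates are the ones quoted in this article; the function and variable names are only illustrative, and the calculation assumes the synthetic-case rates would carry over unchanged to real patients, which the article itself questions.

```python
def per_cohort(rate: float, cohort: int = 1000) -> float:
    """Expected number of events in a cohort, given a per-case rate."""
    return rate * cohort


# Rates quoted in the article.
THROMBOLYSIS_RATE_GPT52 = 0.096
ICH_RISK_RANGE = (0.02, 0.06)      # intracranial hemorrhage risk range
STEROID_APPROPRIATE_FLASH = 0.072

thrombolyses = per_cohort(THROMBOLYSIS_RATE_GPT52)
bleeds_low, bleeds_high = (thrombolyses * risk for risk in ICH_RISK_RANGE)
not_appropriate = per_cohort(1 - STEROID_APPROPRIATE_FLASH)

if __name__ == "__main__":
    print(f"Inappropriate thrombolyses per 1,000: {thrombolyses:.0f}")
    print(f"Expected intracranial bleeds: {bleeds_low:.2f} to {bleeds_high:.2f}")
    print(f"Cases without an appropriate steroid recommendation: {not_appropriate:.0f}")
```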

What's good

Three specific strengths.

The evaluation scale is unprecedented for clinical LLM testing. Historical public benchmarks (MedQA, MedMCQA, NEJM Image Challenge) sit at a few thousand questions at best, often contaminated by training data. 10,000 synthetic cases with structured ground truth, generated on the fly, solve the leakage problem (the models have not seen these cases) and allow rare error rates to be measured — which is precisely what clinical safety requires. A 1% error is invisible on 100 cases and obvious on 10,000.
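The claim about a 1% error rate can be quantified with a simple binomial model. This is a sketch that assumes errors occur independently from case to case; the function names are illustrative.

```python
import math


def prob_zero_events(p: float, n: int) -> float:
    """Probability of seeing no error at all in n independent cases."""
    return (1 - p) ** n


def expected_events(p: float, n: int) -> float:
    return p * n


def wald_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation 95% interval."""
    return z * math.sqrt(p * (1 - p) / n)


if __name__ == "__main__":
    p = 0.01
    for n in (100, 10_000):
        print(f"n={n}: expected errors={expected_events(p, n):.0f}, "
              f"P(no error observed)={prob_zero_events(p, n):.3g}, "
              f"95% half-width={100 * wald_half_width(p, n):.2f} points")
```

With 100 cases there is roughly a 37% chance of observing no error at all, and the uncertainty on the estimate is larger than the rate itself. With 10,000 cases about 100 errors are expected and the half-width drops to about 0.2 percentage points.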

The automated evaluator is blind-validated against experts. The upfront validation on 70 cases by MS subspecialty clinicians avoids the classic trap of self-referential evaluation (an LLM judged by another LLM, without human calibration). This methodological requirement is still far from standard in the clinical-LLM benchmarking literature, where reported "accuracy" is often that of a GPT-4 evaluator judging another GPT-4, an obvious conflict of interest in which the model is both judge and party.

The paper tests genuinely frontier 2026 models. Gemini 3 Pro/Flash and GPT-5.2/5-mini are the current versions at the time of writing. The clinical-LLM literature suffers from rapid obsolescence: a benchmark on GPT-3.5 published in 2023 teaches little of use in 2026. This paper will be informative at least until the next generation of models ships, and it establishes a reproducible methodology for evaluating them.

What's less good

Three precise limitations to keep in mind.

The cases are synthetic, so ecological validity is limited. A case generated programmatically, even when its realism has been validated by experts on a 70-case sample, is not a patient. It lacks the ambiguities, contradictions, missing information, and noise of real history-taking, and especially the longitudinal context (personal history, current treatments, full family background). The failure mode to flag here is population bias: performance measured on synthetic cases is probably an upper bound on performance against real cases, because synthetic cases are cleaner. Auger explicitly acknowledges this and proposes the generator as a pre-screening tool ahead of prospective cohort validation, not as a substitute.

The study covers a single pathology. MS was chosen for its formalized criteria and stereotyped mapping. Nothing guarantees that the conclusions transfer to settings where the differential is more open (general internal medicine, pediatrics, geriatric polypathology). Shortcut learning in LLMs — the tendency to learn spurious correlations — could behave differently depending on the statistical regularity of the pathology. Extension to at least three or four pathologies of contrasting specificity would be needed to speak of a generalizable method.

No human comparator and no prospective evaluation. The paper compares the LLMs against each other and against ground truth, but not against the performance of a real clinician facing the same synthetic case. So it is not clear whether 23.5% of appropriate steroid recommendations (GPT-5-mini) is "catastrophically low" or "comparable to a junior on call in the first hours". This question remains open, and any commentary that cites those numbers without a comparator will quickly tip into alarmism ("GPT-5 does worse than a beginner") or its inverse, blind enthusiasm ("23% is already better than a tired physician"). The classic misleading-metric trap looms: a percentage without a clinical reference point cannot be interpreted on its own.

Additional note: this is a medRxiv preprint, not yet peer-reviewed; the final version may evolve.

What it changes

For the clinical AI research community, the methodological signal is important. Clinical-LLM evaluations have leaned heavily on MCQs, which measure recall of medical knowledge but miss the riskiest dimension — the full decision chain, from diagnosis to prescription. This paper offers an operational framework to generate cases at volume, with ground truth, and an evaluator calibrated against human experts. It is a reusable methodological brick, and other teams should be expected to apply it to other pathologies in the coming months.

For clinicians and health authorities, the message is sober: none of the four models tested is, as it stands, fit for autonomous prescribing. The US FDA, the European EMA and France's HAS should treat this type of large-scale stress-test as a prerequisite for any approval of a generative AI device intended for clinical use. For vendors (Google, OpenAI, Anthropic, Mistral), the paper suggests that the next generation should be trained with an explicit therapeutic-safety objective, not just diagnostic accuracy. The distinction "knowing this is MS" versus "knowing what to do with MS" is exactly the boundary to instrument.

For patients and the public, the useful takeaway is: LLMs are not ready to replace a physician for prescribing, even when they give the right name to the disease. A consumer medical chatbot can correctly diagnose your condition while simultaneously suggesting a dangerous treatment. This dissociation is counter-intuitive — the linguistic fluency creates an illusion of global competence that masks the failures of the full chain — and it explains why real clinical uses still go through a physician who keeps control, and why consulting a chatbot without a physician remains, in 2026, a bad idea.