BreastGPT: one multimodal model for the entire breast cancer care pathway — what a 90% score on a home-made benchmark is really worth (Liu et al. 2026, arXiv)

Published on June 5, 2026

On 3 June 2026, Yang Liu, Jiajin Zhang, Danyang Tu, Yingda Xia and colleagues (Alibaba DAMO Academy, Zhejiang University, Hupan Lab, and the West China and China Medical University hospitals) posted BreastGPT to arXiv: an 8-billion-parameter multimodal large language model presented as covering the entire breast cancer care pathway (screening, diagnosis, treatment planning) across five imaging modalities (mammography, ultrasound, MRI, CT, pathology slides) plus clinical text. Trained on 1.86 million question-answer pairs, a large share of them generated by Alibaba's own large models, it reaches 75.66% accuracy on multiple-choice questions and 89.92% on open-ended questions of its own BreastStage-Bench, far ahead of general-purpose models queried cold. It is a serious engineering demonstration, but most of the gap comes from training on the exact test distribution: against the only fair comparator the gain is just a few points, nothing was evaluated on real patients or compared against clinicians, and the "ground truth" is largely generated by the in-house models.

The context

Managing breast cancer is a staged pathway. At screening, one mostly reads mammograms (and, increasingly, picks up breast lesions opportunistically on chest CTs done for other reasons). At diagnosis, ultrasound, MRI and, for confirmation, the microscopic examination of a biopsy (pathology) follow one another. At treatment planning, subtype, extent and expected response are integrated. Each step calls on a different imaging modality and a different kind of reasoning.

AI has so far tackled these steps one at a time: one model for mammography, another for ultrasound, another for the slide. The authors start from an observation: there is neither a dataset nor a single model that crosses the whole pathway. Their proposal is a multimodal large language model (an MLLM: a language model able to "see" images on top of reading and writing text) queried as VQA (visual question answering: it is shown an image and asked a question, multiple-choice or open-ended). A single system, meant to answer across five modalities and three stages. The ambition is clear; what it is worth, once measured, is the question.

The method

The article is an arXiv preprint (2606.04911, posted 3 June 2026, under a CC BY-NC-SA license, not yet peer-reviewed). BreastGPT is built on Qwen3-VL, an Alibaba vision-language model, in its 8-billion-parameter version. The central trick is a dual-branch visual encoder with modality-based routing: a "standard" branch (Qwen3-VL's native image encoder) handles CT, MRI, ultrasound and mammography; a "gigapixel" branch handles pathology slides, which are images of several billion pixels. This second branch cuts the slide into tiles at high magnification, encodes each tile with CONCH (an encoder pre-trained on pathology images), then aggregates everything with LongNet, a "dilated" attention architecture designed for very long sequences.
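The modality-based routing can be pictured with a few lines of Python. This is only a minimal sketch, not the authors' code: the function names, the modality strings and the two stub encoders are hypothetical, and the real branches (Qwen3-VL's native encoder on one side, CONCH tiles aggregated by LongNet on the other) are replaced by placeholders that return a label.

```python
from typing import Callable, Dict

# Hypothetical stand-ins for the two visual branches described in the paper.
def standard_branch(image_id: str) -> str:
    """Placeholder for the native image encoder (CT, MRI, ultrasound, mammography)."""
    return f"standard-encoder({image_id})"

def gigapixel_branch(image_id: str) -> str:
    """Placeholder for the tile-then-aggregate pathway used for pathology slides."""
    return f"tile+aggregate-encoder({image_id})"

# Routing table: the modality alone decides which branch sees the image.
ROUTES: Dict[str, Callable[[str], str]] = {
    "ct": standard_branch,
    "mri": standard_branch,
    "ultrasound": standard_branch,
    "mammography": standard_branch,
    "pathology": gigapixel_branch,
}

def encode(modality: str, image_id: str) -> str:
    """Dispatch an image to its branch; unknown modalities are rejected explicitly."""
    try:
        branch = ROUTES[modality.lower()]
    except KeyError:
        raise ValueError(f"unknown modality: {modality!r}") from None
    return branch(image_id)

if __name__ == "__main__":
    print(encode("mammography", "img-001"))
    print(encode("pathology", "slide-042"))
```

The point of the design is that the language model downstream never needs to know which branch produced the visual tokens; only the dispatch step depends on the modality.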

To avoid drowning the language model under tens of thousands of tiles, the authors adapt a "concept-preserving" token compression technique: instead of passing every tile embedding to the language model, it selects 128 visual tokens that maximize the coverage of useful information. The method needs no extra training. Routing between tasks does not go through specialized heads but through system prompts that tell the model the stage and the task. Training used 32 H100 GPUs over a little more than three days.
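To give an intuition for what "selecting a fixed budget of tokens that covers the information" can mean, here is a small training-free sketch. It is an assumption for illustration only: the article does not detail the paper's selection criterion, so greedy farthest-point sampling is swapped in as a simple stand-in, and the function name, the 16-dimensional random "tile embeddings" and the seed are all hypothetical.

```python
import math
import random
from typing import List, Sequence

def select_tokens(tokens: Sequence[Sequence[float]], budget: int) -> List[int]:
    """Greedy farthest-point sampling (illustrative stand-in, not the paper's method).

    Returns the indices of at most `budget` tokens, each new one chosen as the
    token farthest from everything already kept, so the kept set spreads over
    the feature space instead of clustering on redundant tiles.
    """
    n = len(tokens)
    if budget <= 0 or n == 0:
        return []
    if budget >= n:
        return list(range(n))
    chosen = [0]  # arbitrary starting token
    # min_dist[i] = distance from token i to its nearest already-chosen token
    min_dist = [math.dist(tokens[i], tokens[0]) for i in range(n)]
    while len(chosen) < budget:
        nxt = max(range(n), key=min_dist.__getitem__)
        if min_dist[nxt] == 0.0:
            break  # every remaining token duplicates a chosen one
        chosen.append(nxt)
        for i in range(n):
            d = math.dist(tokens[i], tokens[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return chosen

if __name__ == "__main__":
    random.seed(0)
    # Hypothetical tile embeddings: 2,000 tiles, 16 dimensions each.
    tiles = [[random.gauss(0.0, 1.0) for _ in range(16)] for _ in range(2000)]
    kept = select_tokens(tiles, budget=128)
    print(len(kept), "tokens kept out of", len(tiles))
```

Like the technique described above, this selection has no learned parameters; whether it preserves the clinically relevant "concepts" of a slide is exactly what a real method has to demonstrate.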

On data, the "BreastStage" corpus gathers about 662,000 images, 136 task templates and 1.86 million instruction pairs, from 17 sub-datasets covering five modalities (split: 57.9% screening, 36.7% diagnosis, 5.4% treatment). The image sources are mostly public — CT-RATE for CT (20,546 female volumes), BUS-CoT for ultrasound (11,439 images), a subset of EMBED for mammography, and BCNB, TCGA-BRCA and TCGA-HISTAI for pathology (2,510 slides) — with a single private MRI cohort, from two hospitals, annotated by ten breast specialists, whose Chinese reports were machine-translated. Crucially, a large share of the text (open-ended questions, captions, simulated reports) is not written by humans but generated by Alibaba's own models (Qwen2.5-VL-72B for image-related decisions, Qwen3-Max for text transforms).

The results

On its own benchmark, BreastStage-Bench (12,182 test cases, split at the patient level), BreastGPT reaches 75.66% accuracy on multiple-choice questions and 89.92% on open-ended ones. General-purpose models queried without specific training are far behind: GPT-5.4 at 54.0% / 53.6%, dedicated medical models such as Lingshu at 50.4%. It is this contrast that the abstract highlights.
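A patient-level split means that all records of a given patient land on the same side of the train/test boundary. A minimal sketch of the idea follows, assuming each record carries a `patient_id` field; the field names, the hash-based assignment and the 20% test fraction are hypothetical choices for illustration, not the authors' pipeline.

```python
import hashlib
from typing import Dict, List, Tuple

Record = Dict[str, str]

def split_by_patient(records: List[Record], test_fraction: float = 0.2
                     ) -> Tuple[List[Record], List[Record]]:
    """Assign every record to train or test based on a hash of its patient ID.

    Because the decision depends only on the patient ID, two images of the same
    patient can never end up on opposite sides of the split.
    """
    train: List[Record] = []
    test: List[Record] = []
    for rec in records:
        digest = hashlib.sha256(rec["patient_id"].encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
        (test if bucket < test_fraction else train).append(rec)
    return train, test

if __name__ == "__main__":
    # Hypothetical records: 50 patients with 3 images each.
    records = [
        {"patient_id": f"P{p:03d}", "image_id": f"P{p:03d}-{k}"}
        for p in range(50) for k in range(3)
    ]
    train, test = split_by_patient(records)
    overlap = {r["patient_id"] for r in train} & {r["patient_id"] for r in test}
    print(len(train), len(test), "patients on both sides:", len(overlap))
```

Note what this does and does not guarantee: no patient is shared between train and test, but question templates and generation pipelines can still be shared, which is a separate source of optimism in the scores.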

But the figure that really matters lies elsewhere, and the authors are honest enough to provide it: an 8-billion-parameter Qwen3-VL, identical but simply fine-tuned on the same data, already reaches 68.21% / 88.24%. The genuine contribution of the dual-branch architecture and token compression thus shrinks to about 7 points on multiple-choice questions and less than 2 points on open-ended ones. The bulk of the gain is not the architecture: it is having trained the model on the exact test distribution. The architectural benefit is real mainly in pathology, where the gigapixel branch lifts accuracy from 60.4 to 71.4%.
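The arithmetic behind this paragraph is easy to reproduce. The sketch below only uses the scores quoted in this article; the dictionary keys and the `gap` helper are hypothetical names, and the GPT-5.4 pair is read as multiple-choice / open-ended in that order.

```python
# Scores as quoted in this article (percent accuracy).
scores = {
    "BreastGPT": {"mcq": 75.66, "open": 89.92},
    "Qwen3-VL-8B fine-tuned": {"mcq": 68.21, "open": 88.24},
    "GPT-5.4 zero-shot": {"mcq": 54.0, "open": 53.6},
}

def gap(model_a: str, model_b: str, metric: str) -> float:
    """Difference in percentage points between two models on one metric."""
    return round(scores[model_a][metric] - scores[model_b][metric], 2)

if __name__ == "__main__":
    for metric in ("mcq", "open"):
        headline = gap("BreastGPT", "GPT-5.4 zero-shot", metric)
        fair = gap("BreastGPT", "Qwen3-VL-8B fine-tuned", metric)
        share = 100 * fair / headline
        print(f"{metric}: headline gap {headline} pts, "
              f"architecture gap {fair} pts ({share:.1f}% of the headline gap)")
    # Pathology multiple-choice gain attributed to the gigapixel branch.
    print("pathology:", round(71.4 - 60.4, 1), "pts")
```

On multiple-choice questions the architecture accounts for 7.45 of the 21.66 points separating BreastGPT from GPT-5.4, and on open-ended questions for 1.68 of 36.32: the rest is already obtained by plain fine-tuning on the same data.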

Clinical translation. Here one must be blunt: these percentages translate into nothing clinical. A 75% accuracy on a multiple-choice quiz is neither a screening sensitivity nor a specificity; it does not say how many cancers would be missed or how many false alarms raised on real patients. No performance was measured on a clinical endpoint, no comparison to a radiologist or pathologist was made, and the evaluation runs entirely on data of the same origin as the training. In other words, BreastGPT does well on questions built like those it has seen. That is encouraging for a prototype, but it tells us almost nothing about what it would do on a real case.
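Why an accuracy figure says so little about screening can be shown with two invented confusion matrices. All counts below are hypothetical (1,000 screened women, 10 cancers) and have nothing to do with BreastGPT's actual outputs; they only illustrate that the same 75% accuracy is compatible with very different sensitivities.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Confusion:
    tp: int  # cancers correctly flagged
    fn: int  # cancers missed
    tn: int  # healthy cases correctly cleared
    fp: int  # false alarms

    @property
    def accuracy(self) -> float:
        return (self.tp + self.tn) / (self.tp + self.fn + self.tn + self.fp)

    @property
    def sensitivity(self) -> float:
        return self.tp / (self.tp + self.fn)

    @property
    def specificity(self) -> float:
        return self.tn / (self.tn + self.fp)

# Two hypothetical readers with identical accuracy on the same population.
careful = Confusion(tp=9, fn=1, tn=741, fp=249)   # finds 9 cancers out of 10
blind = Confusion(tp=0, fn=10, tn=750, fp=240)    # misses every cancer

if __name__ == "__main__":
    for name, c in (("careful", careful), ("blind", blind)):
        print(f"{name}: accuracy={c.accuracy:.2f} "
              f"sensitivity={c.sensitivity:.2f} specificity={c.specificity:.2f}")
```

Both readers score 0.75 in accuracy, yet one detects 90% of cancers and the other none. This is why clinical evaluation reports sensitivity and specificity on a defined population rather than a single accuracy number.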

What works well

Real engineering for the gigapixel, and a measurable gain where it counts. Making "normal" radiology images and multi-billion-pixel pathology slides coexist in a single model is a hard technical problem. The CONCH + LongNet + 128-token compression combination is a careful answer, and it is on pathology that the architectural contribution is clearest (60.4 → 71.4% on multiple-choice questions). The idea of a single assistant that follows the pathway rather than siloed tools is, in principle, the right direction.

Scale and, above all, an honest ablation comparator. The corpus is massive and documented (662,000 images, 17 sub-datasets, five modalities). And the authors do not stop at beating general-purpose models: they report the performance of a Qwen3-VL simply fine-tuned on their data. It is precisely this comparator that lets the reader see that the true architectural gain is modest — providing this figure is to their credit, as many teams would omit it.

Transparency on status and limits. The paper states explicitly that BreastGPT is a "research prototype", not clinically validated, not reviewed by a regulatory authority, and that it should not be taken for an autonomous diagnostic system. It acknowledges that the data are not longitudinal (rarely the same patient followed end to end), recommends site-specific validation, and announces the release of the code and benchmark under a non-commercial license.

What works less well

A biased comparator and a misleading metric in the framing. Comparing a model trained on the test distribution to general-purpose models queried cold (GPT-5.4 "at only 49.3%") is an unbalanced comparison: a match between a candidate who has seen the past papers and candidates discovering the exam. The biased comparator inflates the announced gap ("over 25 / 35 / 40%" by stage), whereas the only fair comparator, their own fine-tuned model, brings the real gain back to a few points. Leading with the large gap rather than the small one makes the headline metric misleading.

A circular evaluation, with a risk of data leakage and shortcut learning. The benchmark is built from the same 17 datasets as the training, by the same team, with the same generation pipeline. The split is at the patient level, but a given image can reappear in different tasks, and the report templates recur: fertile ground for data leakage and shortcut learning, where the model learns the style of the questions rather than the medicine. Worse, the "ground truth" of open-ended questions is generated by Alibaba's own models, then scored by an evaluator also based on a language model: the grader and the candidate share the same family, which mechanically rewards "Qwen"-style answers. The most sensitive subsets are tiny (113 open-ended questions and 70 captions in pathology), making the claims on the "treatment" stage fragile.
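How fragile a score on 113 or 70 items is can be quantified with a binomial confidence interval. The sketch below uses the standard Wilson score interval; the observed success counts (about 90% correct) are an assumption chosen for illustration, not figures reported for these subsets.

```python
import math
from typing import Tuple

def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie between 0 and n")
    p = successes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

if __name__ == "__main__":
    # Hypothetical observed counts: roughly 90% correct on each subset.
    for label, successes, n in (("113 open-ended questions", 102, 113),
                                ("70 captions", 63, 70),
                                ("12,182 test cases", 10964, 12182)):
        lo, hi = wilson_interval(successes, n)
        print(f"{label}: {100 * lo:.1f}% to {100 * hi:.1f}% "
              f"(width {100 * (hi - lo):.1f} pts)")
```

Under that assumption the interval is on the order of ten percentage points wide for the small subsets, against about one point for the full benchmark: differences of a few points between models on the smallest subsets are within sampling noise.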

No patient, no clinician, and blind spots in population and governance. The evaluation is entirely benchmark-based: no result on real patients, no clinical endpoint, no direct comparison to a radiologist or pathologist. The three breast specialists involved only audited data quality and never faced the model. The cohorts are mostly Chinese and from specific centers (the MRI cohort, private, comes from two hospitals), leaving intact the risk of population bias and of collapse on other scanners or in other countries. Finally, neither funding nor conflicts of interest are declared, even though the base model, the data-generating models and the grader are all products of the same industrial group, a dependency that deserves to be made explicit.

What this changes

For the research community, BreastGPT cuts both ways. On one hand, it is a reusable blueprint for multimodal assistants that follow a clinical pathway, and the gigapixel brick (CONCH + LongNet + token compression) is transferable to other cancers. On the other, it is a textbook case of the limits of the "home-made benchmark": when the team that trains the model also builds the test set, the ground truth and the grader, spectacular scores lose their value as evidence. The need for external, independent, human-annotated benchmarks has never been clearer.

For clinicians, the tool has no immediate reach: an unvalidated prototype, with no comparison to expert reading, no prospective test, no regulatory status. The idea of a single assistant accompanying screening, diagnosis and decision is appealing in the medium term, but it requires supplying everything missing here: evaluation on real patients, against real doctors, in real hospitals.

For patients and the public, the message fits in one sentence: a 90% score on a benchmark is not a safe medical tool. Multimodal language models can produce fluent, plausible answers while being wrong or hallucinating details — a risk all the more serious because the "ground truth" of this work was itself written by models. A well-phrased answer is not a correct answer, and an impressive prototype is not a validated device.