MCEN: predicting complete response to breast cancer chemotherapy from a biopsy, with the Mamba architecture (Zhang et al. 2026, npj Digital Medicine)

Published on June 4, 2026 · 11 min read

Wenchuan Zhang, Shuwan Zhang, Fengling Li, Yuanyuan Zhao, Jing Fu, Xiuli Xiao, Ting Yin, Qingjie Lv, Yuhao Yi and Hong Bu (West China Hospital, Sichuan University, and four other Chinese hospitals) published MCEN in npj Digital Medicine on 2 June 2026. MCEN is a Mamba-based deep learning model that predicts, from a needle biopsy scanned as a digital slide, whether a breast cancer patient will achieve a pathological complete response after neoadjuvant chemotherapy. Developed on 1,023 patients from a single hospital and tested on four independent centers (1,646 patients in total), it reaches an AUROC of 0.923 on the training set, falls to 0.76–0.81 on external validation, and climbs to as much as 0.84 when routine clinicopathological data are added. It is a noteworthy demonstration of Mamba's value for digital pathology and of genuine multicenter validation; it must nonetheless be read against a marked train–validation gap, an exclusively Chinese cohort, exclusion criteria that drop atypical forms, and the absence of any head-to-head comparison against pathologists.

The context

For many breast cancers, chemotherapy is given before surgery: this is neoadjuvant chemotherapy. The aim is to shrink the tumor, make breast-conserving surgery possible, and test the tumor's sensitivity to treatment in real time. The best possible outcome has a name: pathological complete response (pCR), defined as the absence of any residual invasive cancer in the breast and the axillary nodes when the surgical specimen is examined. Patients who reach pCR generally have a much better prognosis; conversely, predicting in advance who will not respond would spare months of toxic, futile chemotherapy or prompt an alternative strategy from the outset.

The trouble is that this prediction is hard. The classic tools — molecular subtype, the Ki-67 proliferation index, tumor-infiltrating lymphocyte (TIL) levels on slides, gene signatures, MRI radiomics — each capture one facet, but their manual assessment suffers from strong inter-observer variability and does not capture the spatial complexity of the tumor microenvironment. Digital pathology (the computational analysis of histology slides scanned at high resolution, called whole-slide images or WSIs) has opened another route: convolutional neural networks (CNNs) have already learned to predict pCR from the initial biopsy. But a WSI is a gigapixel image — billions of pixels — and transformer architectures, which excel at modeling long-range dependencies through the attention mechanism, have a compute cost that grows with the square of the sequence length: impractical at this scale. It is this bottleneck the team proposes to break with Mamba.
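To make the scaling argument concrete, here is a minimal sketch that counts the pairwise interactions full self-attention would need against the number of steps a linear-time scan would take. The tile counts are illustrative assumptions, not figures from the paper, and the function names are hypothetical.

```python
def attention_pairs(n_tiles: int) -> int:
    """Pairwise tile interactions in full self-attention: grows as n^2."""
    return n_tiles * n_tiles


def scan_steps(n_tiles: int) -> int:
    """Sequential state updates in a linear-time scan: grows as n."""
    return n_tiles


# Hypothetical tile counts for one slide; not taken from the paper.
for n in (1_000, 10_000, 50_000):
    ratio = attention_pairs(n) // scan_steps(n)
    print(f"{n:>6,} tiles: {attention_pairs(n):>13,} pairs vs {scan_steps(n):>6,} steps (x{ratio:,})")
```

The ratio between the two is simply the number of tiles, so the penalty for attention grows exactly as slides get larger.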

The method

The article (npj Digital Medicine, 10.1038/s41746-026-02849-2, received 28 January, accepted 26 May, published 2 June 2026, open access under a CC BY-NC-ND license) presents MCEN, for Mamba-based model for Chemotherapy Efficacy using Needle biopsy. Mamba is a selective state space model: instead of comparing every element to all others as attention does, it scans the sequence while maintaining a compressed internal state that it updates at each step, which gives it linear complexity while keeping a global receptive field. On a gigapixel slide cut into tens of thousands of small tiles, this is what keeps slide-wide aggregation computationally tractable.
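The scanning idea can be sketched in a few lines. What follows is a deliberately reduced toy, a one-dimensional state with scalar parameters, and not the authors' implementation or the real Mamba kernel. The names selective_scan, w_delta and c are hypothetical. The only point is to show a state of constant size being updated once per element, with an update rule that depends on the input itself (the "selective" part).

```python
import math


def softplus(z: float) -> float:
    """Smooth positive function log(1 + exp(z)); keeps the step size above zero."""
    return math.log1p(math.exp(z))


def selective_scan(xs: list[float], w_delta: float = 1.0, c: float = 1.0) -> list[float]:
    """Toy one-dimensional selective scan (assumed simplification, scalar state).

    h_t = a_t * h_{t-1} + b_t * x_t, where a_t and b_t are computed from x_t.
    The sequence is read once, so the cost is linear in len(xs).
    """
    h = 0.0
    ys = []
    for x in xs:
        delta = softplus(w_delta * x)  # input-dependent step size ("selective")
        a = math.exp(-delta)           # fraction of the previous state that is kept
        b = 1.0 - a                    # fraction of the current input that is written
        h = a * h + b * x              # constant-size state update
        ys.append(c * h)
    return ys


# Made-up input sequence standing in for one feature across successive tiles.
outputs = selective_scan([0.2, 1.5, 0.0, 0.7])
```

Each output depends on everything read so far through the state h, which is how a scan keeps a global view without comparing every pair of tiles. A bidirectional variant, as in MCEN, would run one such pass in each direction.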

The pipeline has three stages. First the biopsy WSI is cut into tiles. Then each tile is encoded by CONCH, a vision-language encoder pre-trained specifically on pathology images (the authors compared it to three other extractors — CTransPath, Phikon, ViT-S/16 — and CONCH gets the best AUROC, 0.780, vs 0.677 for ViT-S/16). Finally an online re-embedding module (a transformer block that re-tunes the representations to the slide's context) feeds a bidirectional Mamba aggregation following the multiple instance learning principle (MIL: there is only one label for the whole slide, and the model learns to weight the relevant tiles without pixel-by-pixel annotation). MCEN is compared to reference MIL methods — ABMIL, CLAM, TransMIL, plus simple mean and max pooling — and beats them, while cutting inference time by 23.1% relative to TransMIL.
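The multiple instance learning principle can be illustrated with attention-weighted pooling, in the spirit of the ABMIL baseline cited above rather than MCEN's Mamba aggregator. This is a sketch under stated assumptions: the embeddings, the vectors w_attn and w_clf, and the function names are all hypothetical and stand in for quantities a real model would learn.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Numerically stable softmax: weights are positive and sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u: list[float], v: list[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def mil_slide_probability(tiles, w_attn, w_clf, bias=0.0):
    """Pool tile embeddings into a single slide-level probability.

    tiles  : one embedding (list of floats) per tile
    w_attn : hypothetical learned vector scoring each tile's relevance
    w_clf  : hypothetical learned vector classifying the pooled embedding
    """
    weights = softmax([dot(t, w_attn) for t in tiles])  # one weight per tile
    dim = len(tiles[0])
    pooled = [sum(w * t[d] for w, t in zip(weights, tiles)) for d in range(dim)]
    logit = dot(pooled, w_clf) + bias
    return 1.0 / (1.0 + math.exp(-logit)), weights


# Three made-up 2-D tile embeddings; real encoders output far larger vectors.
tiles = [[0.1, 0.0], [2.0, 1.0], [0.2, 0.1]]
prob, weights = mil_slide_probability(tiles, w_attn=[1.5, 0.5], w_clf=[1.0, 1.0])
```

Only the slide-level probability is compared with the label during training; the per-tile weights emerge as a by-product, which is why no tile-level annotation is needed.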

On data, 1,646 patients come from five hospitals: West China (WC, n=1,023), Shengjing (SJ, n=306), Shanxi Cancer (SXC, n=187), Sichuan Provincial People's (SCPP, n=80) and Southwest Medical University (ASWMU, n=50). The WC cohort is randomly split into training (n=819) and internal validation (n=204), with a 27.5% pCR rate in both; the four other centers serve as independent external tests. The authors apply strict exclusion criteria (no bilateral, multifocal, or rare subtypes such as lobular, mucinous or tubular carcinomas), stain normalization, early stopping and dropout against overfitting, and random-forest imputation for missing data. A complementary arm fuses the MCEN score with clinicopathological variables via an XGBoost model interpreted by SHAP.
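The cohort figures can be checked with simple arithmetic. The sizes below are the ones quoted above; the pCR counts are approximations derived here from the rounded 27.5% rate, not numbers reported in the article.

```python
# Cohort sizes as quoted in the article.
cohorts = {"WC": 1023, "SJ": 306, "SXC": 187, "SCPP": 80, "ASWMU": 50}

total = sum(cohorts.values())        # all five hospitals
external = total - cohorts["WC"]     # the four independent test centers

train, internal = 819, 204           # random split of the WC cohort
pcr_rate = 0.275                     # rounded rate quoted for both WC subsets

# Approximate counts only: derived from the rounded rate, not reported figures.
approx_pcr_train = round(train * pcr_rate)
approx_pcr_internal = round(internal * pcr_rate)

print(f"total={total}, external={external}, WC split={train}+{internal}")
print(f"approx. pCR cases: training ~{approx_pcr_train}, internal ~{approx_pcr_internal}")
```

The split is consistent (819 + 204 = 1,023), and the external centers together contribute 623 patients, a little over a third of the total.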

The results

MCEN reaches an AUROC of 0.923 on the training cohort (AUROC, the area under the ROC curve, measures the ability to tell a responder from a non-responder: 1.0 is perfect, 0.5 is chance), but 0.78 on internal validation and a range of 0.761 to 0.809 across the four external centers. Adding routine clinicopathological data lifts these figures: 0.937 in training, 0.811 in validation, and up to 0.84 externally. The model's score clearly separates the groups — a mean of 0.771 in responders vs 0.212 in non-responders in the training cohort, a significant gap (p < 0.05) maintained across all external centers. On multivariate analysis, molecular subtype and the MCEN score both emerge as independent predictors, and the attention maps show the model focuses mostly on fibrosis and stroma regions. Performance is weaker in the HR–/HER2+ and HR–/HER2– subgroups of some centers, likely for lack of numbers.
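For readers who want the AUROC definition in operational form, here is a minimal sketch that computes it as the fraction of responder/non-responder pairs in which the responder gets the higher score, with ties counted as one half. The scores and labels are made up for illustration and have no connection to the study data.

```python
def auroc(scores, labels):
    """AUROC as a pairwise ranking probability.

    Returns the share of (responder, non-responder) pairs in which the
    responder has the higher score; ties count for 0.5.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one responder and one non-responder")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up model scores; label 1 = pCR, label 0 = no pCR.
scores = [0.9, 0.4, 0.7, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
print(f"AUROC = {auroc(scores, labels):.3f}")  # 8 of 9 pairs correctly ordered
```

In this toy case one responder (score 0.4) is ranked below one non-responder (score 0.6), so 8 of the 9 pairs are ordered correctly. The double loop is quadratic and is meant for clarity, not for large cohorts.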

Clinical translation. In this population, about one patient in four achieves pCR. An AUROC of 0.76–0.81 under external conditions corresponds to moderate discrimination: the model ranks clearly better than chance but stays far from certainty, so some responders will be flagged as likely non-responders and some non-responders will be wrongly predicted to respond. Concretely, such a score cannot on its own justify de-escalating or intensifying chemotherapy; it is meant to add to subtype, stage and Ki-67 to refine a probability, not to replace them. It should also be remembered that pCR is a surrogate endpoint: it correlates with a better prognosis, but it is not survival itself.
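To see what errors in both directions mean at a pCR rate of 27.5%, the sketch below converts a sensitivity and a specificity into positive and negative predictive values. The article reports no operating point, so the 0.75/0.75 pair used here is a purely hypothetical assumption chosen for illustration; only the prevalence comes from the text.

```python
def predictive_values(sensitivity: float, specificity: float, prevalence: float):
    """Return (PPV, NPV) from sensitivity, specificity and prevalence."""
    tp = sensitivity * prevalence                # responders predicted to respond
    fn = (1.0 - sensitivity) * prevalence        # responders missed
    tn = specificity * (1.0 - prevalence)        # non-responders correctly identified
    fp = (1.0 - specificity) * (1.0 - prevalence)  # non-responders predicted to respond
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    return ppv, npv


# Prevalence (27.5% pCR) is from the article.
# Sensitivity and specificity of 0.75 are HYPOTHETICAL, not reported values.
ppv, npv = predictive_values(sensitivity=0.75, specificity=0.75, prevalence=0.275)
print(f"PPV = {ppv:.2f}, NPV = {npv:.2f}")
```

Under these assumed values, only about half of the patients predicted to respond would actually reach pCR, while close to nine in ten predicted non-responders would indeed not respond. The exact figures would shift with the real operating point, but the asymmetry is a direct consequence of the low base rate.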

What works well

A genuine multicenter external validation. This is the strong point. The model is trained at a single hospital, then evaluated without tuning on four independent cohorts of differing sizes and practices (306, 187, 80 and 50 patients), with performance that holds (0.761–0.809). Most AI pCR studies settle for internal validation on a single small cohort; here the model actually passes the cross-center test, the step at which deployment most often fails.

An efficient architecture, fit to the problem, on an already available specimen. Mamba brings linear complexity where transformer attention chokes on gigapixel slides, cutting inference time by 23.1% relative to TransMIL without sacrificing performance. Above all, the input is the pre-treatment needle biopsy: the information is available at the very moment the strategy is decided, with no extra exam.

Methodological honesty and public code. The authors frankly report the drop between training and validation, show in multivariate analysis that the MCEN score stays predictive independently of subtype, justify the choice of the CONCH encoder with a quantified comparison, and release their code on GitHub for academic use. Fusion with clinical variables is presented as complementary, not as a replacement.

What works less well

A train–validation gap that calls for caution on the headline figure. Going from 0.923 in training to 0.78 in internal validation, then 0.76–0.81 externally, is a classic sign of optimistic bias: the 0.923 mostly reflects the fit to data the model has already seen, not the performance to expect elsewhere. Quoting the upper bound as the headline result would be misleading; the honest value, the one that matters to a patient, is the external range, and it reflects only moderate discrimination on an imbalanced task (27.5% pCR).

Population bias, exclusions, and possible confounding by subtype. The five centers are Chinese and tertiary: nothing guarantees generalization to other populations, scanners or staining protocols, and the authors acknowledge it. The strict exclusion criteria (lobular, mucinous, tubular carcinomas, bilateral or multifocal forms dropped) restrict the model to invasive carcinoma of no special type alone — a selection bias that makes it inapplicable as-is to atypical forms. Finally, pCR depends very strongly on molecular subtype (high in triple-negative and HER2+ tumors, low in HR+/HER2–): since the model leans on fibrosis and stroma, which covary with subtype, one must ask how far it learns the biology of response rather than a shortcut correlated with subtype (shortcut learning). The multivariate analysis argues for real added value, but the question warrants subtype-specific analyses, which the authors themselves call for.

No pathologist on the other side, retrospective, and a surrogate endpoint. No head-to-head human-machine comparison is reported: the comparator remains algorithmic (other MIL methods, clinical models). The study is entirely retrospective, with no prospective validation or pragmatic clinical trial, and concerns pCR — a surrogate correlated with survival, not survival itself. The input is limited to the needle biopsy (transfer to surgical specimens or other tumors remains to be established), the license is CC BY-NC-ND (no commercial use, no derivatives) and no CE marking or regulatory clearance is mentioned. To the work's credit: funding is public (Chinese provincial funds and the NSFC) and the authors declare no conflicts of interest.

What this changes

For the research community, MCEN adds a solid building block to a movement already under way: Mamba and state space models are credible alternatives to transformers for MIL aggregation on gigapixel slides, with a measured efficiency gain. Releasing the code and showing that validation across four centers is attainable give others a base to build on, ideally with multinational cohorts and subtype-specific analyses.

For clinicians, the tool is not deployable today: it is retrospective, with no comparison to expert reading, no prospective validation and no regulatory status. Its potential medium-term value is clear: providing, from the initial biopsy, a response probability that adds to subtype and Ki-67 when discussing de-escalation in likely responders or an alternative strategy in likely non-responders. Realizing that value requires prospective validation and confirmed benefit on hard endpoints.

For patients and the public, the promise is that of more personalized chemotherapy, read from a specimen already taken, with no extra procedure. Caution remains warranted: a model that performs well on retrospective Chinese slides is not, as it stands, validated to guide a treatment, and moderate discrimination means errors in both directions. A prediction is not a decision, and the therapeutic choice remains the responsibility of the care team.