UNet-MoE-Cli: a mixture-of-experts to personalize neoadjuvant therapy in rectal cancer (Liu 2026, npj Digital Medicine)

Xiangyu Liu, Yuanling Tang, Song Zhang and colleagues (Xidian University, West China Hospital of Sichuan University, Institute of Automation of the Chinese Academy of Sciences) published UNet-MoE-Cli in npj Digital Medicine on 26 May 2026. It is a hard-gated mixture-of-experts deep learning model that combines pre-treatment multiparametric MRI and clinical variables to estimate, for each patient with locally advanced rectal cancer, the probability of pathological complete response (pCR) under three neoadjuvant regimens: standard chemoradiotherapy (nCRT), total neoadjuvant therapy (TNT) and chemotherapy alone (nCT). Across 855 patients (760 retrospective at three Chinese centres and 95 prospective under ChiCTR2400085797), AUC reaches 0.827 in internal validation and 0.790 in the prospective cohort. The model recommends treatment escalation for 53% of patients and de-intensification for 6%. An important read, because it advances the promise of data-driven oncology, but one to handle carefully: sensitivity tops out at 0.45–0.53, the estimated benefit of escalation is computed by the model itself, the nCT expert is trained on a single centre, and the cohort is entirely Chinese.

The context

Locally advanced rectal cancer (LARC, cT3-4 or cN+ without metastasis) accounts for about 40% of new rectal cancer diagnoses. Over the past twenty years, the standard of care has shifted from long-course preoperative chemoradiotherapy (nCRT, about 50 Gy in 25–28 fractions with capecitabine) to total neoadjuvant therapy (TNT, adding 4 to 6 cycles of CAPOX before or after irradiation), then to chemotherapy-only strategies (nCT, no radiation) for selected subgroups. The PRODIGE-23 (France, 2020), RAPIDO (Netherlands, 2020) and OPRA (United States, 2022) trials established TNT as the reference regimen for high-risk tumours, increasing the rate of pathological complete response (pCR, complete tumour disappearance on the surgical specimen) from 14% under nCRT to 28% under TNT.

The problem is that this intensification is applied at the population level, not the individual level. A patient who would have responded to nCRT alone is given months of extra chemotherapy and all its toxicity; a patient destined not to respond gets the same long protocol with no benefit. Current clinical scores (NCCN, MERCURY-2) stratify risk but do not predict response to each specific regimen. This is precisely the gap the paper targets: not a risk classifier, but a counterfactual response model, able to estimate "this patient would have an X% chance of pCR under TNT, Y% under nCRT, Z% under nCT".

The method

The study is led by Xin Wang (Cancer Center, West China Hospital, Sichuan University), Zhenyu Liu and Jie Tian (Institute of Automation, Chinese Academy of Sciences). Published on 26 May 2026 in npj Digital Medicine, DOI 10.1038/s41746-026-02798-w, under CC BY-NC-ND 4.0. Public Chinese funding (National Key R&D Program 2024YFF1207400, NSFC 62333022 and others). The authors declare no conflict of interest. ChatGPT was used for language editing.

The retrospective dataset comprises 760 patients treated between June 2015 and May 2022 across three Chinese centres: West China Hospital, Sun Yat-sen University Cancer Center and the Sixth Affiliated Hospital of Sun Yat-sen University. A prospective cohort of 95 patients was recruited between July 2024 and January 2025 across two of these centres plus a new site, Yunnan Cancer Hospital (registration ChiCTR2400085797 of 18 June 2024). The retrospective regimen distribution is imbalanced: 414 patients on nCRT, 258 on TNT, only 88 on nCT — this last arm coming from a single centre. Base pCR rates per regimen are 19% (nCRT), 30% (TNT), 20% (nCT). Inclusion: histologically confirmed adenocarcinoma, pre-treatment T2 MRI and ADC map, TME surgery with complete pathological evaluation.

The architecture, named UNet-MoE-Cli, combines three building blocks, as the name suggests: a UNet imaging backbone, a mixture of experts (MoE) and clinical variables. A mixture of experts is a composite model in which several sub-networks ("experts") each learn to model a sub-problem, and a "gate" mechanism selects the relevant expert for a given input. Here, each expert is dedicated to one regimen (TNT, nCRT, nCT) and the gate is hard, i.e. deterministic: the regimen choice selects the corresponding expert via argmax. The imaging backbone is nnUNet (Isensee et al., 2021), a self-configuring segmentation network trained multi-task to both delineate the tumour and extract features. MRI modalities (T2W + ADC) are projected into 64-D embeddings via modality-specific MLPs, concatenated with one-hot encoded clinical variables (cT, cN, EMVI, CRM, lateral lymph node involvement, location), then fed to the MoE. The objective combines a cross-entropy loss on pCR and a Dice loss on segmentation. Parameter count is not reported.
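To make the hard-gating idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the experts are toy logistic functions, the three-number feature vector stands in for the fused MRI embedding plus clinical variables, and every weight is a placeholder invented for illustration. It only shows how a one-hot gate routes a patient to one regimen-specific expert, and how querying all three experts on the same patient yields the per-regimen estimates discussed above.

```python
import math
from typing import Callable, Dict, List

REGIMENS = ("nCRT", "TNT", "nCT")


def make_expert(weights: List[float], bias: float) -> Callable[[List[float]], float]:
    """Build a toy logistic 'expert' mapping a feature vector to P(pCR)."""
    def expert(features: List[float]) -> float:
        z = bias + sum(w * x for w, x in zip(weights, features))
        return 1.0 / (1.0 + math.exp(-z))
    return expert


# One expert per regimen. All weights are placeholders, NOT values from the paper.
EXPERTS: Dict[str, Callable[[List[float]], float]] = {
    "nCRT": make_expert([0.8, -0.5, 0.1], -1.4),
    "TNT": make_expert([0.6, -0.2, 0.3], -0.8),
    "nCT": make_expert([0.9, -0.7, 0.0], -1.3),
}


def hard_gate(regimen: str) -> List[int]:
    """One-hot gate: exactly one expert is switched on, deterministically."""
    if regimen not in REGIMENS:
        raise ValueError(f"unknown regimen: {regimen}")
    return [1 if r == regimen else 0 for r in REGIMENS]


def predict(features: List[float], regimen: str) -> float:
    """Route the fused features to the single expert selected by the gate."""
    gate = hard_gate(regimen)
    active = REGIMENS[gate.index(max(gate))]  # argmax over the one-hot gate
    return EXPERTS[active](features)


def counterfactual_pcr(features: List[float]) -> Dict[str, float]:
    """Query every expert on the same patient to compare regimens."""
    return {regimen: predict(features, regimen) for regimen in REGIMENS}


if __name__ == "__main__":
    fused = [1.2, 0.4, 1.0]  # hypothetical fused feature vector
    for regimen, p in counterfactual_pcr(fused).items():
        print(f"{regimen}: estimated P(pCR) = {p:.2f}")
```

Because the gate is one-hot, only the selected expert contributes to a prediction. That is what makes it possible to say which expert "spoke" for a patient, and it also means each expert sees only the training patients who actually received its regimen.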

Evaluation uses a random 80/20 split of the retrospective set (618 train / 142 validation), then the 95-patient prospective cohort as test. Reported metrics and methods: AUC, accuracy, sensitivity, specificity, PPV, NPV, decision curve analysis (DCA), inverse probability of treatment weighting (IPTW) by centre and stage, DeLong test for AUC comparisons, McNemar test for paired comparisons. There is no multiple-testing correction, no formal calibration assessment (calibration curve, Brier score, Hosmer-Lemeshow test) and no explicit mention of bootstrap intervals.
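For readers unfamiliar with IPTW, the sketch below shows the idea in its simplest form. The summary above does not say how the propensity scores were estimated, so this assumes a fully stratified estimator: within each (centre, stage) stratum, the propensity of a regimen is simply its observed share of that stratum, and each patient is weighted by the inverse. The cohort is a made-up toy example, not study data.

```python
from collections import Counter
from typing import List, Tuple

Patient = Tuple[str, str, str]  # (centre, clinical stage, regimen received)


def iptw_weights(patients: List[Patient]) -> List[float]:
    """Inverse probability of treatment weights from stratum frequencies.

    propensity = (patients in this stratum on this regimen) / (patients in stratum)
    weight     = 1 / propensity
    """
    stratum_size = Counter((centre, stage) for centre, stage, _ in patients)
    cell_size = Counter(patients)
    weights = []
    for centre, stage, regimen in patients:
        propensity = cell_size[(centre, stage, regimen)] / stratum_size[(centre, stage)]
        weights.append(1.0 / propensity)
    return weights


if __name__ == "__main__":
    # Toy cohort, invented for illustration only.
    cohort: List[Patient] = (
        [("A", "cT3", "nCRT")] * 3
        + [("A", "cT3", "TNT")] * 1
        + [("B", "cT4", "nCRT")] * 2
        + [("B", "cT4", "TNT")] * 2
    )
    for patient, w in zip(cohort, iptw_weights(cohort)):
        print(patient, round(w, 2))
```

After weighting, each regimen counts for the full size of its stratum, so regimens are compared on pseudo-populations with the same centre and stage mix. Patients who received a regimen that is rare in their stratum get large weights, which is why small arms such as the single-centre nCT group can make weighted estimates unstable.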

The results

UNet-MoE-Cli's AUC is 0.827 (95% CI 0.742–0.904) in internal validation and 0.790 (0.667–0.900) in the prospective cohort. Internal comparators — a LightGBM on clinical variables (AUC 0.58–0.64), a ResNet-2D (0.64), a ResNet-3D (0.67–0.60), a UNet alone (0.73–0.65), a soft PoE variant (0.59) — are all beaten, sometimes clearly. By regimen, AUC is 0.80 under TNT, 0.82 under nCRT, 0.75 under nCT.

But the clinically important observation lies in sensitivity: 0.455 in validation, 0.526 in prospective. The model misses half of true responders. The high specificity (0.90–0.96) and decent PPV (0.58–0.77) tell the other side: when the model says "complete response", it is often right. When it says "no", it is usually right too, but only because pCR is uncommon: about one true responder in two ends up in that "no" group.

On the recommendation side, across the combined validation + test cohort (n=237): 53.2% of patients are steered to escalation, 40.9% to maintaining their regimen, and 5.9% (n=14) to de-intensification. This is where critical reading is required. The paper reports that under recommended escalation, observed pCR under the actual regimen is only 11.1%, against a model-estimated pCR of 31.0% under the escalated regimen. The jump looks huge — except the "estimated pCR" is the model's own output applied to its own advice. The comparison is circular: with no randomised arm and no prospective follow-up of a subgroup that actually received the suggested regimen, we cannot tell whether the benefit is real or hallucinated.

The de-intensification subgroup (n=14, observed pCR 92.9%) is more clinically interesting but too small to support conclusions: the confidence interval ranges from 66% to 99% and these patients were already highly selected (low T stage, no EMVI). Kaplan-Meier disease-free survival differences are significant in training (p=0.02) and validation (p=0.03), but not in the prospective test cohort.
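The width of that interval is easy to check. The sketch below computes an exact (Clopper-Pearson) binomial interval using only the standard library. It assumes that 92.9% of 14 corresponds to 13 responders, and it is only one of several possible interval methods; the one behind the quoted range is not stated here.

```python
from math import comb
from typing import Callable, Tuple


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k + 1))


def bisect_root(f: Callable[[float], float], lo: float, hi: float,
                tol: float = 1e-10) -> float:
    """Root of a monotonic function whose sign differs at lo and hi."""
    sign_lo = f(lo) > 0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if (f(mid) > 0) == sign_lo:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def clopper_pearson(x: int, n: int, alpha: float = 0.05) -> Tuple[float, float]:
    """Exact two-sided (1 - alpha) confidence interval for a proportion x/n."""
    if x == 0:
        lower = 0.0
    else:
        # Lower bound: the p at which P(X >= x) equals alpha / 2.
        lower = bisect_root(lambda p: (1.0 - binom_cdf(x - 1, n, p)) - alpha / 2.0,
                            0.0, 1.0)
    if x == n:
        upper = 1.0
    else:
        # Upper bound: the p at which P(X <= x) equals alpha / 2.
        upper = bisect_root(lambda p: binom_cdf(x, n, p) - alpha / 2.0, 0.0, 1.0)
    return lower, upper


if __name__ == "__main__":
    low, high = clopper_pearson(13, 14)  # 13/14 = 92.9%, an assumed reading of n=14
    print(f"13/14 responders: 95% CI {low:.1%} to {high:.1%}")
```

With 13 of 14, the lower bound lands near 66%, consistent with the range quoted above. A true pCR rate of roughly two in three would therefore still be compatible with this subgroup.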

Clinical translation. Across 1,000 LARC patients to whom this model would be applied pre-treatment, around 530 would be offered escalation to TNT or an intensified regimen, and 60 would be offered de-intensification. Among the 200 true responders (average pCR rate 20%), the model would correctly identify between 90 and 105, meaning it would miss 95 to 110 patients who would have responded and for whom de-intensification would have been legitimate. Conversely, among the 800 non-responders, it would correctly classify 720 to 770, the group in which escalation is a reasonable question, and wrongly flag 30 to 80 as likely responders. The risk-benefit ratio therefore depends on the clinical value placed on avoiding over-treatment (toxicity, infertility, functional impairment) versus the risk of de-intensifying a patient wrongly labelled a responder.
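The per-1,000 arithmetic can be reproduced in a few lines. The sketch below takes the 20% base rate and the reported sensitivity and specificity values as inputs. Which specificity goes with which cohort is not stated above, so all four combinations are shown; that pairing, and the function name, are assumptions made for illustration.

```python
from typing import Dict


def per_cohort_counts(n: int, prevalence: float,
                      sensitivity: float, specificity: float) -> Dict[str, float]:
    """Expected confusion-matrix counts, plus the PPV and NPV they imply."""
    responders = n * prevalence
    non_responders = n - responders
    tp = sensitivity * responders        # responders correctly identified
    fn = responders - tp                 # responders missed
    tn = specificity * non_responders    # non-responders correctly identified
    fp = non_responders - tn             # non-responders labelled responders
    return {
        "tp": tp, "fn": fn, "tn": tn, "fp": fp,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


if __name__ == "__main__":
    for cohort, sens in (("validation", 0.455), ("prospective", 0.526)):
        for spec in (0.90, 0.96):  # pairing with each cohort is an assumption
            c = per_cohort_counts(1000, 0.20, sens, spec)
            print(f"{cohort} sens={sens} spec={spec}: "
                  f"TP={c['tp']:.0f} FN={c['fn']:.0f} "
                  f"TN={c['tn']:.0f} FP={c['fp']:.0f} "
                  f"PPV={c['ppv']:.2f} NPV={c['npv']:.2f}")
```

Under these assumptions the implied NPV stays in the high 0.80s whichever pairing is used. A negative prediction is therefore usually correct, even though about half of the true responders receive one. The two statements are compatible because responders make up only a fifth of the cohort.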

What's good

Three specific strengths.

The per-regimen mixture-of-experts architecture is elegant and suited to the problem. Rather than asking a single network to learn response to all protocols, the model isolates one sub-network per regimen, which reduces the risk of averaging therapeutic effects and allows estimation of counterfactual probabilities specific to each option. Hard gating makes inference interpretable: we know which expert spoke for which patient. Conceptually well posed for a personalised treatment-selection problem.

The pre-registered prospective cohort is a real methodological step. ChiCTR2400085797 was registered on 18 June 2024, before prospective data collection. The 95 patients from July 2024 to January 2025 were evaluated blind to the model. Sturdier than mere cross-validation, even if the cohort stays small and limited to three centres in the same cultural region.

The panel of internal comparators is thorough. The authors test a clinical LightGBM, a ResNet-2D, a ResNet-3D, a UNet alone, a soft MoE variant, a PoE variant — all beaten. The ablation shows that MoE + clinical variables + multimodal MRI is necessary for performance. This ablation discipline is too often absent from competing papers.

What's less good

Three specific limitations.

The 0.45–0.53 sensitivity undercuts the clinical value for de-intensification support. This is the classic misleading-metric failure mode: an AUC of 0.80 sounds good, but AUC summarises ranking across all thresholds and says nothing about the operating point actually used. With a positive class (pCR) of only about 20% of cases, the chosen threshold makes the model excellent at saying "non-responder" (specificity up to 0.96) and mediocre at identifying true responders. For a tool whose central argument is to de-intensify in responders, that is exactly the wrong asymmetry. A PPV of 0.58 in validation means roughly four in ten patients labelled "likely responder" will not respond, a non-negligible risk of inappropriate de-intensification.

The proof of escalation efficacy is circular. The paper's key table compares observed pCR under the actual regimen with the model-estimated pCR under the recommended regimen. The jump from 11% to 31% is not an experimental measurement; it is the prediction of a model assessing its own prescription. Without a pragmatic randomised trial assigning patients to "multidisciplinary team (MDT) decision" vs "MDT decision + model", it is impossible to know whether escalation truly improves response, or whether the model simply errs the same way in both directions.

The nCT expert is trained on a single centre and the cohort is 100% Chinese. This is population bias on two axes: geographical and ethnic. The neoadjuvant protocols used (standard CAPOX, concurrent capecitabine) differ from those validated in Europe (FOLFIRINOX in PRODIGE-23) or the United States (FOLFOX in OPRA). Molecular marker distributions (MSI-H, KRAS, BRAF) vary across populations. No Western validation cohort is shown. Until that generalisation is demonstrated, the model can only be assumed to apply to patients from a comparable Chinese population receiving the standard Chinese regimen panel.

What it changes

For the AI-oncology research community, the paper formalises a useful approach: modelling regimen-specific response rather than an agnostic risk score. The MoE-per-treatment architecture is transposable to other pathologies where several competing protocols coexist (neoadjuvant breast cancer therapy, Hodgkin lymphoma, leukaemias). Three expected consequences: future submissions to npj Digital Medicine and Radiology: Artificial Intelligence should include explicit counterfactual comparisons; the community needs a standard for evaluating these models other than with their own output; regulators (FDA SaMD, EMA) will have to clarify the status of a "regimen recommender" versus a simple "risk predictor".

For oncologists and multidisciplinary teams in rectal cancer, the operational message is patience. The tool is not ready for clinical use: no Western validation, no pragmatic randomised trial, GitHub code announced but not public at publication, data shared "upon reasonable request" (the usual reproducibility red flag). At minimum, a prospective phase II SMART-type trial (Selection of Multimodal Adjuvant Regimen by Tool) comparing "MDT + model" vs "MDT alone" on hard endpoints (DFS, OS, quality of life) will be needed before any deployment. pCR remains an imperfect intermediate endpoint for disease-free and overall survival.

For patients and the public, the takeaway is that part of next decade's precision oncology is being built today on this type of algorithm. The promise — less over-treatment, less under-treatment — is credible and worth pursuing. But moving from a published AUC to a shared clinical decision will require years of comparative trials. Any patient who, in the near future, is offered an algorithmic regimen recommendation should ask: on what cohort was the model validated? what is its sensitivity in true responders? has it been tested on patients like me?

Further reading

The full article is open access on npj Digital Medicine: nature.com/articles/s41746-026-02798-w. The prospective trial registration is at the Chinese Clinical Trial Registry, ChiCTR2400085797. Code is announced at github.com/LiM2D/RCRS upon acceptance (to verify). For context on TNT trials in rectal cancer, see OPRA (J Clin Oncol 2022), RAPIDO (Lancet Oncol 2021) and PRODIGE-23 (Lancet Oncol 2021). For our coverage of clinical model failure modes, see our analysis of the Restrepo 2026 study on clinical VLMs.