Medical AI

SHAP and SVM to predict deep venous thrombosis after endometrial cancer surgery (Zhou 2026, npj Digital Medicine)

Published on May 27, 2026

Qing Zhou, Fudan Liu, Donghong Wang and colleagues (Zunyi Medical University, Guizhou, and Naval Medical University, Shanghai) published in npj Digital Medicine on 27 May 2026 an explainable machine learning model that predicts lower-extremity deep venous thrombosis (LEDVT) after surgery for endometrial cancer, developed on 841 patients in the derivation cohort and tested on 95 in external validation. The final model is a Support Vector Machine (SVM) with four features (postoperative D-dimer, age, fibrinogen, FIGO clinical stage), reaching an AUC of 0.828 in internal validation and 0.819 externally, supplemented with SHAP explanations that decompose every individual prediction. The paper is an important read because it shows how interpretable tooling is maturing in perioperative oncology, but it should be handled carefully: imaging was symptom-triggered (a detection bias acknowledged by the authors), the cohort is entirely Chinese, postoperative D-dimer is measured 24 to 48 hours after surgery (sometimes after silent thrombus onset), and no head-to-head comparison with the Caprini or Wells scores is reported.

The context

Endometrial cancer is the most common pelvic gynaecological cancer in high-income countries; the standard of care remains staging surgery (total hysterectomy with bilateral salpingo-oophorectomy, possibly with lymphadenectomy). Lower-extremity deep venous thrombosis (LEDVT) is a classic postoperative complication that can progress to fatal pulmonary embolism if missed. Prevention today relies on static clinical scores — Caprini, Wells, Khorana — combining a handful of factors (age, history, BMI, anaesthesia, surgical type) and triggering pharmacological or mechanical prophylaxis.

The well-documented problem is that these scores were developed on mixed cohorts (general surgery, orthopaedics, internal medicine) and perform poorly in gynaecologic oncology. They also integrate neither dynamic postoperative biomarkers (D-dimer especially) nor tumour-specific characteristics (FIGO stage, lymphovascular space invasion). Hence the surge, since 2020, in machine learning models exploiting the full perioperative EHR. This paper occupies a well-defined niche: a model dedicated to endometrial cancer surgery, individualised prediction, and crucially SHAP-based interpretability to get past the "black-box" barrier that still slows clinical adoption.

The method

The study is led by Lin Xu (Key Laboratory of Cancer Prevention and Treatment of Guizhou Province), Yonghu Chang (School of Medical Information Engineering, Zunyi Medical University) and Donghong Wang (Department of Obstetrics and Gynecology, Affiliated Hospital of Zunyi). It was published on 27 May 2026 in npj Digital Medicine, DOI 10.1038/s41746-026-02782-4, under CC BY 4.0. Funding is public and Chinese (Qiankehe programmes, Guizhou Provincial Health Commission). The authors declare no financial or non-financial conflicts of interest. The code is announced as open on github.com/cyh407; the data remain available "on reasonable request" with a data-use agreement.

The retrospective dataset comprises 841 patients who underwent surgery for endometrial cancer between October 2011 and March 2026 across five hospitals in Guizhou Province (Affiliated Hospital of Zunyi, Guizhou Provincial People's Hospital in Guiyang, Yanhe Tujia Autonomous County People's Hospital, Third Affiliated Hospital of Zunyi, Liupanshui Maternal and Child Health Hospital). The endpoint, "postoperative LEDVT", covers any deep venous thrombosis occurring within 30 days, confirmed by colour Doppler ultrasound or CT venography. Among the 841 patients, 72 (8.6%) developed LEDVT. The derivation cohort is split 80/20 (training n=673, internal validation n=168); an independent external cohort of 95 patients recruited between April 2025 and March 2026 is used as the test set.

Twenty-seven perioperative variables are retained after multicollinearity filtering (Cramér's V for discrete, Pearson for continuous). Twenty-six classification algorithms are benchmarked (NearestCentroid, BernoulliNB, RandomForest, AdaBoost, SVM, Logistic Regression, XGBoost, LightGBM, etc.) under five rebalancing strategies (none, random oversampling, SMOTE, SMOTE-Tomek, ADASYN). Random oversampling — which simply duplicates minority-class samples — is selected as the optimal strategy based on mean AUC. Stratified 5-fold cross-validation tunes hyperparameters, with rebalancing applied strictly within training folds to avoid leakage to validation folds.
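To make the anti-leakage point concrete, here is a minimal sketch of random oversampling applied only inside the training part of each stratified fold. It uses the standard library alone; the helper names (`stratified_folds`, `random_oversample`) and the toy labels are illustrative assumptions, not the authors' code. In a real pipeline one would more likely rely on a dedicated library such as imbalanced-learn.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Assign each sample index to one of k folds, class by class,
    so that every fold keeps roughly the overall class ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for idx in by_class.values():
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def random_oversample(indices, labels, seed=0):
    """Duplicate minority-class indices (sampling with replacement)
    until both classes have the same size."""
    rng = random.Random(seed)
    pos = [i for i in indices if labels[i] == 1]
    neg = [i for i in indices if labels[i] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return indices + extra


# Toy labels matching the reported prevalence: 72 events among 841 patients.
labels = [1] * 72 + [0] * 769
folds = stratified_folds(labels, k=5, seed=42)

for f, val_idx in enumerate(folds):
    train_idx = [i for g, fold in enumerate(folds) if g != f for i in fold]
    balanced = random_oversample(train_idx, labels, seed=f)
    # The validation fold is never resampled and shares no index with training.
    assert set(balanced).isdisjoint(val_idx)
    n_pos = sum(labels[i] for i in balanced)
    print(f"fold {f}: train {len(train_idx)} -> {len(balanced)} rows "
          f"({n_pos} positives), validation {len(val_idx)} rows")
```

The key design choice is the order of operations: split first, then oversample. Oversampling before splitting would place copies of the same patient on both sides of the split and inflate the validation AUC.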

Recursive Feature Elimination (RFE) is then applied to the six most stable models. The SVM achieves the best performance/parsimony trade-off with only four features: postoperative D-dimer (measured 24 to 48 hours after surgery), age, fibrinogen, and FIGO clinical stage. The Support Vector Machine (SVM) is a classifier that fits an optimal separating hyperplane in a transformed feature space; its decisions are commonly treated as a "black box". To address that opacity, the authors apply SHAP (SHapley Additive exPlanations), a game-theoretic method that assigns each feature a quantified contribution to an individual prediction and aggregates them into a global importance ranking. SHAP dependence plots visualise the non-linear association of each feature with predicted risk.
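With only four features, Shapley values can be computed exactly by enumerating all feature coalitions, which makes the idea easy to illustrate. The sketch below is an assumption-laden toy: `toy_risk` is a hypothetical risk function on standardised inputs (not the paper's SVM), its coefficients are invented, and "absent" features are replaced by a single baseline value, whereas SHAP implementations typically average over a background dataset.

```python
from itertools import combinations
from math import exp, factorial

FEATURES = ["d_dimer", "age", "fibrinogen", "figo_stage"]


def toy_risk(x):
    """Hypothetical risk model (illustration only, NOT the published SVM).
    age enters squared (U-shape), fibrinogen only above a threshold of 2."""
    z = (-2.4
         + 0.9 * x["d_dimer"]
         + 0.4 * x["age"] ** 2
         + 0.5 * max(0.0, x["fibrinogen"] - 2.0)
         + 0.3 * x["figo_stage"])
    return 1.0 / (1.0 + exp(-z))


def shapley_values(model, x, baseline):
    """Exact Shapley values: weighted average of each feature's marginal
    contribution over all coalitions of the other features."""
    n = len(FEATURES)
    phi = {}
    for f in FEATURES:
        others = [g for g in FEATURES if g != f]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                # Features outside the coalition take their baseline value.
                without = {g: (x[g] if g in coalition else baseline[g])
                           for g in FEATURES}
                with_f = dict(without)
                with_f[f] = x[f]
                total += weight * (model(with_f) - model(without))
        phi[f] = total
    return phi


baseline = {f: 0.0 for f in FEATURES}  # assumed "average" standardised patient
patient = {"d_dimer": 1.5, "age": 1.2, "fibrinogen": 2.5, "figo_stage": 1.0}

phi = shapley_values(toy_risk, patient, baseline)
for name, value in phi.items():
    print(f"{name:>10}: {value:+.4f}")

# Efficiency property: contributions sum to prediction minus baseline prediction.
gap = toy_risk(patient) - toy_risk(baseline)
print(f"sum of contributions = {sum(phi.values()):.4f}, gap = {gap:.4f}")
```

The efficiency property checked at the end is what a force plot displays: the individual contributions add up to the difference between the patient's predicted risk and the baseline risk.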

The results

Reported ML performance is as follows: AUC = 0.823 on the training set, 0.828 (95% CI 0.706–0.905) on internal validation, 0.819 on the external cohort. Calibration is described as good on both held-out sets (calibration plots shown, without Hosmer-Lemeshow test or Brier score). Decision Curve Analysis (DCA) shows positive net benefit over a risk-threshold range of 5% to 52%. No sensitivity, specificity, or positive or negative predictive value at the operating threshold is reported in the main text, a notable gap for a tool meant to trigger prophylaxis.
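For readers unfamiliar with DCA, net benefit at a risk threshold p_t is usually defined as TP/n minus FP/n weighted by p_t/(1 - p_t). The sketch below computes it on a tiny invented sample; the labels and probabilities are made up for illustration and have no relation to the paper's data.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of treating every patient whose predicted risk is at or
    above the threshold: TP/n - FP/n * threshold / (1 - threshold)."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1.0 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Reference strategy: give prophylaxis to everyone."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# Invented toy sample: 2 events among 10 patients.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_prob = [0.60, 0.20, 0.30, 0.10, 0.05, 0.05, 0.04, 0.03, 0.02, 0.02]

for t in (0.05, 0.10, 0.20):
    print(f"threshold {t:.2f}: model {net_benefit(y_true, y_prob, t):+.4f}, "
          f"treat-all {net_benefit_treat_all(y_true, t):+.4f}")
```

A model is useful at a given threshold only if its net benefit exceeds both the treat-all curve and zero (treat none). Note that treat-all has exactly zero net benefit when the threshold equals the prevalence.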

SHAP analysis surfaces qualitatively useful relationships. Postoperative D-dimer shows a monotonic positive association with risk (mean |SHAP| = 0.06, the highest). Age shows a U-shape: extreme values (young patients with aggressive tumour biology, or elderly patients with endothelial dysfunction) both raise risk, while mid-range values are neutral. Fibrinogen is protective at low values, then becomes a risk factor beyond a standardised threshold of roughly 2. FIGO stage increases risk monotonically. The model is then wrapped in a web prototype that returns an individual probability and a SHAP force plot from the four input values.

Clinical translation. Per 1,000 patients operated on for endometrial cancer, about 86 would develop symptomatic LEDVT within 30 days of surgery at the observed base rate. At an 8% DCA threshold, the model would likely flag 200 to 300 as "high risk" (the exact count is not provided). Since true positives cannot exceed the 86 events, at most 29% to 43% of flagged patients would be true positives, and fewer if sensitivity is below 100%. In practice, deployed as is, the tool would direct enhanced prophylaxis (extended low-molecular-weight heparin, intermittent pneumatic compression, protocolised early mobilisation) to roughly one patient in four, and spare the remaining three-quarters from systematic prophylaxis. But the exact translation depends on the threshold chosen and the relative cost of false positives vs. false negatives, which the paper leaves to clinicians.
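The arithmetic behind this translation can be checked in a few lines. The number of true positives is capped by the number of events, which bounds the positive predictive value for any assumed number of flagged patients. The flagged counts of 200 and 300 are this article's estimate, not figures from the paper.

```python
def ppv_upper_bound(n_events, n_flagged):
    """Best-case positive predictive value: every event is among the flagged."""
    return min(n_events, n_flagged) / n_flagged


# Observed base rate: 72 LEDVT among 841 patients, scaled to 1,000 patients.
events_per_1000 = round(1000 * 72 / 841)

# Assumed range of flagged patients per 1,000 (not reported in the paper).
bounds = {flagged: ppv_upper_bound(events_per_1000, flagged)
          for flagged in (200, 300)}

print(f"expected events per 1,000 patients: {events_per_1000}")
for flagged, bound in bounds.items():
    print(f"{flagged} flagged -> PPV at most {bound:.1%}")
```

These are upper bounds reached only at 100% sensitivity; the actual positive predictive value cannot be derived without the sensitivity and specificity that the paper does not report.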

What is good

Three specific strengths.

The internal-comparator methodology is rigorous. The authors benchmark 26 algorithms under 5 rebalancing strategies, with stratified 5-fold cross-validation and rebalancing isolated inside training folds. This anti-leakage discipline is explicit in the Methods — too many competing papers omit it. The multi-model RFE strengthens the final four-variable choice: no single model decides; a consensus does.

The interpretability effort is serious and operational. SHAP is not a post-hoc decoration here: the authors extract a clinical reading of non-linear associations (U-shape for age, threshold for fibrinogen) and ship a web prototype with individual force plots. This addresses a real clinical demand, since clinicians routinely reject non-explainable models even at high AUC. The agreement between SHAP-extracted associations and known pathophysiology (D-dimer = fibrinolytic activation, fibrinogen = inflammation/hypercoagulability) supports plausibility.

External validation exists and the cohort is multicentric. Five hospitals contribute to derivation, and a more recent subset (April 2025 – March 2026, n=95) serves as the external test. The nearly identical AUC between internal (0.828) and external (0.819) is a strong signal that the model is not grossly overfitted to the lead hospital. The Python code is announced open on GitHub, which would at least allow computational reproduction.

What is less good

Three specific limitations.

Imaging is symptom-triggered — a major detection bias that changes the nature of the target. The authors acknowledge this in the discussion: Doppler ultrasound and CT venography were not systematic but performed on clinical signs or laboratory abnormalities. The "LEDVT" label in the dataset is therefore not "every LEDVT occurring" but "symptomatic LEDVT detected by routine practice". Asymptomatic thromboses — which may be the majority in surgical series — are missing. This is a textbook shortcut learning failure mode: the model learns to predict the combination "patient whom clinicians decided to image" more than the pathology itself. Any generalisation to a systematic-screening setting would require prospective re-validation with protocolised imaging.

The absence of head-to-head comparison with existing scores is hard to justify. Caprini, Wells and Khorana are named in the introduction as the benchmark to beat, but no table reports their AUC on this cohort or tests the difference against the four-variable SVM. Even more problematic: logistic regression is among the six stable models and uses the same four variables, but its final figures are not directly compared with the SVM's. Given that the four retained predictors (D-dimer, age, fibrinogen, stage) are continuous or ordinal variables for which logistic regression is usually competitive, the claim that the SVM adds value is not demonstrated. This is the biased-comparator failure mode.
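One common way to run the missing comparison is a paired bootstrap on the AUC difference, scoring the same patients with both tools. The sketch below is a minimal version under stated assumptions: the scores are synthetic random draws, the variable names (`model_score`, `clinical_score`) are hypothetical, and a percentile interval is used rather than a formal test such as DeLong's.

```python
import random


def auc(y_true, scores):
    """AUC as the probability that a random event outranks a random non-event
    (ties count for one half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_diff(y_true, score_a, score_b, n_boot=1000, seed=0):
    """95% percentile interval for AUC(a) - AUC(b), resampling patients
    with replacement and scoring both tools on the same resample."""
    rng = random.Random(seed)
    n = len(y_true)
    diffs = []
    while len(diffs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [y_true[i] for i in idx]
        if 0 < sum(ys) < n:  # both classes are needed to compute an AUC
            a = auc(ys, [score_a[i] for i in idx])
            b = auc(ys, [score_b[i] for i in idx])
            diffs.append(a - b)
    diffs.sort()
    return diffs[n_boot * 25 // 1000], diffs[n_boot * 975 // 1000 - 1]


# Synthetic example sized like the external cohort: 95 patients, 8 events assumed.
rng = random.Random(1)
y = [1] * 8 + [0] * 87
model_score = [rng.gauss(1.0 if t else 0.0, 1.0) for t in y]
clinical_score = [rng.gauss(0.5 if t else 0.0, 1.0) for t in y]

low, high = bootstrap_auc_diff(y, model_score, clinical_score)
print(f"AUC model {auc(y, model_score):.3f}, "
      f"AUC clinical score {auc(y, clinical_score):.3f}, "
      f"95% interval for the difference [{low:+.3f}, {high:+.3f}]")
```

With so few events, such an interval is typically wide, which is itself informative: a cohort of 95 patients may not be able to show that one tool outperforms another.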

The cohort is entirely Chinese and the "predictive" measurement is postoperative. All 841 + 95 patients come from Guizhou or Shanghai. No Western validation is presented. The FIGO-stage distribution, the median age at surgery (53 years), and the prophylaxis protocols differ from those observed in Europe or North America. That is the classic population bias. To this is added a design weakness: D-dimer is measured 24 to 48 hours after surgery. By then, a silent thrombus may already have started forming, and D-dimer functions as much as an early marker of the event being detected as a predictor of an event still to come. The tool is therefore less a "preoperative predictor" than an "early postoperative detection aid" — useful, but in a different use case than the introduction implies.

What changes

For the perioperative AI research community, the paper confirms a deep trend: since 2024, postoperative-complication prediction models almost systematically include an interpretability layer (SHAP, LIME, attention maps). What distinguishes this work is its parsimony (down to four variables) and the delivery of a web prototype. Three consequences can be expected: future competing papers will need to include a head-to-head comparison with validated clinical scores; the SHAP community will need to clarify the limits of interpretability when features are strongly correlated (as D-dimer and fibrinogen are); and regulators will need to rule on the Software as a Medical Device (SaMD) status of a web interface that returns individual probabilities.

For gynaecologic oncologists and perioperative surgical teams, the message is cautious optimism. In its current form, the model is not ready for broad clinical deployment: no Western validation, acknowledged detection bias, no Caprini/Wells comparison, data only "on request". A prospective study with protocolised imaging in every patient would be needed to estimate the true performance of the model, followed by head-to-head comparison with standard scores on hard endpoints (confirmed LEDVT, pulmonary embolism, prophylaxis-related bleeding). In the meantime, the paper's main value is pedagogical: it documents good practice for an interpretable ML pipeline that other teams can replicate.

For patients and the public, the useful lesson is that precision perioperative medicine is arriving: a model that helps decide, patient by patient, whether anti-thrombotic prophylaxis should be intensified or relaxed. But introducing such a tool in a perioperative consultation will require honest disclosure: it is a probabilistic decision aid based on a specific cohort, not an individual certainty. A patient told that her "risk is calculated at 12%" has the right to know which population the model was validated on, whether she resembles that population, and what standard prophylaxis without the model would have been. SHAP transparency on the clinician side has value only if it translates into transparency on the patient side.