Decryptions

All scientific publication decryptions on Tatakoto.

June 5, 2026 · 10 min

BreastGPT: one multimodal model for the entire breast cancer care pathway — what a 90% score on a home-made benchmark is really worth (Liu et al. 2026, arXiv)

Critical analysis of the preprint posted to arXiv on 3 June 2026 by Yang Liu and colleagues (Alibaba DAMO Academy, Zhejiang University, Hupan Lab, West China Hospital, China Medical University): BreastGPT, an 8-billion-parameter multimodal large language model claimed to cover the entire breast cancer care pathway (screening, diagnosis, treatment planning) across five imaging modalities (mammography, ultrasound, MRI, CT, pathology slides) and text. Trained on 1.86 million question-answer pairs largely built by Alibaba's own large models, it reaches 75.66% accuracy on multiple-choice questions and 89.92% on open-ended questions on its own BreastStage-Bench. A genuine engineering feat, but most of its lead over other models comes from training on the exact test distribution: the fair comparator gains only a few points, nothing was evaluated on real patients or compared against clinicians, and the training corpus itself is largely machine-generated in-house.

Medical AI
June 4, 2026 · 11 min

MCEN: predicting complete response to breast cancer chemotherapy from a biopsy, with the Mamba architecture (Zhang et al. 2026, npj Digital Medicine)

Critical analysis of the article published on 2 June 2026 in npj Digital Medicine by Wenchuan Zhang, Shuwan Zhang, Fengling Li, Qingjie Lv, Yuhao Yi and Hong Bu (West China Hospital, Sichuan University, and colleagues): MCEN, a Mamba-based deep learning model that predicts, from a needle biopsy read as a digital slide, whether a breast cancer patient will achieve a pathological complete response after neoadjuvant chemotherapy. Trained on 1,023 patients from one Chinese hospital, then tested on four independent centers (1,646 patients in total), it reaches an AUROC of 0.923 in training but falls to 0.76–0.81 on external validation, rising to 0.84 once clinicopathological data are fused in. Strong for its genuine multicenter validation and Mamba's efficiency on gigapixel images, the work remains limited by a marked train–validation gap, an exclusively Chinese cohort, exclusions that drop atypical forms, and no comparison against pathologists.
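For readers comparing the AUROC figures quoted in these summaries (0.923 in training against 0.76–0.81 externally for MCEN), a minimal sketch of what the metric measures may help: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The function name and the toy labels and scores below are illustrative assumptions, not data or code from the paper.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Runs in O(n_pos * n_neg),
    which is fine for a toy example but slow for large cohorts.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = complete response, 0 = residual disease.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.10, 0.40, 0.35, 0.80]
toy_auroc = auroc(toy_labels, toy_scores)  # 3 of the 4 pairs are ranked correctly
print(f"toy AUROC = {toy_auroc:.2f}")
```

Read this way, 0.5 is chance-level ranking and 1.0 is perfect ranking, so the drop from training to external validation means a noticeably larger share of responder/non-responder pairs is mis-ranked outside the training hospital.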

Medical AI
June 3, 2026 · 12 min

SKELEX: a foundation model trained on 1.3 million radiographs to read bone, from cyst to fracture (Kim et al. 2026, npj Digital Medicine)

Critical analysis of the article published on 2 June 2026 in npj Digital Medicine by Shinn Kim, Soobin Lee, Ilkyu Han, Sunghoon Kwon and colleagues at Seoul National University: SKELEX, presented as the first large-scale foundation model dedicated to musculoskeletal radiographs. A masked autoencoder with a ViT-Large backbone is pre-trained with self-supervision on 1,296,540 unlabeled radiographs from a single Korean hospital (2010-2016), then adapted to 12 diagnostic tasks across 7 public datasets. It beats five baselines by 6.21% on average (a relative gain), reaches an AUROC of 0.953 vs 0.884 for its own initialization model on bone tumor detection, is better calibrated, and matches the best models with half the labels. Convincing on label efficiency and methodological hygiene, the work is limited by single-center, single-country training data, genuine external validation restricted to the bone-tumor application alone, no comparison against radiologists, a resolution reduced to 224×224, and weights released for academic use only.
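Because the 6.21% average is a relative gain, it can be useful to see how relative and absolute differences diverge on the one pair of AUROC values quoted above (0.953 vs 0.884). This is a small arithmetic sketch; the helper names are illustrative, and the paper's own averaging across tasks and baselines is not reproduced here.

```python
def absolute_gain(new, base):
    """Difference in raw metric units (here, AUROC points)."""
    return new - base


def relative_gain(new, base):
    """Difference expressed as a fraction of the baseline value."""
    return (new - base) / base


# AUROC values quoted in the summary for bone tumor detection.
skelex_auroc = 0.953
init_auroc = 0.884

abs_gain = absolute_gain(skelex_auroc, init_auroc)
rel_gain = relative_gain(skelex_auroc, init_auroc)

print(f"absolute gain: {abs_gain:.3f} AUROC points")
print(f"relative gain: {rel_gain:.1%}")
```

On this pair the absolute gain is about 0.069 AUROC points and the relative gain about 7.8%; a relative percentage always looks larger than the absolute difference when the baseline metric is below 1.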

Medical AI
June 2, 2026 · 12 min

PINNOCHIO: predicting the post-operative face in orthognathic surgery with a physics-informed network, as accurate as finite elements but in seconds (Lee et al. 2026, arXiv)

Critical analysis of the preprint posted on arXiv on 1 June 2026 (submitted to MICCAI 2026) by Jungwook Lee, Daeseung Kim, Kevin Gu, Zhangfeng Hu, Tianshu Kuang, Finn Hopeman, Michael A.K. Liebschner, Jaime Gateno and Pingkun Yan (Rensselaer Polytechnic Institute, Houston Methodist, Baylor College of Medicine): PINNOCHIO, a physics-informed neural network that predicts how facial soft tissue deforms after the jaws are surgically repositioned, by separating the bone–tissue interface movement from the volumetric hyperelastic deformation. On 40 real clinical cases (pre-operative CT + post-operative 3dMD surface) it matches or beats the reference finite-element simulator on surface fidelity (Chamfer distance 1.12 mm vs 1.30; 86.55% of points within 2 mm vs 80.90%) while running in 3.24 seconds instead of 3.5 hours. Convincing on speed and biomechanical plausibility, the work is limited by a 40-patient cohort, supervision that only covers the outer surface, fixed mechanical parameters shared by all patients, and no released code or weights.
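The two surface-fidelity figures quoted above (Chamfer distance, share of points within 2 mm) and the speed comparison can be illustrated with a minimal sketch. It assumes a symmetric Chamfer distance taken as the average of the two mean nearest-neighbour distances; the preprint's exact definition is not given in this summary, and the function names and toy point sets are hypothetical.

```python
import math


def nearest_distances(source, target):
    """For each point in source, the distance to its closest point in target."""
    return [min(math.dist(p, q) for q in target) for p in source]


def chamfer_distance(a, b):
    """Symmetric Chamfer distance (assumed form): mean of both directed means."""
    d_ab = nearest_distances(a, b)
    d_ba = nearest_distances(b, a)
    return 0.5 * (sum(d_ab) / len(d_ab) + sum(d_ba) / len(d_ba))


def fraction_within(pred, ref, tol_mm=2.0):
    """Share of predicted points lying within tol_mm of the reference surface."""
    d = nearest_distances(pred, ref)
    return sum(x <= tol_mm for x in d) / len(d)


# Hypothetical toy surfaces (coordinates in mm), not data from the study.
predicted = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
reference = [(0.0, 1.0, 0.0), (10.0, 0.0, 0.0), (20.0, 3.0, 0.0)]

cd = chamfer_distance(predicted, reference)
within_2mm = fraction_within(predicted, reference)

# Speed ratio from the runtimes quoted in the summary: 3.5 hours vs 3.24 seconds.
speedup = (3.5 * 3600) / 3.24

print(f"Chamfer distance: {cd:.2f} mm, within 2 mm: {within_2mm:.1%}")
print(f"speedup: about {speedup:.0f}x")
```

The brute-force nearest-neighbour search is quadratic in the number of points; real surface meshes would normally call for a spatial index, which is left out here to keep the sketch dependency-free.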

Medical AI
June 1, 2026 · 11 min

When an LLM must run the interview itself: an exam-inspired benchmark shows interactive diagnostic reasoning degrades performance (Zhan & Gan 2026, arXiv)

Critical analysis of the preprint posted on arXiv on 21 May 2026 by Chen Zhan, Xihe Qiu, Xiaoyu Tan, Xibing Zhuang, Gengchen Ma, Yue Zhang, Shuo Li, Peifeng Liu, Xiaoxiao Ge, Liang Liu and Lu Gan: an "OSCE-inspired" benchmark in which a standardized patient simulator forces fifteen large language models (LLMs) to run the interview themselves, turn by turn, before reaching a diagnosis. Across 468 cases, moving from information served upfront to active history-taking lowers diagnostic accuracy by 12.75% and supporting-evidence quality by 24.36%, with errors driven mainly by premature diagnostic closure and inefficient questioning. The sober, useful takeaway: rankings on static medical multiple-choice exams likely overstate what these models can do in a real consultation. Caveats: the patient simulator is itself algorithmic, the provenance of the cases is not detailed in the accessible abstract (contamination risk), and the figures are reported as relative values without an explicit human comparator.

Medical AI
May 31, 2026 · 12 min

GTBIS: a deep learning model that reads the morphology of combined pulmonary neuroendocrine carcinomas to predict prognosis (Yang & Zhou 2026, npj Digital Medicine)

Critical analysis of the npj Digital Medicine paper of 30 May 2026 by Lin Yang, Ruyu Sheng, Zijian Yang, Shilong Liu and Meng Zhou (National Cancer Center / Cancer Hospital of the Chinese Academy of Medical Sciences in Beijing, Wenzhou Medical University and Harbin Medical University Cancer Hospital): GTBIS, an interpretable deep learning model that reads pathology-slide morphology to distinguish small cell lung carcinoma (SCLC) from large cell neuroendocrine carcinoma (LCNEC), then applies that reading to combined cSCLC-LCNEC tumors to stratify their prognosis. Across multicenter cohorts totaling 670 patients, the model splits chemoradiotherapy-treated combined tumors into a favorable-prognosis SCLC-like subgroup (five-year overall survival 100% vs 39.5%, disease-free survival 87.5% vs 36.0%) and a poor-prognosis LCNEC-like subgroup, and the classification remains an independent prognostic factor in multivariable analysis. But the sample is modest, all centers are Chinese, validation is retrospective without an explicit human comparator, and the CC BY-NC-ND license prohibits adaptation.

Medical AI
May 30, 2026 · 12 min

Pathog-PDx: a machine learning system to identify 22 pediatric respiratory pathogens from the electronic health record (Su 2026, npj Digital Medicine)

Critical analysis of the npj Digital Medicine paper of 29 May 2026 by Dubin Su, Qun Chen, Ruizhi Xu and colleagues (First Affiliated Hospital of Xiamen University, Zhengzhou University, Nanjing University, Shenzhen Second People's Hospital and UIUC): Pathog-PDx, a diagnostic system that combines 42 clinical and laboratory features from the electronic health record to distinguish 22 pathogen subtypes responsible for respiratory infections in hospitalized children. Development cohort of 134,500 children across three centers and two databases, prospective independent validation on 1,338 children, mean AUC 0.88 across the 22 pathogens and 0.95 for influenza virus, public deployment of a web-based decision support tool. But all development centers are Chinese, the human clinical comparator is absent, the CC BY-NC-ND license blocks academic adaptation, and the very nature of the gold standard for 22 classes deserves a separate discussion.

Medical AI
May 29, 2026 · 12 min

EpiVLM: a vision-language model for video seizure detection and classification, from hospital to home (He 2026, npj Digital Medicine)

Critical analysis of the npj Digital Medicine paper of 26 May 2026 by Mengqiao He, Leihao Sha, Pengfei Wei, Lei Chen and colleagues (West China Hospital, Sichuan University and Shenzhen Institutes of Advanced Technology, CAS): EpiVLM, a vision-language model (VLM) that combines clinically structured prompts with video reasoning to recognize five seizure semiologies on 232 video recordings from 127 patients (11,666 annotated segments) drawn from two tertiary centers, unconstrained home recordings and an independent public dataset. Accuracy 0.795–0.947, sensitivity 0.842–0.957, video-level false detections 0.47–2.45%, mean onset-to-detection delay under 6 seconds, with prompts and thresholds fixed a priori without site-specific recalibration. But all tertiary centers are Chinese, the home cohort is barely described in the abstract, there is no head-to-head comparison with human annotators, and one co-author is affiliated with a private LLC (Brain Everest) without a competing-interest declaration.

Medical AI
May 28, 2026 · 12 min

An automated neuroimaging pipeline for personalized post-stroke cognitive prognosis (Brzus 2026, npj Digital Medicine)

Critical analysis of the npj Digital Medicine paper of 27 May 2026 by Michal Brzus, Joseph Griffis, Aaron D. Boes and colleagues (University of Iowa): a fully automated DICOM-to-PDF pipeline that segments ischemic lesions with a 3D Residual U-Net, predicts 28 neuropsychological outcomes via lesion network mapping, and drafts a personalized report via air-gapped LLaMA 3.3 70B in under three minutes. Training on 604 patients from the Iowa Lesion Registry, independent testing on 153 ischemic stroke patients imaged on 17 scanner models. AUCs of 0.74 to 0.90 on five detailed cognitive domains, 96% concordance between predictions from automatic versus manual segmentations. But training and testing from the same center, no clinical comparator (NIHSS, mRS, demographics alone), clinical review of reports by the senior author himself, and four of the seven authors hold the associated patent and co-founded NeuroPred Inc.

Medical AI
May 27, 2026 · 12 min

SHAP and SVM to predict deep venous thrombosis after endometrial cancer surgery (Zhou 2026, npj Digital Medicine)

Critical analysis of the npj Digital Medicine article of 27 May 2026 by Qing Zhou and colleagues: a four-variable SVM model (postoperative D-dimer, age, fibrinogen, FIGO stage) predicts deep venous thrombosis after endometrial cancer surgery, with AUC 0.828 in internal validation and 0.819 in an external cohort across 841 + 95 Chinese patients. SHAP makes contributions interpretable. But symptom-triggered imaging (detection bias), 100% Chinese cohort, no head-to-head comparison with Caprini/Wells scores, and D-dimer measured after surgery — this is more an early-detection aid than a strict prediction.

Medical AI
May 26, 2026 · 11 min

UNet-MoE-Cli: a mixture-of-experts to personalize neoadjuvant therapy in rectal cancer (Liu 2026, npj Digital Medicine)

Critical analysis of the npj Digital Medicine article of 26 May 2026 by Xiangyu Liu and colleagues: UNet-MoE-Cli, a mixture-of-experts deep learning model on multiparametric MRI and clinical variables, estimates regimen-specific pathological complete response probabilities for neoadjuvant therapy in locally advanced rectal cancer. AUC 0.827 in internal validation, 0.790 in a prospective cohort (ChiCTR2400085797), but sensitivity only 0.45–0.53, single-center nCT expert, 100% Chinese cohort, and the escalation benefit is computed by the model itself.

Medical AI
May 25, 2026 · 9 min

When text eats the image: what the Restrepo 2026 study reveals about the contextual fragility of clinical VLMs on MIMIC-CXR

Critical analysis of the arXiv preprint 2605.17436 of 17 May 2026 by David Restrepo (CentraleSupélec-Université Paris-Saclay) and colleagues: eight vision-language models evaluated on 1,000 MIMIC-CXR chest X-rays lose up to 66% of their correct decisions when the clinical text is swapped for that of an opposite-class patient. Image-only tops out at 0.50–0.68, text-only matches multimodal. Even MedGemma, adapted to medical data, collapses. These VLMs are report classifiers disguised as image readers.

Medical AI
May 24, 2026 · 8 min

PromptRad: labelling liver CT reports with only 32 annotated examples, and matching GPT-4

Critical analysis of the May 2026 arXiv preprint 2605.20052 (BioNLP 2026 @ ACL) by Ying-Jia Lin and colleagues (Chang Gung University, Taiwan): a 110-million-parameter PubMedBERT, fine-tuned via prompt-tuning with a UMLS-enriched verbalizer, achieves 89.2% macro F1 on seven categories of liver lesions in CT — from only 32 annotated reports, and with better negation handling than GPT-4.

Medical AI
May 23, 2026 · 9 min

10,000 synthetic cases against four frontier LLMs: what Auger 2026 reveals about the clinical blind spots of Gemini 3 and GPT-5 in multiple sclerosis

Critical analysis of Stephen D. Auger's April 2026 medRxiv preprint (Imperial College London): up to 10,000 synthetic multiple sclerosis cases with ground truth, four frontier models (Gemini 3 Pro/Flash, GPT-5.2/5-mini) evaluated on diagnosis, localization, investigations and management. Diagnostic accuracy does not predict therapeutic safety: Gemini under-uses appropriate corticosteroids, GPT-5 recommends intravenous thrombolysis in nearly one out of ten cases.

Medical AI
May 22, 2026 · 8 min

GPT-4 in radiology: why the format of an LLM's explanation changes physicians' diagnostic accuracy

Decryption of Spitzer et al.'s 2026 npj Digital Medicine paper: a randomized trial with 101 radiologists comparing three formats of GPT-4 explanation. Chain-of-thought adds 12.2 percentage points of accuracy, while differential diagnosis induces automation bias. Implications for the clinical deployment of LLMs.

Medical AI
May 21, 2026 · 8 min

GigaPath in digital pathology: what changes when a foundation model is trained on 1.3 billion tiles

Critical analysis of the Nature 2024 paper on Prov-GigaPath, a transformer foundation model for digital pathology. Architecture, data, performance on 26 cancer benchmarks, and what it really changes for diagnosis.

Medical AI