Medical AI

An automated neuroimaging pipeline for personalized post-stroke cognitive prognosis (Brzus 2026, npj Digital Medicine)

Published on May 28, 2026 · 12 min read

Michal Brzus, Joseph Griffis, Aaron D. Boes and colleagues (University of Iowa) published in npj Digital Medicine on 27 May 2026 a fully automated pipeline that ingests raw DICOM brain MRIs, segments ischemic lesions, predicts 28 individual neuropsychological outcomes via lesion network mapping, and returns a personalized report drafted by an open-weights LLM, typically in under three minutes per patient. The models are trained on 604 patients from the Iowa Lesion Registry and evaluated on an independent cohort of 153 ischemic stroke patients imaged on 17 different scanner models (Siemens, Philips, GE, Olea Medical) between 2002 and 2023. The headline results are AUCs between 0.74 and 0.90 on five detailed cognitive domains, 96% concordance between predictions made from automatic versus manual segmentations, and LLM reports generated air-gapped by LLaMA 3.3 70B with explicit guardrails. They should be read with four major caveats: training and test data come from the same institution (Iowa); no standard clinical comparator (NIHSS, mRS, demographics alone) is reported; the final clinical validation of the reports is performed by the senior author himself; and four of the seven authors are named on the associated pending patent and have co-founded NeuroPred Inc., the commercial startup that will exploit the technology.

The context

Stroke is the world's second leading cause of mortality and the leading cause of acquired disability in adults. Recovery trajectories are highly heterogeneous: two patients with comparable lesion volumes can end up with radically different cognitive sequelae depending on the precise location of the destroyed tissue and on the functional networks it was embedded in. The tools used in routine clinical care, namely the NIHSS score for admission severity, the modified Rankin Scale (mRS) for overall functional disability, and a handful of cognitive screening scales such as the MoCA, remain coarse, almost never account for individual lesion mapping, and offer weak prognostic value for individual cognitive functions.

The field of lesion network mapping, developed largely by Aaron Boes's group at Iowa and Michael Fox's group at Harvard since 2015, offers an alternative: project each individual lesion onto normative structural and functional connectomes to identify not only the damaged tissue but also the network it was interrupting. Several publications from the same group (Bowren et al., Brain 2022; J. Neurosci. 2020) have shown that these maps predict chronic cognitive outcomes better than simple lesion size or coarse location. What remained was to transform this research method, until now manual and demanding, into a deployable clinical tool. That is exactly what the Brzus 2026 paper attempts.

The method

The study is led by Aaron D. Boes, a neurologist at the Carver College of Medicine, University of Iowa, with co-first authors in electrical engineering (Michal Brzus) and neurology (Joseph Griffis, formerly at Omniscient Neurotechnology 2021–2023). Published 27 May 2026 in npj Digital Medicine, DOI 10.1038/s41746-026-02803-2, under CC BY 4.0. Public funding (NIH R01 NS114405, Roy J. Carver Trust, MRI instrument 1S10OD025025-01). Released as an unedited "Article in Press" version, so still subject to revision.

The pipeline chains four components.

First, a DICOM preprocessing module: an in-house classifier (dcm_classifier, published on PyPI) identifies modality and acquisition plane with reported accuracy above 99%. A 3D Residual U-Net handles brain masking (skull stripping) with a mean Dice score of 0.98. The SynthSR tool (Iglesias et al., Science Advances 2023) synthesizes a high-resolution T1 from the available sequences to stabilize registration to the MNI-152 atlas (successful on 99.7% of 2,987 validation images).

Second, ischemic lesion segmentation by a 3D Residual U-Net (trained on roughly 450 Iowa subjects + 250 subjects from the ISLES 2022 challenge), using only diffusion sequences (DWI + ADC); the authors verified that adding T1, T2 or FLAIR brings no statistically significant improvement.

Third, cognitive prediction via the Iowa Brain-Behavior Modeling Toolkit (Griffis et al., Human Brain Mapping 2024): 28 binary Partial Least Squares classification models (impaired / unimpaired), each combining three representations (voxel-wise lesion mask; structural lesion network map, sLNM, computed on the HCP MGH 32-fold connectome via Lead-DBS; and functional lesion network map, fLNM, computed on the GSP-1000 normative sample) aggregated by a ridge logistic regression that also includes age and education.

Fourth, a report module that feeds the predictions and anatomical mapping to LLaMA 3.3 70B, hosted locally via Ollama in an isolated Docker container with no internet access, which formats a readable PDF (SMOG reading grade 6.6, i.e. US 6th–7th grade level), DICOM-encapsulated and returned to the PACS.
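The aggregation step of the cognitive prediction component can be pictured as a small stacking model. The sketch below is not the authors' code (their implementation is the Iowa Brain-Behavior Modeling Toolkit); it fits an L2-penalized logistic regression by plain gradient descent on invented data, where each row holds three hypothetical sub-model scores (lesion mask, sLNM, fLNM) plus standardized age and education. Every name, value and hyperparameter here is an assumption chosen for illustration.

```python
import math

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_ridge_logistic(X, y, l2=0.1, lr=0.1, epochs=2000):
    """Fit weights and intercept by batch gradient descent.

    The L2 (ridge) penalty shrinks the weights but not the intercept.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for row, label in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, row))) - label
            grad_b += err
            for j in range(d):
                grad_w[j] += err * row[j]
        w = [wj - lr * (gj / n + l2 * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_proba(w, b, row):
    # Predicted probability of impairment for one patient.
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, row)))

# Invented rows: [lesion_score, slnm_score, flnm_score, age_z, education_z]
X = [
    [0.9, 0.8, 0.7, 1.0, -0.5],
    [0.8, 0.9, 0.6, 0.5, -1.0],
    [0.7, 0.6, 0.8, 1.2, 0.0],
    [0.2, 0.1, 0.3, -0.8, 0.6],
    [0.1, 0.3, 0.2, -1.0, 1.0],
    [0.3, 0.2, 0.1, 0.0, 0.4],
]
y = [1, 1, 1, 0, 0, 0]  # 1 = impaired, 0 = unimpaired

w, b = fit_ridge_logistic(X, y)
p_high = predict_proba(w, b, [0.85, 0.85, 0.75, 0.8, -0.6])
p_low = predict_proba(w, b, [0.15, 0.15, 0.20, -0.7, 0.7])
print(round(p_high, 2), round(p_low, 2))
```

The ridge penalty matters in this setting because the three imaging scores are strongly correlated with one another; shrinking the weights keeps any single representation from dominating on a small training set.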

Training of the cognitive models uses 604 patients from the Iowa Lesion Registry (mixed etiology: stroke, but also tumours and trauma, a limitation the authors acknowledge) with neuropsychological evaluations at least three months after the lesion for 98.7% of them. The end-to-end evaluation runs on 153 ischemic stroke patients from the Benton Neuropsychology Clinic (still University of Iowa), imaged within one week of the stroke between 2002 and 2023 on 17 different scanner models from four manufacturers at 1.5 T and 3 T.

The results

Segmentation detects 93% of lesions larger than 1 cm³ and 98% of those larger than 2.5 cm³, with a mean Dice score of 0.69 (0.74 on post-2015 scanners), comparable to the top systems from the ISLES 2022 challenge. The headline 96% concordance refers to the predicted cognitive classifications derived from automatic versus expert-traced segmentations (681 individual predictions across 57 patients) — not to raw segmentation agreement, a distinction that is easily lost in casual reading.
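The two agreement measures in this paragraph answer different questions, and a few lines of code make the difference concrete. In the sketch below, masks are represented as sets of voxel coordinates and predictions as lists of 0/1 labels; the toy values are invented and are not taken from the paper.

```python
def dice(mask_a, mask_b):
    """Dice overlap between two binary masks given as sets of voxel coordinates."""
    if not mask_a and not mask_b:
        return 1.0  # two empty masks agree perfectly by convention
    return 2 * len(mask_a & mask_b) / (len(mask_a) + len(mask_b))

def concordance(preds_a, preds_b):
    """Fraction of paired binary predictions that are identical."""
    if len(preds_a) != len(preds_b):
        raise ValueError("prediction lists must have the same length")
    return sum(a == b for a, b in zip(preds_a, preds_b)) / len(preds_a)

# Toy masks: the automatic mask (8 voxels) shares 6 voxels with the
# expert-traced mask (10 voxels).
manual = {(x, 0, 0) for x in range(10)}
automatic = {(x, 0, 0) for x in range(4, 12)}
print(round(dice(manual, automatic), 2))  # 2*6 / (10+8) -> 0.67

# Downstream impaired/unimpaired calls can still mostly agree
# even when the voxel overlap is imperfect.
from_manual = [1, 0, 0, 1, 1, 0, 0, 1]
from_automatic = [1, 0, 0, 1, 1, 0, 1, 1]
print(concordance(from_manual, from_automatic))  # 7 of 8 agree -> 0.875
```

Neither quantity says anything about whether the predictions match the patient's measured neuropsychological outcome; that is what the AUCs are for.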

Cognitive performance is reported on 28 neuropsychological outcomes. Five detailed examples, chosen to span distinct domains, yield AUCs of 0.74 to 0.90: expressive language (verbal fluency, AUC ≈ 0.90), receptive language (Token Test), visuospatial (Judgment of Line Orientation, sensitivity 91% / specificity 71%), auditory working memory (Digit Span), executive functions (Trails B). Comparison of modeling strategies shows a significant gain from adding the network maps on top of the lesion alone (Wilcoxon signed-rank test, N=28, p=0.007) and from adding demographic covariates (p=0.002). The authors nevertheless explicitly acknowledge that AUCs vary substantially across the 28 outcomes: some models exceed 0.8, many lie between 0.6 and 0.8, and a few fall below 0.5, in other words worse than chance. Specificity also drops sharply between training cross-validation and the independent test set (from 0.84 to 0.55 for the Token Test), pointing to a threshold-calibration problem. On the timing side, the full pipeline runs in 121 seconds on average on a Xeon + RTX 6000 Ada 48 GB workstation, and in under three minutes for 95% of cases.
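AUC and specificity can diverge because AUC is threshold-free while sensitivity and specificity depend on the cut-off chosen to call a patient impaired. The sketch below, on invented scores, computes AUC as a rank statistic and shows how moving the threshold trades sensitivity against specificity without changing the AUC; it also shows what an AUC below 0.5 means.

```python
def auc(scores, labels):
    """AUC as the probability that a random impaired case outranks a random
    unimpaired one (ties count one half)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one case of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def sens_spec(scores, labels, threshold):
    """Sensitivity and specificity when scores >= threshold are flagged impaired."""
    tp = sum(s >= threshold and l == 1 for s, l in zip(scores, labels))
    tn = sum(s < threshold and l == 0 for s, l in zip(scores, labels))
    return tp / labels.count(1), tn / labels.count(0)

labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = impaired
scores = [0.9, 0.8, 0.6, 0.4, 0.7, 0.5, 0.3, 0.2]

print(auc(scores, labels))              # 13 of 16 pairs ranked correctly -> 0.8125
print(sens_spec(scores, labels, 0.5))   # (0.75, 0.5)
print(sens_spec(scores, labels, 0.75))  # (0.5, 1.0)

# Reversing the ranking gives an AUC below 0.5: worse than chance.
print(auc([1 - s for s in scores], labels))  # 0.1875
```

A threshold tuned on the training cohort can therefore land at a very different operating point on new patients even when the ranking quality holds up.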

Clinical translation. To anchor the numbers, take 1,000 ischemic stroke patients imaged routinely with this pipeline. Even if every lesion exceeded 1 cm³, the 93% detection rate would leave about 70 patients without a segmented lesion; lesions below 1 cm³, precisely those where cognitive risk is hardest to gauge clinically, are not covered by that figure at all. For the remaining 930, the LLM report would propose individual probabilities for 28 cognitive functions; in practice, roughly two thirds of those probabilities would be useful (AUC ≥ 0.7) and a third would be either uncertain or misleading. With an observed specificity around 55% (the Token Test figure on the independent test set), nearly one in two unimpaired patients would be wrongly flagged "at risk" on that outcome; the share of flagged patients who are false positives depends in addition on how common the impairment is. This is a serious decision aid, provided that clinicians and patients understand what the numbers actually say.
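How many flagged patients turn out to be false positives depends on prevalence and sensitivity as well as on specificity. The arithmetic can be sketched as follows, using the 930 patients and the 55% specificity quoted above; the sensitivity of 0.80 and the two prevalence values are assumptions made for illustration, not figures from the paper.

```python
def flagged_breakdown(n, prevalence, sensitivity, specificity):
    """Expected true and false positives among n patients for one binary outcome."""
    impaired = n * prevalence
    unimpaired = n - impaired
    true_pos = impaired * sensitivity
    false_pos = unimpaired * (1 - specificity)
    flagged = true_pos + false_pos
    return {
        "true_pos": true_pos,
        "false_pos": false_pos,
        "false_share_of_flagged": false_pos / flagged if flagged else 0.0,
    }

# Assumed sensitivity (0.80) and assumed prevalences (20% and 50%).
for prevalence in (0.2, 0.5):
    out = flagged_breakdown(930, prevalence, sensitivity=0.80, specificity=0.55)
    print(prevalence, round(out["false_pos"]), round(out["false_share_of_flagged"], 2))
```

Under these assumptions, about 69% of flagged patients are false positives when one patient in five is truly impaired, against about 36% when half are. The 45% false-positive rate among unimpaired patients is the same in both cases.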

What is good

The end-to-end integration is technically mature and the output format is designed for the clinic. The pipeline ingests raw DICOM, handles 17 scanner models from four manufacturers, runs in under three minutes on a single workstation, and returns a DICOM-encapsulated PDF directly to the hospital PACS. Very few post-stroke prediction papers go this far in deployment engineering; most stop at a model evaluated on a clean dataset.

The use of the LLM is unusually cautious and concretely useful. The model (LLaMA 3.3 70B) runs locally with no internet access and never receives an image or a clinical note; its role is explicitly restricted to natural-language formatting of fixed templates, and a Markdown parser checks template adherence post hoc. This architecture is designed to close off the classic failure modes of generative AI in healthcare (number hallucination, PHI leakage, unsolicited therapeutic recommendation). The SMOG 6.6 reading grade also indicates reports accessible to the patients themselves, which is a coherent editorial choice.
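A post hoc template check of this kind might look like the sketch below. It is not the authors' parser: the heading names, the report text and the rule that every number in the output must come from the structured input are all hypothetical, chosen to show how structural drift and invented figures could be caught mechanically.

```python
import re

# Hypothetical template: these headings are illustrative, not the paper's.
REQUIRED_HEADINGS = ["## Summary", "## Predicted outcomes", "## Lesion location"]

def check_report(markdown, allowed_numbers):
    """Return a list of problems found in an LLM-drafted Markdown report."""
    problems = []
    lines = [line.strip() for line in markdown.splitlines()]
    # 1. Every required heading must appear, in the expected order.
    positions = [lines.index(h) if h in lines else -1 for h in REQUIRED_HEADINGS]
    if -1 in positions:
        problems.append("missing heading")
    elif positions != sorted(positions):
        problems.append("headings out of order")
    # 2. Every number in the text must come from the structured input.
    allowed = {float(x) for x in allowed_numbers}
    for token in re.findall(r"\d+(?:\.\d+)?", markdown):
        if float(token) not in allowed:
            problems.append(f"unexpected number: {token}")
    return problems

good = (
    "## Summary\nRisk estimates for 2 outcomes.\n"
    "## Predicted outcomes\nVerbal fluency: 72% probability of impairment.\n"
    "Digit Span: 31% probability of impairment.\n"
    "## Lesion location\nLeft frontal lobe.\n"
)
bad = good.replace("31%", "13%")  # simulate a transposed digit

print(check_report(good, [2, 72, 31]))  # []
print(check_report(bad, [2, 72, 31]))   # ['unexpected number: 13']
```

A check like this only guards format and figures; it cannot tell whether the wording around a correct number is clinically appropriate, which is why human review of the reports still matters.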

The methodology rests on a decade of converging work and the prediction meta-architecture is rigorous. The lesion location + sLNM + fLNM approach aggregated via ridge regression was not invented for the occasion: it extends ten years of work from the group (Boes Brain 2015, Bowren Brain 2022, Griffis HBM 2024) with stratified 5×5 cross-validation, 1,000-iteration permutation tests, and formal statistical comparisons of strategies. The IBB toolbox and the dcm_classifier code are public on Zenodo and PyPI.
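A 1,000-iteration permutation test asks how often a model's AUC could be matched if the outcome labels carried no information. The sketch below shows the general technique on invented scores; the one-sided formulation, the seed and the data are assumptions, and the authors' exact procedure may differ.

```python
import random

def auc(scores, labels):
    # Rank-based AUC; ties count one half.
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def permutation_p_value(scores, labels, n_iter=1000, seed=0):
    """One-sided p-value: how often shuffled labels reach the observed AUC."""
    rng = random.Random(seed)
    observed = auc(scores, labels)
    shuffled = list(labels)
    at_least_as_good = 0
    for _ in range(n_iter):
        rng.shuffle(shuffled)  # break any link between scores and labels
        if auc(scores, shuffled) >= observed:
            at_least_as_good += 1
    # The +1 terms count the observed labelling itself, so p is never exactly 0.
    return observed, (at_least_as_good + 1) / (n_iter + 1)

labels = [1] * 10 + [0] * 10
scores = [0.95, 0.9, 0.85, 0.8, 0.75, 0.7, 0.65, 0.6, 0.45, 0.3,
          0.55, 0.5, 0.4, 0.35, 0.25, 0.2, 0.15, 0.1, 0.05, 0.0]

observed, p = permutation_p_value(scores, labels)
print(round(observed, 2), p < 0.05)
```

With 1,000 iterations the smallest reportable p-value is 1/1001, which bounds how strong a claim such a test can support for any single outcome.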

What is less good

The claimed heterogeneity is not real external validation. The article highlights 17 scanner models and two decades of data, but training (Iowa Lesion Registry) and testing (Benton Neuropsychology Clinic) both come from the University of Iowa. The two cohorts share a regional population, local neuropsychology protocols, and impairment classification norms calibrated on the same cohorts: the model has never been confronted with a patient from another hospital system, another region, or a population with a different ethnic makeup. This is the population bias failure mode, possibly compounded by a particularly insidious variant of shortcut learning in which the models learn to recognize cohort signatures rather than lesion-cognition relationships. Generalization to other centers remains to be demonstrated.

No standard clinical comparator is reported. The authors concede that it is "difficult to compare directly with other published models", but that does not explain the absence of the simplest baseline: would a model using only age, education and NIHSS severity have done as well? Without that reference point, and without a head-to-head comparison with competing imaging models (Liu Stroke 2023, Matsulevits bioRxiv 2025), it is impossible to quantify the real gain brought by network mapping over a plain logistic regression on three variables. This is the biased comparator failure mode by omission.

Clinical validation of the LLM reports is performed by the senior author himself. Across the 153 generated reports, the authors state that "no hallucination or structural drift was identified" on technical review, then that a board-certified stroke neurologist reviewed thirty reports (≈ 20%) without detecting any error affecting patient care. That neurologist is A.D.B., i.e. Aaron D. Boes: corresponding author, co-inventor of the patent, and co-founder of NeuroPred Inc., the startup that will commercialize the technology. A blinded review by an external clinician would have considerably strengthened the credibility of this finding. Two further nuances are easily lost in rapid communication: some outcomes have AUCs below 0.5 (information absent from the abstract), and the "96% concordance" metric measures agreement between predictions derived from the two segmentation modes, not agreement with clinical ground truth.

What it changes

For the computational neurology research community, the paper marks the industrial maturation of lesion network mapping. The method until now relied on manual research pipelines, demanding in both time and expertise (segmentation by a neuroradiologist, MNI normalization, connectivity computation). Full automation reshuffles the field — future publications will have to position themselves against a fast, reproducible pipeline, and laboratories without the resources to build their own infrastructure will be able to rely on the published open-source components. Upcoming evaluations should however systematically demand a true multi-center validation and head-to-head comparison with NIHSS and mRS.

For vascular neurologists and rehabilitation teams, the message is one of informed caution. The authors themselves specify that they "are not advocating for the clinical use of the outcome-prediction component in its current form", a sober statement that deserves to be remembered. The tool is not ready to change an individual treatment decision, but it already has a place as a support for structured communication with the patient and family (a report readable at middle-school level, in under three minutes), as a basis for early rehabilitation planning, and as a foundation for prospective studies where the pipeline would be validated alongside a reference neuropsychological assessment. The pending patent and the creation of NeuroPred Inc. signal a commercial trajectory that will need monitoring, particularly regarding the transparency of calibration on future cohorts.

For patients and the public, the useful lesson is nuanced. The promise of a personalized cognitive prognosis in under three minutes from a standard MRI is real and coming; it will probably reshape the post-stroke conversation in the next five to ten years. But a numerical probability is not a destiny. When a report states, for instance, "high risk of working-memory deficit", the clinician will need to be able to explain that this estimate rests on an Iowa cohort, can wrongly flag close to half of unimpaired patients on some outcomes, and entirely ignores the non-cerebral determinants of recovery (motivation, social support, access to rehabilitation, comorbidities). Mapping a lesion does not exhaust the prognosis of a life.