Medical AI

SKELEX: a foundation model trained on 1.3 million radiographs to read bone, from cyst to fracture (Kim et al. 2026, npj Digital Medicine)

Published on June 3, 2026 · 12 min read

Shinn Kim, Soobin Lee, Kyoungseob Shin, Han-Soo Kim, Yongsung Kim, Minsu Kim, Juhong Nam, Somang Ko, Daeheon Kwon, Wook Huh, Ilkyu Han and Sunghoon Kwon (Seoul National University) published SKELEX in npj Digital Medicine on 2 June 2026, presenting it as the first large-scale foundation model dedicated to musculoskeletal radiographs. A masked autoencoder with a ViT-Large backbone is pre-trained, with no labels, on 1,296,540 radiographs collected at a single Korean hospital between 2010 and 2016, then adapted to 12 diagnostic tasks evaluated across 7 public datasets. The model is compared with five baselines and improves on its own initialization model by 6.21% on average (relative); on bone tumor detection, for instance, it reaches an AUROC of 0.953 vs 0.884 for that initialization model. It is also better calibrated than its competitors, and with half the labels it matches or exceeds the best baselines trained on all of them. It is a solid demonstration of the value of domain-specific self-supervised pre-training; it must nonetheless be read against several limits: single-center, single-country training data, genuine external validation limited to the bone-tumor application alone, no head-to-head comparison against radiologists, a resolution reduced to 224×224, and weights released for academic use only.

The context

Radiography is the most common imaging exam in the world, and the musculoskeletal system — bones, joints — accounts for a huge share of it: fractures, osteoarthritis, bone tumors, deformities. Yet interpretation depends on radiologists whose numbers do not keep pace with exam volume. Deep learning has promised help for years, but most often in a narrow form: a model trained in a supervised way (from images labeled one by one by an expert) for a single task on a single dataset. Each new question — detect a wrist fracture, grade knee osteoarthritis, spot a tumor — requires starting from scratch and re-annotating thousands of images, which is slow and costly.

The foundation model idea reverses this logic. A large network is first pre-trained in a self-supervised way — with no labels, by having it learn the structure of the images themselves — on a mass of data, then adapted to many downstream tasks with few labeled examples. This recipe has already transformed digital pathology (with GigaPath) and chest radiography. The musculoskeletal system, however, lacked its large generalist model. SKELEX (for musculoSKELEtal X-ray) presents itself as the first to fill that gap.

The method

The article (npj Digital Medicine, 10.1038/s41746-026-02826-9, received 16 January, accepted 21 May, published 2 June 2026, open access under a CC BY-NC-ND license) rests on a masked autoencoder (MAE: a large fraction of the image is randomly hidden and the network is trained to reconstruct the missing areas — it thus learns to represent anatomy without ever being told what it is looking at). The backbone is a ViT-Large (vision transformer: the image is cut into small 16×16-pixel tiles treated like the words of a sentence; here a 24-block encoder and an 8-block decoder). The masking ratio is 75%, and the reconstruction loss is computed only on the hidden tiles.
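To make the masking step concrete, here is a minimal sketch in plain Python. It assumes a 224×224 input (the resolution the article mentions) cut into 16×16 tiles, and the function names (patch_grid, random_mask, masked_mse) are illustrative, not taken from the authors' code. It shows how many tiles a 75% ratio hides and how a loss restricted to hidden tiles ignores errors on visible ones.

```python
import random


def patch_grid(image_size=224, patch_size=16):
    """Return (tiles per side, total tiles) for a square image."""
    per_side = image_size // patch_size
    return per_side, per_side * per_side


def random_mask(num_patches, mask_ratio=0.75, seed=0):
    """Randomly split tile indices into (visible, masked) for one image."""
    rng = random.Random(seed)
    ids = list(range(num_patches))
    rng.shuffle(ids)
    num_masked = int(num_patches * mask_ratio)
    return sorted(ids[num_masked:]), sorted(ids[:num_masked])


def masked_mse(target, prediction, masked_ids):
    """Mean squared error averaged over the masked tiles only.

    target and prediction are lists of tiles, each a flat list of pixel values.
    """
    total = 0.0
    for i in masked_ids:
        t, p = target[i], prediction[i]
        total += sum((a - b) ** 2 for a, b in zip(t, p)) / len(t)
    return total / len(masked_ids)


per_side, n = patch_grid()
visible, masked = random_mask(n)
print(per_side, n, len(visible), len(masked))  # 14 196 49 147
```

Under these assumptions the encoder sees only 49 of 196 tiles per image, which is part of why masked autoencoders are comparatively cheap to pre-train.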

Pre-training happens in two stages: the model starts from an MAE already trained on ImageNet (everyday photos), then is domain-adapted on the radiograph set. That set, named SNUH-1M, comprises 1,296,540 unlabeled radiographs pulled from the PACS (the image-archiving system) of Seoul National University Hospital between 2010 and 2016, covering 15 anatomical regions and more than 89 conditions. The entire pre-training required only a single RTX A6000 graphics card and about 1,630 hours of compute (roughly 68 days of continuous running), a modest budget for a model of this size.

To measure what the model learned, the authors then adapt it to 12 diagnostic tasks across 7 public datasets: pediatric wrist fracture and its fine classification (GRAZPEDWRI-DX), fracture and orthopedic-implant detection (FracAtlas), abnormality detection (MURA, 40,005 studies), presence then benign/malignant characterization and 9-class subtyping of bone tumors (BTXRD, from three Chinese hospitals), knee osteoarthritis grading on the Kellgren-Lawrence scale (OAI), flatfoot (PesPlanus) and bone-age estimation (RSNA Bone Age). An important hygiene point: these public sets were excluded from pre-training to avoid any leakage, and most evaluations are run on a held-out test sample (10%) within each set. SKELEX is compared to five models: ResNet-101, two ViTs pre-trained on ImageNet (including its own initialization model, ViT-MAE/I1K) and two medical self-supervised models, BiomedCLIP and Radio-DINO. The authors add a region-guided multi-head classifier: a YOLO11x detector localizes 29 anatomical regions, then a region-specific head takes over.
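The region-guided idea can be sketched as a simple dispatch: a detector names the anatomical region, and the prediction is delegated to the head registered for that region. Everything below is hypothetical (the function names, the stand-in detector, the two constant heads); it only illustrates the routing logic, not the YOLO11x detector or the authors' classifier.

```python
def region_guided_predict(image, detect_region, heads):
    """Route an image to the head registered for its detected region.

    detect_region: callable returning a region name for the image.
    heads: dict mapping region name -> callable returning a probability.
    """
    region = detect_region(image)
    head = heads.get(region)
    if head is None:
        raise KeyError(f"no head registered for region {region!r}")
    return region, head(image)


# Stand-ins for illustration only: a real system would run a detector and
# trained classification heads here.
def fake_detector(image):
    return image["region"]


heads = {
    "wrist": lambda image: 0.82,  # hypothetical fracture probability
    "knee": lambda image: 0.10,   # hypothetical abnormality probability
}

region, prob = region_guided_predict({"region": "wrist"}, fake_detector, heads)
print(region, prob)
```

One consequence of this design is that an error by the detector sends the image to the wrong head, so region identification has to be very reliable for the downstream scores to mean anything.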

The results

Across all tasks, SKELEX delivers an average relative improvement of 6.21% over its own initialization model, under an identical protocol. The clearest result concerns bone tumor detection, with an AUROC of 0.953 (AUROC, the area under the ROC curve, measures the ability to tell a positive case from a negative one: 1.0 is perfect, 0.5 is chance) vs 0.884 for ViT-MAE/I1K, 0.902 for the ViT pre-trained on ImageNet-21K, 0.903 for ResNet-101, 0.914 for BiomedCLIP and 0.867 for Radio-DINO. Relative gains range from 5.39 to 12.30% on tumor subtyping, 2.78 to 13.47% on flatfoot and 2.20 to 7.66% on wrist-fracture classification.
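For readers who want to see what these figures mean operationally, the sketch below computes an AUROC directly from its definition (the probability that a random positive case scores higher than a random negative one) and the relative gain between two AUROC values. The toy labels and scores are invented for illustration; only 0.953 and 0.884 come from the article.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count for half a pair. Quadratic in the number of cases, which is
    fine for a demonstration but slow on large test sets.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def relative_gain(new, old):
    """Relative improvement of new over old, in percent."""
    return (new - old) / old * 100.0


# Toy data: one positive (0.4) is ranked below one negative (0.7).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
print(auroc(labels, scores))                  # 8 of 9 pairs correct
print(round(relative_gain(0.953, 0.884), 2))  # 7.81
```

The 7.81% here is the relative gain on tumor detection alone; the 6.21% quoted above is an average across tasks.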

Two results stand out. First, calibration: the expected calibration error (ECE — the gap between the confidence the model announces and its actual accuracy; the lower it is, the more the displayed probability can be trusted) falls to 0.096 on bone tumors vs 0.133 for the best competitor, roughly a 27.8% relative reduction. Second, label efficiency: with only 50% of the labeled data, SKELEX reaches an AUROC of 0.941 on tumor detection — higher than the best baseline trained on 100% of the labels (0.914); the same holds on MURA (0.855 with half the labels, vs 0.846 for the best full-data baseline). The region-guided classifier identifies the anatomical region with a mean AUROC of 0.999 and keeps an AUROC above 0.9 on all abnormality classifications. The gaps are supported by a resampling statistical test (paired bootstrap, 5,000 draws), with p-values often below 0.001.
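The two statistical tools in this paragraph can be sketched briefly. The ECE below uses ten equal-width confidence bins, a common convention that the article does not confirm for this paper. The paired bootstrap is likewise a generic version: it resamples cases with replacement, compares per-case correctness of two models on the same resample, and reports how often model A fails to beat model B. The paper applies its test to its own metrics; accuracy is used here only to keep the example short.

```python
import random


def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean gap between stated confidence and observed accuracy."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes to last bin
        bins[idx].append((c, ok))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += len(b) / n * abs(mean_conf - accuracy)
    return ece


def paired_bootstrap_p(correct_a, correct_b, n_draws=5000, seed=0):
    """Share of resamples in which model A is not more accurate than model B."""
    rng = random.Random(seed)
    n = len(correct_a)
    not_better = 0
    for _ in range(n_draws):
        idx = [rng.randrange(n) for _ in range(n)]  # same cases for both models
        diff = sum(correct_a[i] - correct_b[i] for i in idx)
        if diff <= 0:
            not_better += 1
    return not_better / n_draws


# A model that says "90% sure" but is right half the time: ECE of about 0.4.
print(expected_calibration_error([0.9] * 10, [1] * 5 + [0] * 5))
# Relative ECE reduction quoted in the article: 0.133 -> 0.096.
print(round((0.133 - 0.096) / 0.133 * 100, 1))  # 27.8
```

Because both models are scored on the same resampled cases, the test measures the paired difference rather than two independent estimates, which is what makes it sensitive to small but consistent gaps.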

Clinical translation. The AUROC of 0.953 is obtained on a nearly balanced test set (1,867 tumors to 1,879 non-tumor cases). In a real population, however, a bone tumor is rare: at low prevalence, the same sensitivity and specificity yield far more false positives in absolute terms than the AUROC suggests, each one an unnecessary follow-up exam and a source of unwarranted anxiety. The most practically useful result is therefore not the raw detection figure but label efficiency: a department with few annotated cases, typically for a rare condition, could adapt the model at lower cost. Still, these are retrospective evaluations on held-out samples, not a test under real clinical conditions.
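The prevalence effect is easy to quantify with Bayes' rule. The sketch below uses a purely hypothetical operating point (90% sensitivity, 90% specificity) and a hypothetical 1% prevalence; neither figure comes from the paper, which reports AUROC rather than a fixed threshold. Only the 1,867 / 1,879 split is taken from the article.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value: share of positive calls that are true."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


def counts_per_10k(sensitivity, specificity, prevalence, population=10_000):
    """Expected (true positives, false positives) in a screened population."""
    diseased = population * prevalence
    healthy = population - diseased
    return sensitivity * diseased, (1.0 - specificity) * healthy


# Hypothetical operating point, NOT reported in the paper.
SENS, SPEC = 0.90, 0.90

balanced = 1867 / (1867 + 1879)  # prevalence in the test set quoted above
print(round(ppv(SENS, SPEC, balanced), 3))  # about 0.9 on the balanced set
print(round(ppv(SENS, SPEC, 0.01), 3))      # about 0.083 at 1% prevalence
print(counts_per_10k(SENS, SPEC, 0.01))     # roughly 90 true vs 990 false positives
```

Under these assumptions, roughly eleven of every twelve positive calls would be false alarms at 1% prevalence, even though nothing about the model has changed.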

What works well

The scale of pre-training and a measured label-efficiency gain. Pre-training without labels on 1.3 million radiographs, then showing the model reaches with 50% of the annotations what competitors do with 100%, attacks the real bottleneck of musculoskeletal AI: the cost of expert annotation. The gain is quantified (0.941 vs 0.914 on tumors with half the labels), not merely asserted.

Unusual methodological hygiene. The public evaluation sets were deliberately excluded from pre-training to avoid data leakage (when test images end up in training and artificially inflate scores). Where splitting could not be done by patient, the authors hunted duplicates by image similarity (SSIM) and MD5 fingerprint, and they publish the positive/negative counts "for transparency." This level of precaution is rare.
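The MD5 half of that duplicate hunt can be sketched with the standard library alone. This is an illustration of the general technique, not the authors' script: it groups files whose bytes are identical. It does not cover the SSIM step, which is what catches near-duplicates (re-encoded or slightly cropped copies) that produce different fingerprints.

```python
import hashlib
from collections import defaultdict


def md5_of_bytes(data: bytes) -> str:
    """Hex MD5 fingerprint of a file's raw bytes."""
    return hashlib.md5(data).hexdigest()


def find_exact_duplicates(images):
    """Group image names whose content is byte-for-byte identical.

    images: dict mapping a name to the file's bytes.
    Returns a list of name groups, each with at least two members.
    """
    groups = defaultdict(list)
    for name, data in images.items():
        groups[md5_of_bytes(data)].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


# Toy stand-ins for image files; in practice the bytes would be read from disk.
images = {
    "train/a.png": b"\x89PNG-pixels-1",
    "test/b.png": b"\x89PNG-pixels-1",   # same bytes as train/a.png
    "test/c.png": b"\x89PNG-pixels-2",
}
print(find_exact_duplicates(images))  # [['test/b.png', 'train/a.png']]
```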

Calibration is reported, not just AUROC. Measuring ECE and obtaining the best calibration (–27.8% on BTXRD) matters clinically: a well-calibrated model says "I'm 80% sure" when it is actually right 80% of the time, which is essential for a clinician to know how far to trust it. The code and weights are moreover deposited on GitHub and a web prototype is accessible.

What works less well

A single hospital, a single country: population bias is not ruled out. The 1.3 million images all come from the same Korean institution, over 2010-2016 — same machines, same protocols, same population. Nothing guarantees generalization to other equipment, other countries, other morphologies, and the authors acknowledge it. Above all, the genuine external validation (on data of independent origin) covers only one of the twelve applications, bone tumor; the other eleven are evaluated on held-out samples inside the public sets. And the two external sources used for tumors (Radiopaedia, MedPix) are curated teaching-image banks — not consecutive clinical cohorts, which introduces a selection bias.

No radiologist on the other side, and metrics that flatter. Despite a narrative built on the radiologist shortage, no quantified head-to-head human-machine comparison is reported: the comparator remains algorithmic. Some measures also call for caution. An AUROC of 0.999 for anatomical-region identification, or scores obtained on nearly balanced test sets (1,867 vs 1,879) that do not reflect real prevalence, are classic cases of metrics that can mislead: excellent on the bench, they say nothing about performance at the real operating threshold. The "6.21%" average gain, finally, is measured only against the initialization model.

Possible residual leakage, reduced resolution, bounded reproducibility. For FracAtlas, BTXRD and PesPlanus, the train/test split was done at the image level, not the patient level: despite the SSIM and MD5 controls, two views of the same patient may end up on either side, an open door to data leakage. The mandatory downsizing to 224×224 pixels can erase fine signs — a non-displaced fracture, the faint medullary lucency of a tumor — which the authors admit. Finally, the pre-training data are not released, the weights are released "for academic research use only," and the article is under a CC BY-NC-ND license (no commercial reuse, no derivatives): independent reproducibility and any real deployment remain bounded. No CE marking or FDA clearance is mentioned. Funding (Korean public bodies: KHIDI/Ministry of Health, KUCRF, MOTIE, the BK21 program) and the absence of conflicts of interest are properly declared.
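The difference between an image-level and a patient-level split is easiest to see in code. The sketch below is a generic patient-level split, assuming each image comes with a patient identifier (which, per the article, was not usable for FracAtlas, BTXRD and PesPlanus). The function name and the 10% test fraction are illustrative choices.

```python
import random
from collections import defaultdict


def patient_level_split(records, test_fraction=0.10, seed=0):
    """Split images so that all views of one patient land on the same side.

    records: list of (image_id, patient_id) pairs.
    Returns (train_image_ids, test_image_ids).
    """
    by_patient = defaultdict(list)
    for image_id, patient_id in records:
        by_patient[patient_id].append(image_id)
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)
    n_test = max(1, round(len(patients) * test_fraction))
    test = [img for p in patients[:n_test] for img in by_patient[p]]
    train = [img for p in patients[n_test:] for img in by_patient[p]]
    return train, test


# Toy cohort: 20 patients with two views each (frontal and lateral).
records = [(f"p{p}_view{v}", f"p{p}") for p in range(20) for v in range(2)]
train, test = patient_level_split(records)
print(len(train), len(test))  # 36 4
```

An image-level split would shuffle the 40 images directly, so the frontal view of a patient could be used for training while the lateral view of the same patient sits in the test set.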

What this changes

For the research community, SKELEX confirms that the foundation-model recipe — massive self-supervised pre-training then label-efficient adaptation — also works on musculoskeletal radiography, a field that until now lacked a large generalist model. Releasing the weights for research lets other teams build on it. The expected next steps are clear: multi-center and multi-country pre-training, patient-level external validation extended to all twelve tasks, higher resolution, and finally a comparison against radiologists.

For clinicians, the tool is not deployable today: it is a research prototype (a web demonstration exists), with no prospective validation, no comparison to human reading and no regulatory clearance. Its potential medium-term value is twofold: lowering the annotation cost for rare conditions, and one day serving as a triage aid or second look — never an autonomous diagnostic act.

For patients and the public, the promise is that of broader, cheaper musculoskeletal AI, useful in particular for rare situations such as bone tumors. Caution remains warranted: a model that performs well on retrospective Korean radiographs is not, as it stands, validated to interpret your own X-ray. A prediction is not a diagnosis, and the decision remains the responsibility of the care team.