PromptRad: labelling liver CT reports with only 32 annotated examples, and matching GPT-4

Ying-Jia Lin (Chang Gung University, Taiwan) and her team posted PromptRad on arXiv on 19 May 2026. The paper, accepted at the BioNLP 2026 workshop at ACL, proposes a method for automatically labelling liver CT reports under an extremely small annotation budget. The result fits in one sentence: with only 32 annotated reports and a 110-million-parameter model (less than 1% the size of GPT-4), their prompt-tuning approach enriched with UMLS synonyms reaches 89.2% macro F1 across seven categories of liver lesions, matches GPT-4 zero-shot, and beats it on negation handling. It is important reading because it pushes back on the idea that clinical-NLP performance necessarily requires model scale, and because it offers a setup that hospitals can deploy locally, without sending patient data to a cloud vendor.

The context

Radiology reports are among the richest seams of clinical data that exist. For every imaging study performed, a radiologist writes free text describing what they see, what they suspect, what they rule out. But the richness is trapped by the form: unstructured text, specialised jargon, abbreviations, hedging phrases, multiple negations, deliberate contradictions. To exploit those reports at scale — for example to identify every hepatocellular carcinoma case in a hospital PACS, or to build a training set for an imaging model — they must first be labelled, that is, converted into structured variables (binary categories: such-and-such diagnosis present / absent).

Three approaches dominate. Rule-based labellers (the CheXpert labeller, NegBio, MetaMap) rely on term dictionaries and negation rules. They are fast and transparent, but collapse as soon as wording strays from expected patterns. Fine-tuning pre-trained models (BERT, PubMedBERT) in turn demands thousands of annotated reports per category, an annotation cost out of reach for a hospital service without dedicated research funding. Large language models (GPT-4 and successors) partly solve the problem through their ability to generalise without task-specific training (zero-shot), but they require sending patient reports to an external vendor, which is incompatible with most regulatory frameworks on health data. It is in this gap that the paper sits.
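
To make the brittleness of the rule-based family concrete, here is a toy dictionary labeller with a single negation rule. It is a deliberately naive sketch, not the logic of the CheXpert labeller, NegBio or MetaMap: the term list, the negation cues and the function name `rule_label` are all assumptions made for illustration.

```python
import re

# Hypothetical mini-dictionary: category -> surface forms to look for.
TERMS = {
    "cirrhosis": ["cirrhosis"],
    "metastasis": ["metastasis", "metastases"],
}
# Hypothetical negation rule: a cue followed by at most two words,
# ending exactly where the matched term begins.
NEGATION = re.compile(r"\b(no|without|negative for)\s+(\w+\s+){0,2}$", re.IGNORECASE)


def rule_label(report: str) -> dict[str, bool]:
    """Mark a category positive if any of its terms appears un-negated."""
    labels = {}
    for category, terms in TERMS.items():
        positive = False
        for term in terms:
            for match in re.finditer(rf"\b{re.escape(term)}\b", report, re.IGNORECASE):
                preceding = report[: match.start()]
                if not NEGATION.search(preceding):
                    positive = True
        labels[category] = positive
    return labels
```

With this toy rule, "No liver cirrhosis." is correctly labelled negative, but "Cirrhosis is not seen." comes out positive, because the negation follows the term and the rule only looks backwards. Each such gap needs a new hand-written pattern, which is the maintenance burden the paragraph above describes.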

The method

The study is run by Ying-Jia Lin and Hung-Yu Kao (National Tsing Hua University and Chang Gung University) with a clinical team from Chang Gung Memorial Hospital and Sijhih Cathay General Hospital (Taiwan). The preprint was posted on arXiv on 19 May 2026 (revised on 20 May), and the camera-ready version was accepted at the BioNLP 2026 workshop at ACL. DOI: 10.48550/arXiv.2605.20052; the code is published under an open licence on GitHub. Funding is public and Taiwanese (NSTC), with no declared conflict of interest.

The setup rests on three elements. First element: a lightweight backbone, PubMedBERT, a 110-million-parameter BERT pre-trained by Microsoft Research on PubMed text. It is an encoder-only model, much smaller than consumer LLMs but trained specifically on biomedical literature, hence well matched to medical vocabulary.

Second element: prompt-tuning via masked language modelling. Rather than adding a classification layer on top of PubMedBERT (the classical fine-tuning approach), PromptRad reformulates the labelling task as a cloze test. The report is inserted into a template such as "r The radiology report is related to [MASK].", where r stands for the report text and [MASK] is the word to predict. The model is trained to predict a vocabulary word associated with each target category. This formulation has two virtues: it preserves what the model learned in pre-training (no randomly initialised layer) and it directly exploits the masked-language-modelling head that PubMedBERT spent its entire pre-training perfecting.
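
The cloze reformulation can be sketched in a few lines. This is a minimal illustration of the template quoted above, not the authors' code: the function name `build_cloze_prompt` and the character-based truncation are assumptions (a real pipeline would truncate by tokens to fit the encoder's input limit).

```python
MASK_TOKEN = "[MASK]"
# Template quoted in the text; "{report}" plays the role of the placeholder r.
TEMPLATE = "{report} The radiology report is related to {mask}."


def build_cloze_prompt(report: str, max_report_chars: int = 2000) -> str:
    """Wrap a report in the cloze template.

    Only the report is clipped (hypothetical character budget), so the
    template and its single [MASK] slot always survive intact.
    """
    clipped = report.strip()[:max_report_chars]
    return TEMPLATE.format(report=clipped, mask=MASK_TOKEN)


prompt = build_cloze_prompt("Liver: a 2 cm hypodense lesion in S6.")
```

The resulting string is what a masked language model would receive; its prediction at the [MASK] position is then mapped back to a category.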

Third element: the UMLS-enriched multi-word verbalizer. A verbalizer, in prompt-tuning, is the function that maps a class to one or more vocabulary words. Naively, one would simply take the class name ("hepatocellular carcinoma"). The authors go further: they query SNOMED CT (via the UMLS Metathesaurus, the US National Library of Medicine's large medical thesaurus) for each class and add the synonyms used in clinical practice. Hepatocellular carcinoma becomes { "hcc", "hepatoma" }; steatosis becomes { "steatosis", "fatty liver" }; post-treatment includes the acronyms RFA (radiofrequency ablation) and TACE (transarterial chemoembolization). At decision time, the model takes, for each category, the maximum probability across that category's synonyms. This injection of medical knowledge is what distinguishes PromptRad from generic prompt-tuning.
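
The max-over-synonyms aggregation can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the probabilities are made up, the 0.5 decision threshold is hypothetical, and how a probability is obtained for a multi-word entry such as "fatty liver" is left out entirely.

```python
# Verbalizer: class -> label words. Only the two sets quoted in the text
# are reproduced; the paper's full list is not.
VERBALIZER = {
    "hepatocellular carcinoma": ["hcc", "hepatoma"],
    "steatosis": ["steatosis", "fatty liver"],
}


def class_scores(mask_probs: dict[str, float]) -> dict[str, float]:
    """Score each class by the highest probability among its label words."""
    return {
        label: max(mask_probs.get(word, 0.0) for word in words)
        for label, words in VERBALIZER.items()
    }


def predict(mask_probs: dict[str, float], threshold: float = 0.5) -> list[str]:
    """Return the classes whose score reaches the (assumed) threshold."""
    return [label for label, score in class_scores(mask_probs).items() if score >= threshold]


# Made-up probabilities at the [MASK] position, for illustration only.
example = {"hcc": 0.08, "hepatoma": 0.61, "steatosis": 0.02, "fatty liver": 0.11}
```

Here the report would be labelled hepatocellular carcinoma because "hepatoma" scores high even though "hcc" does not: a single class-name verbalizer would have missed it, which is the point of adding synonyms.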

The dataset comprises 1,098 liver CT reports, de-identified, written in English, from a large Taiwanese medical centre over 2008–2017. Strict chronological split: 773 training reports (2008–2014), 325 test reports (2015–2017). Seven liver-lesion categories annotated by two senior radiologists, with the explicit instruction to mark suspicious mentions as positive to avoid false negatives. In the low-resource configuration, only 32 reports are sampled from the training pool, stratified to preserve class distribution; results are averaged over five draws. Study approved by the institutional review board of the participating centre.
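
The sampling protocol described above (split by date, then draw 32 stratified training reports, repeated over five draws) might look like the sketch below. It is a simplification: each report carries a single `label` field here, whereas the real task has seven categories per report, and the field names, function names and largest-remainder allocation are assumptions rather than the authors' procedure.

```python
import random
from collections import defaultdict


def chronological_split(reports, last_train_year=2014):
    """Reports dated up to last_train_year go to train, later ones to test."""
    train = [r for r in reports if r["year"] <= last_train_year]
    test = [r for r in reports if r["year"] > last_train_year]
    return train, test


def stratified_sample(pool, k, seed):
    """Draw k reports with per-label quotas proportional to label frequency."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for report in pool:
        by_label[report["label"]].append(report)
    # Largest-remainder allocation so that the quotas sum to exactly k.
    exact = {lab: k * len(items) / len(pool) for lab, items in by_label.items()}
    quota = {lab: int(share) for lab, share in exact.items()}
    by_remainder = sorted(exact, key=lambda lab: exact[lab] - quota[lab], reverse=True)
    for lab in by_remainder[: k - sum(quota.values())]:
        quota[lab] += 1
    sample = []
    for lab in sorted(by_label):
        sample.extend(rng.sample(by_label[lab], quota[lab]))
    return sample


def five_draws(pool, k=32):
    """One sample per seed; results would then be averaged across draws."""
    return [stratified_sample(pool, k, seed) for seed in range(5)]
```

Note that proportional quotas can give a very rare label zero slots in a 32-report sample, which is one reason performance on minority categories deserves separate scrutiny.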

Comparators: three families. Rule-based: a dictionary labeller, MetaMap and NegBio. Fine-tuning-based: standard PubMedBERT, and two hybrid variants where fine-tuning is preceded by MetaMap or NegBio preprocessing. Large model: GPT-4 zero-shot, and GPT-4 with in-context learning using three examples.

The results

The headline result is a macro F1 of 89.2% (± 1.0) for PromptRad+AutoT (the variant with automatic template generation via T5), slightly above GPT-4 zero-shot at 88.7%, and well above standard PubMedBERT fine-tuning at 58.6% (± 10.0). The manual PromptRad variant reaches 83.7% (± 2.1). NegBio, the best of the rule-based labellers, tops out at 76.6%. Per category, liver metastasis detection moves from 27.5% (NegBio) or 54.9% (PubMedBERT) to 84.7% (PromptRad+AutoT), a substantial gap on a minority category (101 training cases, 46 test cases). On haemangioma, another rare category, the jump is from 37.7% (PubMedBERT) to 92.4%.
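
Because the headline number is a macro F1, it helps to recall how that metric is built. The sketch below uses the standard definitions; the confusion counts are made up for illustration and are not taken from the paper.

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from counts; taken as 0.0 when the class has no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(counts: dict[str, tuple[int, int, int]]) -> float:
    """Unweighted mean of per-class F1: a rare class weighs as much as a common one."""
    return sum(f1(*c) for c in counts.values()) / len(counts)


# Hypothetical (tp, fp, fn) counts for one common and one rare class.
toy = {"common": (90, 10, 10), "rare": (2, 1, 6)}
score = macro_f1(toy)
```

On these toy counts the common class scores 0.90 and the rare one about 0.36, so the macro average lands near 0.63, whereas pooling all counts together would give roughly 0.87. A labeller that fails on minority categories is therefore punished by macro F1, which is why the gains on metastasis and haemangioma move the headline figure so much.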

On negation handling, the paper reports (Figure 4) that PromptRad+AutoT outperforms GPT-4, NegBio and MetaMap at distinguishing explicit negative phrasings ("no liver cirrhosis", "R/O metastasis") across the three categories HCC, Cirrhosis and Metastasis. NegBio in particular collapses on cirrhosis because its syntactic-parsing logic requires complete sentences, and radiologists often write telegraphically. The gap is widest where reading the text requires semantic understanding — exactly the home turf of a language model.

Clinical translation. To anchor the orders of magnitude for liver CT reports labelled automatically in routine practice, the per-category scores above can be read as a rough proxy for recall (they appear to be F1 figures, which also reflect precision, so the counts below are approximations). A PubMedBERT labeller fine-tuned on 32 examples would miss roughly 45 steatosis mentions out of 100, against 3 for PromptRad+AutoT. On liver metastasis, NegBio would miss about 72 real cases out of 100, against 15 for PromptRad+AutoT, at the cost of a few extra false positives in absolute terms. A negative report flagged as positive in error has a human-review cost, but no direct clinical cost. For a radiology service trying to build a liver-tumour registry from its PACS, the gap between roughly 28% and 85% recall on metastasis changes the nature of the registry, from unusable to usable.
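
The arithmetic behind these counts is simple enough to write down. The sketch below reads the metastasis figures as recall, as the paragraph does; the second function shows how far recall can drift from F1 once precision differs. The precision value of 0.95 is purely hypothetical, and both function names are illustrative.

```python
def missed_per_100(recall: float) -> float:
    """Expected number of true cases missed out of 100 at a given recall."""
    return round(100 * (1 - recall), 1)


def recall_from_f1(f1: float, precision: float) -> float:
    """Recall implied by an F1 score at a known precision.

    Rearranged from F1 = 2pr / (p + r), which gives r = F1 * p / (2p - F1).
    """
    return f1 * precision / (2 * precision - f1)


negbio_missed = missed_per_100(0.275)      # metastasis figure quoted for NegBio
promptrad_missed = missed_per_100(0.847)   # metastasis figure quoted for PromptRad+AutoT
# Hypothetical: if 0.847 is an F1 and precision were 0.95, recall would be lower.
implied_recall = recall_from_f1(0.847, 0.95)
```

Read as recall, the two scores give 72.5 and 15.3 missed cases per 100. If 84.7% is instead an F1 obtained at a hypothetical precision of 95%, the implied recall is about 76%, or close to 24 missed cases per 100, so the exact counts depend on the precision/recall balance that the per-category figures do not show.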

What's good

Three specific strengths.

Data efficiency is extreme and quantified. 32 annotated reports versus the thousands required by standard BERT fine-tuning: that's the gap that makes the method deployable in a real hospital service, where no radiologist has the slack to annotate 5,000 cases. The paper further shows (section 5.4) that performance keeps climbing with more data: at 128 examples, PromptRad+AutoT exceeds 90% macro F1. The curve is honestly traced, without cherry-picking a favourable threshold.

Validation uses a chronological split, not a random split. Train on 2008–2014, test on 2015–2017. That is methodological discipline which limits at least one form of data leakage: the scenario where near-identical reports from the same patient or the same radiologist land in both train and test because a random draw scattered them. A split by date does not by itself guarantee patient-level separation, but it mirrors deployment, where the model always labels reports written after its training data. On clinical data, this is the right practice, still too often ignored in the medical-NLP literature.

The code is published under a permissive licence and the backbone model is open. PubMedBERT is MIT-licensed, the PromptRad code is published under CC BY 4.0 on GitHub, and the method depends on no proprietary service at inference time (the OpenAI API is used only for comparison). Concretely, a hospital can deploy the full pipeline locally, without sending a single report to an external vendor, which is rare in the current wave of papers that simply prompt GPT-4 and publish.

What's less good

Three precise limitations.

Data from a single centre, a single modality, a single language. 1,098 liver CT reports from one Taiwanese hospital, written in English. This is population bias in its classic form: nothing guarantees that the model will survive a change of service (paediatric radiology), modality (MRI), specialty (cardiology reports), or language (hospital French, which freely mixes French and English with its own abbreviations). The authors explicitly acknowledge this in their limitations section. For a French-speaking hospital, retraining from scratch would be required, and the lack of a comparably strong French counterpart to PubMedBERT is a real obstacle.

The GPT-4 comparator is run without prompt optimisation, which disadvantages it. The 2024–2026 literature has shown extensively that a carefully designed prompt, with well-chosen examples and explicit chain-of-thought, can gain GPT-4 5 to 10 points on clinical tasks. The in-context learning version tested here uses "three random examples", probably not the best choice. The failure mode to flag is the biased comparator: it is unclear whether PromptRad beats a poorly used GPT-4 or a GPT-4 in optimal conditions. A comparison with fine-tuned GPT-4, or with an open-weights biomedical language model (a medical fine-tune of Llama 3, for instance), would have been more instructive.

The reference metric is F1 on seven fixed categories, which does not cover the most demanding clinical scenario. In practice, a radiology service needs to label potentially dozens of categories, including rare findings that do not appear once in 32 training examples. The paper says nothing about PromptRad's degradation on very low-prevalence categories (1 case in 1,000, say), nor on incidental findings outside the UMLS vocabulary. The classic trap of the misleading metric looms: an 89% macro F1 on seven categories that are all represented in the training sample can mask a collapse on an eighth that is not. A prospective evaluation on the long tail of radiological findings is still needed before the method can be qualified for production.

What it changes

For the medical-NLP research community, the methodological message matters. Since 2023, many teams have stopped fine-tuning compact models because GPT-4 zero-shot seemed to absolve them of the effort. This paper is a reminder that a well-adapted backbone and a verbalizer enriched with medical terminology can match GPT-4 on a precise task, with less than 1% of the parameters, hence negligible inference cost and no dependence on an external API. It's a useful argument in the debate over hospital technical sovereignty. Conversely, the paper does not claim to replace GPT-4 on open-ended tasks, and it does not do so; an honest critique must keep that boundary in mind.

For clinicians and imaging services, the operational lever is concrete. A service that wants to retrospectively index its PACS — for example to assemble a hepatocellular-carcinoma cohort for research or quality review — can now hope to do so with a few radiologist-hours of annotation effort, rather than radiologist-months. The practical question, however, is integration: who hosts the model, who maintains it, who audits its false negatives. None of these is solved by the paper, which stops at the benchmark.

For patients and the public, the takeaway is indirect. No one will ever consult PromptRad. But the pipelines downstream of it (feeding cancer registries, epidemiological studies on liver lesions, training sets for future imaging models) will have a cascade effect on the quality of the medical knowledge produced. A labelling infrastructure going from roughly 60% to 89% macro F1 means a scientific literature that drifts less, better-constituted cohorts, and ultimately more reliable clinical recommendations. The benefit is invisible but real.

Further reading

arXiv preprint 2605.20052 is open access on arxiv.org. Code and scripts are published at github.com/ila-lab/PromptRad. For PubMedBERT, Gu et al. (2021) is accessible via DOI 10.1145/3458754. For the conceptual foundations of prompt-tuning, see Liu et al., ACM Computing Surveys, 2023 (DOI 10.1145/3560815). For the UMLS Metathesaurus, the entry point is the National Library of Medicine. For a recent overview of failure modes specific to clinical LLMs, see our analysis of the Auger 2026 study.