Research Papers

arXiv | Cancer: unknown | Method: hierarchical vision-language modeling

U-VLM: Hierarchical Vision Language Modeling for Report Generation

Pengcheng Shi, Minghui Zhang, Kehan Song, Jiaqi Liu, Yun Gu, Xinglin Zhang
Published 2026-02-28 05:43

The paper presents U-VLM, a novel hierarchical vision-language model designed for automated radiology report generation from 3D medical imaging. It addresses limitations of existing models by incorporating segmentation-pretrained encoders and multi-layer visual feature injection. The proposed method demonstrates state-of-the-art performance on benchmark datasets, indicating that effective vision encoder pretraining can surpass the advantages of larger pre-trained language models. Ablation studies further confirm the benefits of progressive pretraining and multi-layer injection.

Automated radiology report generation is key for reducing radiologist workload and improving diagnostic consistency, yet generating accurate reports for 3D medical imaging remains challenging. Existing vision-language models face two limitations: they do not leverage segmentation-pretrained encoders, and they inject visual features only at the input layer of language models, losing multi-scale information. We propose U-VLM, which enables hierarchical vision-language modeling in both training and architecture: (1) progressive training from segmentation to classification to report generation, and (2) multi-layer visual injection that routes U-Net encoder features to corresponding language model layers. Each training stage can leverage different datasets without unified annotations. U-VLM achieves state-of-the-art performance on CT-RATE (F1: 0.414 vs 0.258, BLEU-mean: 0.349 vs 0.305) and AbdomenAtlas 3.0 (F1: 0.624 vs 0.518 for segmentation-based detection) using only a 0.1B decoder trained from scratch, demonstrating that well-designed vision encoder pretraining outweighs the benefits of 7B+ pre-trained language models. Ablation studies show that progressive pretraining significantly improves F1, while multi-layer injection improves BLEU-mean. Code is available at https://github.com/yinghemedical/U-VLM.

arXiv | Cancer: general cancer | Method: large language model

Improving Automatic Summarization of Radiology Reports through Mid-Training of Large Language Models

Mengxian Lyu, Cheng Peng, Ziyi Chen, Mengyuan Zhang, Jieting Li Lu, Yonghui Wu
Published 2026-02-28 03:36

This study improves automatic summarization of radiology reports by adding a subdomain mid-training stage between pre-training and fine-tuning of large language models (LLMs). The authors compared three adaptation strategies and found that their mid-trained model, GatorTronT5-Radio, outperformed models without mid-training on both text-based (ROUGE-L) and factuality (RadGraph-F1) measures. The mid-trained model also showed better few-shot learning, which may alleviate the "cold start" problem reported in earlier work.

Automatic summarization of radiology reports is an essential application to reduce the burden on physicians. Previous studies have widely used the "pre-training, fine-tuning" strategy to adapt large language models (LLMs) for summarization. This study proposed a subdomain adaptation through a mid-training method to improve summarization. We explored three adaptation strategies: (1) general-domain pre-training, (2) clinical-domain pre-training, and (3) clinical-domain pre-training followed by subdomain mid-training. We developed models using large-scale clinical text from the University of Florida (UF) Health and conducted mid-training and fine-tuning experiments using widely used benchmark datasets including OpenI and MIMIC-CXR. The experimental results show that the mid-trained model, GatorTronT5-Radio, achieved the best performance, outperforming models without mid-training in both text-based measures (ROUGE-L) and factuality measures (RadGraph-F1). Our mid-training methods also demonstrate better few-shot learning and could alleviate the "cold start" problem reported in previous studies as a learning barrier. Our findings support the use of "pre-training, mid-training, fine-tuning," instead of the widely used direct fine-tuning strategy.

arXiv | Cancer: non-small cell lung carcinoma | Method: quantum-inspired multi-class classifier

Pretty Good Measurement for Radiomics: A Quantum-Inspired Multi-Class Classifier for Lung Cancer Subtyping and Prostate Cancer Risk Stratification

Giuseppe Sergioli, Carlo Cuccu, Giovanni Pasini, Alessandro Stefano, Giorgio Russo, Andrés Camilo Granda Arango, Roberto Giuntini
Published 2026-02-27 18:58

This study presents a quantum-inspired multi-class classifier based on the Pretty Good Measurement (PGM) and applies it to lung cancer subtyping and prostate cancer risk stratification. The method uses an operator-valued decision rule to classify all classes at once, without reducing the problem to pairwise or one-vs-rest comparisons. The PGM classifier is competitive with established classical methods: it performs especially well on the binary and three-class non-small-cell lung carcinoma tasks and stays close to the strongest ensemble baseline in prostate cancer risk assessment.

We investigate a quantum-inspired approach to supervised multi-class classification based on the Pretty Good Measurement (PGM), viewed as an operator-valued decision rule derived from quantum state discrimination. The method associates each class with an encoded mixed state and performs classification through a single POVM construction, thus providing a genuinely multi-class strategy without reduction to pairwise or one-vs-rest schemes. In this perspective, classification is reformulated as the discrimination of a finite ensemble of class-dependent density operators, with performance governed by the geometry induced by the encoding map and by the overlap structure among classes. To assess the practical scope of this framework, we apply the PGM-based classifier to two biomedical radiomics case studies: histopathological subtyping of non-small-cell lung carcinoma (NSCLC) and prostate cancer (PCa) risk stratification. The evaluation is conducted under protocols aligned with previously reported radiomics studies, enabling direct comparison with established classical baselines. The results show that the PGM-based classifier is consistently competitive and, in several settings, improves upon standard methods. In particular, the method performs especially well in the NSCLC binary and three-class tasks, while remaining competitive in the four-class case, where increased class overlap yields a more demanding discrimination geometry. In the PCa study, the PGM classifier remains close to the strongest ensemble baseline and exhibits clinically relevant sensitivity-specificity trade-offs across feature-selection scenarios.
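
The construction described in the abstract can be illustrated with a minimal NumPy sketch. It uses the standard PGM formula, E_c = rho^(-1/2) p_c rho_c rho^(-1/2) with rho = sum_c p_c rho_c, where rho_c is the mean density operator of class c and p_c its prior. The `encode` map and all function names are assumptions made for illustration; the abstract does not specify the paper's encoding.

```python
import numpy as np

def encode(x):
    """Hypothetical encoding: feature vector -> pure-state density matrix."""
    v = np.append(np.asarray(x, dtype=float), 1.0)  # constant term keeps the zero vector encodable
    v /= np.linalg.norm(v)
    return np.outer(v, v)

def fit_pgm(X, y):
    """Build one POVM element per class from class-averaged density operators."""
    classes = sorted(set(y))
    y = np.asarray(y)
    priors = {c: np.mean(y == c) for c in classes}
    rhos = {c: np.mean([encode(x) for x, lab in zip(X, y) if lab == c], axis=0)
            for c in classes}
    rho = sum(priors[c] * rhos[c] for c in classes)
    # Inverse square root of the ensemble average, restricted to its support.
    w, U = np.linalg.eigh(rho)
    inv_sqrt = np.where(w > 1e-12, 1.0 / np.sqrt(np.clip(w, 1e-12, None)), 0.0)
    R = (U * inv_sqrt) @ U.T
    return {c: R @ (priors[c] * rhos[c]) @ R for c in classes}

def predict(povm, x):
    """Assign the class whose POVM element gives the largest outcome probability."""
    rho_x = encode(x)
    return max(povm, key=lambda c: np.trace(povm[c] @ rho_x))
```

All classes are handled by a single measurement here: the POVM elements sum to the identity on the support of rho, so the traces in `predict` form a probability distribution over classes rather than a set of pairwise votes.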

arXiv | Cancer: general cancer | Method: latent manifold compaction

Histopathology Image Normalization via Latent Manifold Compaction

Xiaolong Zhang, Jianwei Zhang, Selim Sevim, Emek Demir, Ece Eksi, Xubo Song
Published 2026-02-27 18:26

This study presents Latent Manifold Compaction (LMC), an unsupervised representation learning framework designed to address batch effects in histopathology images caused by variations in staining protocols and acquisition methods. LMC achieves image harmonization by learning batch-invariant embeddings, allowing for improved generalization to unseen target domain data. The method was evaluated on various benchmarks, demonstrating significant reductions in batch-induced separations and outperforming existing normalization techniques in classification and detection tasks.

Batch effects arising from technical variations in histopathology staining protocols, scanners, and acquisition pipelines pose a persistent challenge for computational pathology, hindering cross-batch generalization and limiting reliable deployment of models across clinical sites. In this work, we introduce Latent Manifold Compaction (LMC), an unsupervised representation learning framework that performs image harmonization by learning batch-invariant embeddings from a single source dataset through explicit compaction of stain-induced latent manifolds. This allows LMC to generalize to target domain data unseen during training. Evaluated on three challenging public and in-house benchmarks, LMC substantially reduces batch-induced separations across multiple datasets and consistently outperforms state-of-the-art normalization methods in downstream cross-batch classification and detection tasks, enabling superior generalization.

arXiv | Cancer: unknown | Method: transformer architecture

MuViT: Multi-Resolution Vision Transformers for Learning Across Scales in Microscopy

Albert Dominguez Mantes, Gioele La Manno, Martin Weigert
Published 2026-02-27 17:48

This paper presents MuViT, a novel transformer architecture designed to analyze multi-resolution microscopy images. By embedding image patches into a shared world-coordinate system, MuViT effectively integrates wide-field context with high-resolution details. The method shows significant improvements in performance across various benchmarks, including kidney histopathology and mouse-brain microscopy, compared to existing models. The findings highlight the advantages of explicit world-coordinate modeling for large-scale microscopy analysis.

Modern microscopy routinely produces gigapixel images that contain structures across multiple spatial scales, from fine cellular morphology to broader tissue organization. Many analysis tasks require combining these scales, yet most vision models operate at a single resolution or derive multi-scale features from one view, limiting their ability to exploit the inherently multi-resolution nature of microscopy data. We introduce MuViT, a transformer architecture built to fuse true multi-resolution observations from the same underlying image. MuViT embeds all patches into a shared world-coordinate system and extends rotary positional embeddings to these coordinates, enabling attention to integrate wide-field context with high-resolution detail within a single encoder. Across synthetic benchmarks, kidney histopathology, and high-resolution mouse-brain microscopy, MuViT delivers consistent improvements over strong ViT and CNN baselines. Multi-resolution MAE pretraining further produces scale-consistent representations that enhance downstream tasks. These results demonstrate that explicit world-coordinate modelling provides a simple yet powerful mechanism for leveraging multi-resolution information in large-scale microscopy analysis.
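
The idea of applying rotary positional embeddings to shared world coordinates can be sketched in one dimension. This is an illustrative assumption, not MuViT's actual formulation: `world_coord`, `rotate`, and `attention_score` are hypothetical names, and the abstract does not give the real model's coordinate convention or dimensionality. The sketch shows two properties: patches from different resolution levels that cover the same physical location get the same rotation, and the attention score depends only on the world-space offset.

```python
import cmath

def world_coord(index, spacing, origin=0.0):
    """Position of a patch in shared world units (hypothetical 1D convention)."""
    return origin + index * spacing

def rotate(vec, pos, base=10000.0):
    """Rotary embedding: rotate each complex feature by an angle proportional to pos."""
    n = len(vec)
    return [z * cmath.exp(1j * pos * base ** (-i / n)) for i, z in enumerate(vec)]

def attention_score(q, k):
    """Real part of the complex inner product between a query and a key."""
    return sum((a * b.conjugate()).real for a, b in zip(q, k))

# A fine patch (index 8, spacing 1.0) and a coarse patch (index 2, spacing 4.0)
# land on the same world coordinate and therefore receive the same rotation.
fine_pos = world_coord(8, 1.0)
coarse_pos = world_coord(2, 4.0)
```

Because the rotation angle is tied to world position rather than token index, tokens from different magnifications can share one encoder without a per-level positional table.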

arXiv | Cancer: lung cancer | Method: deformable registration

GLIDE-Reg: Global-to-Local Deformable Registration Using Co-Optimized Foundation and Handcrafted Features

Yunzheng Zhu, Aichi Chien, Kimaya Kulkarni, Luoting Zhuang, Stephen Park, Ricky Savjani, Daniel Low, William Hsu
Published 2026-02-27 17:19

This paper presents GLIDE-Reg, a method for deformable registration in medical imaging that jointly optimizes a registration field and a learnable dimensionality reduction module. The method fuses global semantic cues with local descriptors to improve robustness and generalizability across varying spatial resolutions and anatomical coverage. GLIDE-Reg achieves higher average Dice similarity coefficients than DEEDS in all three cohorts. Its target registration error on Lung250M landmarks (1.58 mm) is better than DEEDS (1.91 mm) but worse than corrField (1.25 mm), and it matches DEEDS on NLST nodule centers (1.11 mm), a task relevant to nodule tracking for early-stage lung cancer diagnosis.

Deformable registration is crucial in medical imaging, with applications including lesion tracking, probabilistic atlas generation, and treatment response evaluation. However, current methods often lack robustness and generalizability across two key factors: spatial resolution and differences in anatomical coverage. We jointly optimize a registration field and a learnable dimensionality reduction module so that compressed vision foundation model (VFM) embeddings remain registration-relevant, and fuse these global semantic cues with MIND local descriptors. GLIDE-Reg achieves average Dice similarity coefficients (DSC) across 6 anatomical structures of 0.859, 0.862, and 0.901 in two public cohorts (Lung250M and NLST) and one institutional cohort (UCLA5DCT), outperforming the state-of-the-art DEEDS (0.834, 0.858, 0.900) with relative improvements of 3.0%, 0.5%, and 0.1%. For target registration error, GLIDE-Reg achieves 1.58 mm on Lung250M landmarks (compared to 1.25 mm for corrField and 1.91 mm for DEEDS) and 1.11 mm on NLST nodule centers (compared to 1.11 mm for DEEDS). The performance on the nodule centers also demonstrates its robustness on challenging downstream tasks, such as nodule tracking, which is an essential prior step for early-stage lung cancer diagnosis.
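
The relative improvements quoted above follow directly from the reported DSC values. The short sketch below reproduces the arithmetic and, for reference, shows the usual set-based definition of the Dice coefficient. It assumes the three scores are listed in the cohort order Lung250M, NLST, UCLA5DCT, which the abstract implies but does not state explicitly; the function names are illustrative.

```python
def dice(a, b):
    """Dice similarity coefficient between two sets of voxel indices."""
    return 2.0 * len(a & b) / (len(a) + len(b))

def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline

cohorts = ["Lung250M", "NLST", "UCLA5DCT"]  # assumed ordering of the reported scores
glide_reg = [0.859, 0.862, 0.901]
deeds = [0.834, 0.858, 0.900]

gains = [round(relative_improvement(g, d), 1) for g, d in zip(glide_reg, deeds)]
# gains == [3.0, 0.5, 0.1], matching the percentages in the abstract
```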

arXiv | Cancer: unknown | Method: neurosymbolic verification

Toward Guarantees for Clinical Reasoning in Vision Language Models via Formal Verification

Vikash Singh, Debargha Ganguly, Haotian Yu, Chengwei Zhou, Prerna Singh, Brandon Lee, Vipin Chaudhary, Gourav Datta
Published 2026-02-27 15:49

This paper presents a neurosymbolic verification framework for auditing the clinical reasoning of vision-language models (VLMs) that draft radiology reports. The framework formalizes free-text radiographic findings as propositional evidence and uses an SMT solver (Z3) with a clinical knowledge base to check whether each diagnostic claim is entailed, hallucinated, or omitted. Evaluating seven VLMs on five chest X-ray benchmarks exposes reasoning failure modes such as conservative observation and stochastic hallucination. On labeled datasets, enforcing solver-backed entailment improves diagnostic soundness and precision.

Vision-language models (VLMs) show promise in drafting radiology reports, yet they frequently suffer from logical inconsistencies, generating diagnostic impressions unsupported by their own perceptual findings or missing logically entailed conclusions. Standard lexical metrics heavily penalize clinical paraphrasing and fail to capture these deductive failures in reference-free settings. Toward guarantees for clinical reasoning, we introduce a neurosymbolic verification framework that deterministically audits the internal consistency of VLM-generated reports. Our pipeline autoformalizes free-text radiographic findings into structured propositional evidence, utilizing an SMT solver (Z3) and a clinical knowledge base to verify whether each diagnostic claim is mathematically entailed, hallucinated, or omitted. Evaluating seven VLMs across five chest X-ray benchmarks, our verifier exposes distinct reasoning failure modes, such as conservative observation and stochastic hallucination, that remain invisible to traditional metrics. On labeled datasets, enforcing solver-backed entailment acts as a rigorous post-hoc guarantee, systematically eliminating unsupported hallucinations to significantly increase diagnostic soundness and precision in generative clinical assistants.
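
The three verdicts (entailed, hallucinated, omitted) can be illustrated with a small, self-contained sketch. It deliberately swaps the paper's Z3 SMT solver for a brute-force truth-table check, which is only feasible for a handful of propositions. The atoms and rules are placeholders invented for illustration; they are not taken from the paper's knowledge base and carry no clinical meaning.

```python
from itertools import product

ATOMS = ["finding_a", "finding_b", "finding_c", "dx_x", "dx_y"]

# Toy knowledge base (hypothetical rules, not clinical guidance).
RULES = [
    lambda e: not (e["finding_a"] and e["finding_b"]) or e["dx_x"],  # a AND b -> dx_x
    lambda e: not e["finding_c"] or e["dx_y"],                        # c -> dx_y
]

def entails(premises, claim):
    """True if every assignment satisfying all premises also satisfies the claim."""
    for values in product([False, True], repeat=len(ATOMS)):
        env = dict(zip(ATOMS, values))
        if all(p(env) for p in premises) and not claim(env):
            return False
    return True

def audit(findings, stated, candidates):
    """Label each candidate diagnosis against the findings and the knowledge base."""
    premises = RULES + [lambda e, a=a: e[a] for a in findings]
    verdicts = {}
    for dx in candidates:
        proved = entails(premises, lambda e, dx=dx: e[dx])
        if proved and dx in stated:
            verdicts[dx] = "entailed"
        elif proved:
            verdicts[dx] = "omitted"        # follows from the findings but was not reported
        elif dx in stated:
            verdicts[dx] = "hallucinated"   # reported without supporting findings
    return verdicts

report = audit(findings=["finding_a", "finding_b"],
               stated=["dx_y"],
               candidates=["dx_x", "dx_y"])
```

Findings that a report does not mention are left unconstrained rather than assumed false, so a claim counts as entailed only when it holds in every consistent scenario.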

arXiv | Cancer: breast cancer | Method: cascaded multi-agent framework

Experience-Guided Self-Adaptive Cascaded Agents for Breast Cancer Screening and Diagnosis with Reduced Biopsy Referrals

Pramit Saha, Mohammad Alsharid, Joshua Strong, J. Alison Noble
Published 2026-02-27 10:48

This paper presents BUSD-Agent, an experience-guided cascaded multi-agent framework for breast ultrasound screening and diagnosis. The framework aims to reduce unnecessary biopsy referrals through a two-stage decision process that filters out low-risk benign and normal cases and escalates higher-risk cases for further evaluation. Across 10 breast ultrasound datasets, conditioning on past decision trajectories reduced diagnostic escalation from 84.95% to 58.72% and biopsy referrals from 59.50% to 37.08%, while improving screening and diagnostic specificity.

We propose an experience-guided cascaded multi-agent framework for Breast Ultrasound Screening and Diagnosis, called BUSD-Agent, that aims to reduce diagnostic escalation and unnecessary biopsy referrals. Our framework models screening and diagnosis as a two-stage, selective decision-making process. A lightweight 'screening clinic' agent, restricted to classification models as tools, selectively filters out benign and normal cases from further diagnostic escalation when malignancy risk and uncertainty are estimated as low. Cases that have higher risks are escalated to the 'diagnostic clinic' agent, which integrates richer perception and radiological description tools to make a secondary decision on biopsy referral. To improve agent performance, past records of pathology-confirmed outcomes along with image embeddings, model predictions, and historical agent actions are stored in a memory bank as structured decision trajectories. For each new case, BUSD-Agent retrieves similar past cases based on image, model response and confidence similarity to condition the agent's current decision policy. This enables retrieval-conditioned in-context adaptation that dynamically adjusts model trust and escalation thresholds from prior experiences without parameter updates. Evaluation across 10 breast ultrasound datasets shows that the proposed experience-guided workflow reduces diagnostic escalation in BUSD-Agent from 84.95% to 58.72% and overall biopsy referrals from 59.50% to 37.08%, compared to the same architecture without trajectory conditioning, while improving average screening specificity by 68.48% and diagnostic specificity by 6.33%.
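
The two-stage selective decision process and the idea of adjusting escalation thresholds from retrieved past cases can be sketched as follows. Everything here is a simplified assumption: the thresholds, the memory format `(risk, uncertainty, malignant)`, the similarity measure, and the rule for relaxing the threshold are hypothetical, whereas the real system conditions LLM-based agents on retrieved trajectories rather than applying a fixed rule.

```python
from dataclasses import dataclass

@dataclass
class Case:
    risk: float         # estimated malignancy probability from the screening models
    uncertainty: float  # estimated uncertainty in [0, 1]

def adapted_risk_threshold(case, memory, base=0.2, relaxed=0.4, k=3):
    """Hypothetical rule: relax the threshold if the k most similar past cases were all benign."""
    nearest = sorted(
        memory,
        key=lambda m: abs(m[0] - case.risk) + abs(m[1] - case.uncertainty),
    )[:k]
    return relaxed if nearest and not any(m[2] for m in nearest) else base

def cascade(case, risk_thr=0.2, unc_thr=0.3, biopsy_thr=0.5):
    """Stage 1 dismisses only low-risk, low-uncertainty cases; stage 2 decides on biopsy."""
    if case.risk < risk_thr and case.uncertainty < unc_thr:
        return "no_escalation"
    return "biopsy" if case.risk >= biopsy_thr else "follow_up"

# Past cases as (risk, uncertainty, pathology-confirmed malignant).
memory = [(0.30, 0.10, False), (0.25, 0.15, False), (0.35, 0.10, False), (0.80, 0.20, True)]
case = Case(risk=0.30, uncertainty=0.10)

without_memory = cascade(case)
with_memory = cascade(case, risk_thr=adapted_risk_threshold(case, memory))
```

In this toy setup the same case is escalated under the fixed threshold but dismissed at screening once similar benign outcomes are retrieved, which mirrors the mechanism by which the reported escalation rate falls without any parameter updates.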

arXiv | Cancer: unknown | Method: continual learning

Footprint-Guided Exemplar-Free Continual Histopathology Report Generation

Pratibha Kumari, Daniel Reisenbüchler, Afshin Bozorgpour, Yousef Sadegheih, Priyankar Choudhary, Dorit Merhof
Published 2026-02-27 08:58

This paper presents an exemplar-free continual learning framework for generating pathology reports from whole-slide images (WSIs). The method utilizes a compact domain footprint in a frozen patch-embedding space to synthesize pseudo-WSI representations, allowing for effective adaptation to new reporting conventions without retaining past data. The approach demonstrates improved performance on continual learning benchmarks, indicating its potential for deployment in dynamic clinical environments.

Rapid progress in vision-language modeling has enabled pathology report generation from gigapixel whole-slide images, but most approaches assume static training with simultaneous access to all data. In clinical deployment, however, new organs, institutions, and reporting conventions emerge over time, and sequential fine-tuning can cause catastrophic forgetting. We introduce an exemplar-free continual learning framework for WSI-to-report generation that avoids storing raw slides or patch exemplars. The core idea is a compact domain footprint built in a frozen patch-embedding space: a small codebook of representative morphology tokens together with slide-level co-occurrence summaries and lightweight patch-count priors. These footprints support generative replay by synthesizing pseudo-WSI representations that reflect domain-specific morphological mixtures, while a teacher snapshot provides pseudo-reports to supervise the updated model without retaining past data. To address shifting reporting conventions, we distill domain-specific linguistic characteristics into a compact style descriptor and use it to steer generation. At inference, the model identifies the most compatible descriptor directly from the slide signal, enabling domain-agnostic setup without requiring explicit domain identifiers. Evaluated across multiple public continual learning benchmarks, our approach outperforms exemplar-free and limited-buffer rehearsal baselines, highlighting footprint-based generative replay as a practical solution for deployment in evolving clinical settings.

arXiv | Cancer: unknown | Method: adaptive uncertainty-aware framework

AdURA-Net: Adaptive Uncertainty and Region-Aware Network

Antik Aich Roy, Ujjwal Bhattacharya
Published 2026-02-27 08:56

This paper presents AdURA-Net, an adaptive uncertainty-aware framework for reliable classification of thoracic diseases. The model addresses uncertainty in clinical decision-making by allowing the system to indicate when it lacks confidence in its predictions. Key features include adaptive dilated convolution and multiscale deformable alignment integrated with a DenseNet backbone, along with a Dual Head Loss function that improves learning from uncertain labels.

Uncertainty is a common issue in clinical decision-making. It often arises from ambiguity in radiology reports, which may reflect genuine diagnostic uncertainty or the limitations of automated label extraction in complex cases. This is especially true of multilabel datasets such as CheXpert and MIMIC-CXR, whose labels include positive, negative, and uncertain. The uncertain label is difficult to handle because the model should not be forced to give a confident prediction in the absence of sufficient evidence. The ability of the model to indicate that it does not know whenever it is not confident is crucial, especially in high-risk clinical decisions. Here, we propose AdURA-Net, a geometry-driven adaptive uncertainty-aware framework for reliable thoracic disease classification. The key highlights of the proposed model are: (a) adaptive dilated convolution and multiscale deformable alignment coupled with a DenseNet backbone to capture the anatomical complexity of medical images, and (b) a Dual Head Loss, which combines masked binary cross-entropy with logits and a Dirichlet evidential learning objective.
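
The masked binary cross-entropy component of the loss can be sketched in a few lines. This is an illustration under assumptions: it covers only the masking idea, not the Dirichlet evidential term or how the two heads are combined, and it assumes uncertain labels are coded as -1 (the convention used in CheXpert). The function name is hypothetical.

```python
import math

def masked_bce_with_logits(logits, labels, uncertain=-1):
    """Mean binary cross-entropy over labels that are not marked uncertain."""
    total, count = 0.0, 0
    for z, y in zip(logits, labels):
        if y == uncertain:
            continue  # uncertain labels contribute no gradient signal to this head
        # Numerically stable form of -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))]
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
        count += 1
    return total / count if count else 0.0

loss = masked_bce_with_logits(logits=[0.0, 2.0, -3.0], labels=[1, -1, 0])
```

Masking keeps the classifier from being pushed toward a confident positive or negative output on examples whose label is uncertain; a separate uncertainty-aware objective is then free to handle those cases.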
