Research Papers

ARXIV Cancer: breast cancer Method: pathology foundation models

Scanner-Induced Domain Shifts Undermine the Robustness of Pathology Foundation Models

Erik Thiringer, Fredrik K. Gustafsson, Kajsa Ledesma Eriksson, Mattias Rantalainen
Published 2026-01-07 18:24

This study evaluates the robustness of pathology foundation models (PFMs) to scanner-induced variability using a dataset of 384 breast cancer whole-slide images. The research highlights that current PFMs are not invariant to such variability, leading to scanner-dependent biases that can affect clinical reliability. Despite stable AUC scores, the embedding spaces of the models exhibit pronounced scanner-specific variability, indicating a critical failure mode. The findings suggest a need for improved evaluation methods that focus on embedding stability and calibration under realistic conditions.

Read abstract

Pathology foundation models (PFMs) have become central to computational pathology, aiming to offer general encoders for feature extraction from whole-slide images (WSIs). Despite strong benchmark performance, PFM robustness to real-world technical domain shifts, such as variability from whole-slide scanner devices, remains poorly understood. We systematically evaluated the robustness of 14 PFMs to scanner-induced variability, including state-of-the-art models, earlier self-supervised models, and a baseline trained on natural images. Using a multiscanner dataset of 384 breast cancer WSIs scanned on five devices, we isolated scanner effects independently from biological and laboratory confounders. Robustness is assessed via complementary unsupervised embedding analyses and a set of clinicopathological supervised prediction tasks. Our results demonstrate that current PFMs are not invariant to scanner-induced domain shifts. Most models encode pronounced scanner-specific variability in their embedding spaces. While AUC often remains stable, this masks a critical failure mode: scanner variability systematically alters the embedding space and impacts calibration of downstream model predictions, resulting in scanner-dependent bias that can impact reliability in clinical use cases. We further show that robustness is not a simple function of training data scale, model size, or model recency. None of the models provided reliable robustness against scanner-induced variability. While the models trained on the most diverse data, here represented by vision-language models, appear to have an advantage with respect to robustness, they underperformed on downstream supervised tasks. We conclude that development and evaluation of PFMs requires moving beyond accuracy-centric benchmarks toward explicit evaluation and optimisation of embedding stability and calibration under realistic acquisition variability.

ARXIV Cancer: low-grade glioma Method: ensemble model

Ensemble Models for Predicting Treatment Response in Pediatric Low-Grade Glioma Managed with Chemotherapy

Max Bengtsson, Elif Keles, Angela J. Waanders, Ulas Bagci
Published 2026-01-07 13:10

This paper presents a novel pipeline for predicting chemotherapy response in pediatric low-grade glioma using pre-treatment MRI and clinical data. The method combines a segmentation framework with radiomic feature extraction through an ensemble of a Swin UNETR encoder and an XGBoost classifier. The results indicate that the Swin-Ensemble model outperforms other approaches, achieving a precision of 0.68 and a recall of 0.85 for non-effective cases, highlighting its potential for personalized therapy response prediction.

Read abstract

In this paper, we introduce a novel pipeline for predicting chemotherapy response in pediatric brain tumors that are not amenable to complete surgical resection, using pre-treatment magnetic resonance imaging combined with clinical information. Our method integrates a state-of-the-art pediatric brain tumor segmentation framework with radiomic feature extraction and clinical data through an ensemble of a Swin UNETR encoder and XGBoost classifier. The segmentation model delineates four tumor subregions enhancing tumor, non-enhancing tumor, cystic component and edema which are used to extract imaging biomarkers and generate predictive features. The Swin UNETR network classifies the response to treatment directly from these segmented MRI scans, while XGBoost predicts response using radiomics and clinical variables including legal sex, ethnicity, race, age at event (in days), molecular subtype, tumor locations, initial surgery status, metastatic status, metastasis location, chemotherapy type, protocol name and chemotherapy agents. The ensemble output provides a non-invasive estimate of chemotherapy response in this historically challenging population characterized by lower progression-free survival. Among compared approaches, our Swin-Ensemble achieved the best performance (precision for non effective cases=0.68, recall for non effective cases=0.85, precision for chemotherapy effective cases=0.64 and overall accuracy=0.69), outperforming Mamba-FeatureFuse, Swin UNETR encoder, and Swin-FeatureFuse models. Our findings suggest that this ensemble framework represents a promising step toward personalized therapy response prediction for pediatric low-grade glioma patients in need of chemotherapy treatment who are not suitable for complete surgical resection, a population with significantly lower progression free survival and for whom chemotherapy remains the primary treatment option.

ARXIV Cancer: general cancer Method: multimodal learning

RadDiff: Describing Differences in Radiology Image Sets with Natural Language

Xiaoxian Shen, Yuhui Zhang, Sahithi Ankireddy, Xiaohan Wang, Maya Varma, Henry Guo, Curtis Langlotz, Serena Yeung-Levy
Published 2026-01-07 09:25

The paper introduces RadDiff, a multimodal system designed to perform comparative reasoning on radiology image sets to identify clinically significant differences. It employs a proposer-ranker framework and integrates medical knowledge with vision-language models, enhancing its reasoning capabilities through iterative hypothesis refinement and targeted visual search. Evaluation on the RadDiffBench benchmark demonstrates RadDiff's effectiveness, achieving notable accuracy in identifying differences across various clinical tasks.

Read abstract

Understanding how two radiology image sets differ is critical for generating clinical insights and for interpreting medical AI systems. We introduce RadDiff, a multimodal agentic system that performs radiologist-style comparative reasoning to describe clinically meaningful differences between paired radiology studies. RadDiff builds on a proposer-ranker framework from VisDiff, and incorporates four innovations inspired by real diagnostic workflows: (1) medical knowledge injection through domain-adapted vision-language models; (2) multimodal reasoning that integrates images with their clinical reports; (3) iterative hypothesis refinement across multiple reasoning rounds; and (4) targeted visual search that localizes and zooms in on salient regions to capture subtle findings. To evaluate RadDiff, we construct RadDiffBench, a challenging benchmark comprising 57 expert-validated radiology study pairs with ground-truth difference descriptions. On RadDiffBench, RadDiff achieves 47% accuracy, and 50% accuracy when guided by ground-truth reports, significantly outperforming the general-domain VisDiff baseline. We further demonstrate RadDiff's versatility across diverse clinical tasks, including COVID-19 phenotype comparison, racial subgroup analysis, and discovery of survival-related imaging features. Together, RadDiff and RadDiffBench provide the first method-and-benchmark foundation for systematically uncovering meaningful differences in radiological data.

ARXIV Cancer: prostate cancer Method: large language model

CPGPrompt: Translating Clinical Guidelines into LLM-Executable Decision Support

Ruiqi Deng, Geoffrey Martin, Tony Wang, Gongbo Zhang, Yi Liu, Chunhua Weng, Yanshan Wang, Justin F Rousseau, Yifan Peng
Published 2026-01-07 00:05

This paper presents CPGPrompt, an auto-prompting system designed to convert clinical practice guidelines into formats executable by large language models (LLMs). The framework translates guidelines into structured decision trees, allowing for dynamic navigation during patient case evaluations. The system was validated using synthetic vignettes across various domains, including prostate cancer, achieving strong performance in binary specialty-referral decisions, while showing variability in multi-class pathway assignments.

Read abstract

Clinical practice guidelines (CPGs) provide evidence-based recommendations for patient care; however, integrating them into Artificial Intelligence (AI) remains challenging. Previous approaches, such as rule-based systems, face significant limitations, including poor interpretability, inconsistent adherence to guidelines, and narrow domain applicability. To address this, we develop and validate CPGPrompt, an auto-prompting system that converts narrative clinical guidelines into large language models (LLMs). Our framework translates CPGs into structured decision trees and utilizes an LLM to dynamically navigate them for patient case evaluation. Synthetic vignettes were generated across three domains (headache, lower back pain, and prostate cancer) and distributed into four categories to test different decision scenarios. System performance was assessed on both binary specialty-referral decisions and fine-grained pathway-classification tasks. The binary specialty referral classification achieved consistently strong performance across all domains (F1: 0.85-1.00), with high recall (1.00 $\pm$ 0.00). In contrast, multi-class pathway assignment showed reduced performance, with domain-specific variations: headache (F1: 0.47), lower back pain (F1: 0.72), and prostate cancer (F1: 0.77). Domain-specific performance differences reflected the structure of each guideline. The headache guideline highlighted challenges with negation handling. The lower back pain guideline required temporal reasoning. In contrast, prostate cancer pathways benefited from quantifiable laboratory tests, resulting in more reliable decision-making.

ARXIV Cancer: pancreatic cancer Method: deep learning

Inferring Clinically Relevant Molecular Subtypes of Pancreatic Cancer from Routine Histopathology Using Deep Learning

Abdul Rehman Akbar, Alejandro Levya, Ashwini Esnakula, Elshad Hasanov, Anne Noonan, Lingbin Meng, Susan Tsai, Vaibhav Sahai, Midhun Malla, Sarbajit Mukherjee, Upender Manne, Anil Parwani, Wei Chen, Ashish Manne, Muhammad Khalid Khan Niazi
Published 2026-01-06 20:52

This study presents PanSubNet, a deep learning framework designed to predict clinically relevant molecular subtypes of pancreatic cancer directly from standard H&E-stained whole slide images. The model was developed using data from over 1,000 patients and demonstrated high accuracy in internal and external validation. PanSubNet aims to facilitate rapid and cost-effective molecular stratification, enhancing the management of pancreatic ductal adenocarcinoma (PDAC) in clinical settings.

Read abstract

Molecular subtyping of PDAC into basal-like and classical has established prognostic and predictive value. However, its use in clinical practice is limited by cost, turnaround time, and tissue requirements, thereby restricting its application in the management of PDAC. We introduce PanSubNet, an interpretable deep learning framework that predicts therapy-relevant molecular subtypes directly from standard H&E-stained WSIs. PanSubNet was developed using data from 1,055 patients across two multi-institutional cohorts (PANCAN, n=846; TCGA, n=209) with paired histology and RNA-seq data. Ground-truth labels were derived using the validated Moffitt 50-gene signature refined by GATA6 expression. The model employs dual-scale architecture that fuses cellular-level morphology with tissue-level architecture, leveraging attention mechanisms for multi-scale representation learning and transparent feature attribution. On internal validation within PANCAN using five-fold cross-validation, PanSubNet achieved mean AUC of 88.5% with balanced sensitivity and specificity. External validation on the independent TCGA cohort without fine-tuning demonstrated robust generalizability (AUC 84.0%). PanSubNet preserved and, in metastatic disease, strengthened prognostic stratification compared to RNA-seq based labels. Prediction uncertainty linked to intermediate transcriptional states, not classification noise. Model predictions are aligned with established transcriptomic programs, differentiation markers, and DNA damage repair signatures. By enabling rapid, cost-effective molecular stratification from routine H&E-stained slides, PanSubNet offers a clinically deployable and interpretable tool for genetic subtyping. We are gathering data from two institutions to validate and assess real-world performance, supporting integration into digital pathology workflows and advancing precision oncology for PDAC.

ARXIV Cancer: general cancer Method: small language models

Multi-RADS Synthetic Radiology Report Dataset and Head-to-Head Benchmarking of 41 Open-Weight and Proprietary Language Models

Kartik Bose, Abhinandan Kumar, Raghuraman Soundararajan, Priya Mudgil, Samonee Ralmilay, Niharika Dutta, Manphool Singhal, Arun Kumar, Saugata Sen, Anurima Patra, Priya Ghosh, Abanti Das, Amit Gupta, Ashish Verma, Dipin Sudhakaran, Ekta Dhamija, Himangi Unde, Ishan Kumar, Krithika Rangarajan, Prerna Garg, Rachel Sequeira, Sudhin Shylendran, Taruna Yadav, Tej Pal, Pankaj Gupta
Published 2026-01-06 18:18

The study aims to create RXL-RADSet, a synthetic benchmark for automated RADS assignment from radiology reports, and to evaluate the performance of various small language models (SLMs) against a proprietary model. The dataset includes 1,600 synthetic reports verified by radiologists and covers multiple RADS frameworks. Results indicate that larger SLMs can achieve high validity and accuracy, particularly under guided prompting, although challenges remain with more complex RADS classifications.

Read abstract

Background: Reporting and Data Systems (RADS) standardize radiology risk communication but automated RADS assignment from narrative reports is challenging because of guideline complexity, output-format constraints, and limited benchmarking across RADS frameworks and model sizes. Purpose: To create RXL-RADSet, a radiologist-verified synthetic multi-RADS benchmark, and compare validity and accuracy of open-weight small language models (SLMs) with a proprietary model for RADS assignment. Materials and Methods: RXL-RADSet contains 1,600 synthetic radiology reports across 10 RADS (BI-RADS, CAD-RADS, GB-RADS, LI-RADS, Lung-RADS, NI-RADS, O-RADS, PI-RADS, TI-RADS, VI-RADS) and multiple modalities. Reports were generated by LLMs using scenario plans and simulated radiologist styles and underwent two-stage radiologist verification. We evaluated 41 quantized SLMs (12 families, 0.135-32B parameters) and GPT-5.2 under a fixed guided prompt. Primary endpoints were validity and accuracy; a secondary analysis compared guided versus zero-shot prompting. Results: Under guided prompting GPT-5.2 achieved 99.8% validity and 81.1% accuracy (1,600 predictions). Pooled SLMs (65,600 predictions) achieved 96.8% validity and 61.1% accuracy; top SLMs in the 20-32B range reached ~99% validity and mid-to-high 70% accuracy. Performance scaled with model size (inflection between <1B and >=10B) and declined with RADS complexity primarily due to classification difficulty rather than invalid outputs. Guided prompting improved validity (99.2% vs 96.7%) and accuracy (78.5% vs 69.6%) compared with zero-shot. Conclusion: RXL-RADSet provides a radiologist-verified multi-RADS benchmark; large SLMs (20-32B) can approach proprietary-model performance under guided prompting, but gaps remain for higher-complexity schemes.

ARXIV Cancer: unknown Method: multimodal large language model

AnatomiX, an Anatomy-Aware Grounded Multimodal Large Language Model for Chest X-Ray Interpretation

Anees Ur Rehman Hashmi, Numan Saeed, Christoph Lippert
Published 2026-01-06 17:13

The paper presents AnatomiX, a multitask multimodal large language model designed for anatomically grounded interpretation of chest X-rays. It employs a two-stage approach that first identifies anatomical structures and extracts their features, followed by leveraging a large language model for various downstream tasks. The results indicate that AnatomiX significantly enhances anatomical reasoning, achieving over 25% improvement in performance on several tasks compared to existing methods.

Read abstract

Multimodal medical large language models have shown substantial progress in chest X-ray interpretation but continue to face challenges in spatial reasoning and anatomical understanding. Although existing grounding techniques improve overall performance, they often fail to establish a true anatomical correspondence, resulting in incorrect anatomical understanding in the medical domain. To address this gap, we introduce AnatomiX, a multitask multimodal large language model for anatomically grounded chest X-ray interpretation. Inspired by the radiological workflow, AnatomiX adopts a two stage approach: first, it identifies anatomical structures and extracts their features, and then leverages a large language model to perform diverse downstream tasks such as phrase grounding, report generation, visual question answering, and image understanding. Extensive experiments across multiple benchmarks demonstrate that AnatomiX achieves superior anatomical reasoning and delivers over 25% improvement in performance on anatomy grounding, phrase grounding, grounded diagnosis and grounded captioning tasks compared to existing approaches. Code and pretrained model are available at github.com/aneesurhashmi/anatomix.

ARXIV Cancer: general cancer Method: reinforcement learning

Aligning Findings with Diagnosis: A Self-Consistent Reinforcement Learning Framework for Trustworthy Radiology Reporting

Kun Zhao, Siyuan Dai, Pan Wang, Jifeng Song, Hui Ji, Chenghua Lin, Liang Zhan, Haoteng Tang
Published 2026-01-06 14:17

This paper presents a novel framework for self-consistent radiology report generation using Multimodal Large Language Models (MLLMs). The proposed 'Reason-then-Summarize' architecture, optimized through Group Relative Policy Optimization (GRPO), aims to align linguistic outputs with visual evidence while minimizing factual hallucinations. Experimental results on the MIMIC-CXR benchmark indicate that the method achieves state-of-the-art performance in clinical efficacy metrics.

Read abstract

Multimodal Large Language Models (MLLMs) have shown strong potential for radiology report generation, yet their clinical translation is hindered by architectural heterogeneity and the prevalence of factual hallucinations. Standard supervised fine-tuning often fails to strictly align linguistic outputs with visual evidence, while existing reinforcement learning approaches struggle with either prohibitive computational costs or limited exploration. To address these challenges, we propose a comprehensive framework for self-consistent radiology report generation. First, we conduct a systematic evaluation to identify optimal vision encoder and LLM backbone configurations for medical imaging. Building on this foundation, we introduce a novel "Reason-then-Summarize" architecture optimized via Group Relative Policy Optimization (GRPO). This framework restructures generation into two distinct components: a think block for detailed findings and an answer block for structured disease labels. By utilizing a multi-dimensional composite reward function, we explicitly penalize logical discrepancies between the generated narrative and the final diagnosis. Extensive experiments on the MIMIC-CXR benchmark demonstrate that our method achieves state-of-the-art performance in clinical efficacy metrics and significantly reduces hallucinations compared to strong supervised baselines.

ARXIV Cancer: unknown Method: Swin Transformer U-Net 3D

Lesion Segmentation in FDG-PET/CT Using Swin Transformer U-Net 3D: A Robust Deep Learning Framework

Shovini Guha, Dwaipayan Nandi
Published 2026-01-06 09:52

This paper introduces the Swin Transformer U-Net 3D (SwinUNet3D) framework for automated lesion segmentation in FDG-PET/CT imaging, which is crucial for cancer diagnosis and therapy planning. The model combines shifted window self-attention with U-Net style skip connections to enhance both global context and fine anatomical detail. Evaluation on the AutoPET III FDG dataset shows that SwinUNet3D significantly outperforms the baseline 3D U-Net in terms of Dice score and IoU, while also providing faster inference times.

Read abstract

Accurate and automated lesion segmentation in Positron Emission Tomography / Computed Tomography (PET/CT) imaging is essential for cancer diagnosis and therapy planning. This paper presents a Swin Transformer UNet 3D (SwinUNet3D) framework for lesion segmentation in Fluorodeoxyglucose Positron Emission Tomography / Computed Tomography (FDG-PET/CT) scans. By combining shifted window self-attention with U-Net style skip connections, the model captures both global context and fine anatomical detail. We evaluate SwinUNet3D on the AutoPET III FDG dataset and compare it against a baseline 3D U-Net. Results show that SwinUNet3D achieves a Dice score of 0.88 and IoU of 0.78, surpassing 3D U-Net (Dice 0.48, IoU 0.32) while also delivering faster inference times. Qualitative analysis demonstrates improved detection of small and irregular lesions, reduced false positives, and more accurate PET/CT fusion. While the framework is currently limited to FDG scans and trained under modest GPU resources, it establishes a strong foundation for future multi-tracer, multi-center evaluations and benchmarking against other transformer-based architectures. Overall, SwinUNet3D represents an efficient and robust approach to PET/CT lesion segmentation, advancing the integration of transformer-based models into oncology imaging workflows.

ARXIV Cancer: general cancer Method: graph contrastive learning

Topology-aware Pathological Consistency Matching for Weakly-Paired IHC Virtual Staining

Mingzhou Jiang, Jiaying Zhou, Nan Zeng, Mickael Li, Qijie Tang, Chao He, Huazhu Fu, Honghui He
Published 2026-01-06 08:28

This paper presents a novel framework for virtual staining that converts Hematoxylin and Eosin (H&E) images to immunohistochemical (IHC) images, addressing the challenges of weakly-paired data due to spatial misalignment. The proposed Topology-aware Consistency Matching (TACM) mechanism utilizes graph contrastive learning to ensure structural consistency, while the Topology-constrained Pathological Matching (TCPM) mechanism enhances pathological consistency. Experimental results indicate that the method significantly outperforms existing approaches, achieving higher generation quality and clinical relevance.

Read abstract

Immunohistochemical (IHC) staining provides crucial molecular characterization of tissue samples and plays an indispensable role in the clinical examination and diagnosis of cancers. However, compared with the commonly used Hematoxylin and Eosin (H&E) staining, IHC staining involves complex procedures and is both time-consuming and expensive, which limits its widespread clinical use. Virtual staining converts H&E images to IHC images, offering a cost-effective alternative to clinical IHC staining. Nevertheless, using adjacent slides as ground truth often results in weakly-paired data with spatial misalignment and local deformations, hindering effective supervised learning. To address these challenges, we propose a novel topology-aware framework for H&E-to-IHC virtual staining. Specifically, we introduce a Topology-aware Consistency Matching (TACM) mechanism that employs graph contrastive learning and topological perturbations to learn robust matching patterns despite spatial misalignments, ensuring structural consistency. Furthermore, we propose a Topology-constrained Pathological Matching (TCPM) mechanism that aligns pathological positive regions based on node importance to enhance pathological consistency. Extensive experiments on two benchmarks across four staining tasks demonstrate that our method outperforms state-of-the-art approaches, achieving superior generation quality with higher clinical relevance.

Find the papers that actually matter