Research Papers

ARXIV Cancer: lung cancer Method: denoising diffusion probabilistic models

DiffusionXRay: A Diffusion and GAN-Based Approach for Enhancing Digitally Reconstructed Chest Radiographs

Aryan Goyal, Ashish Mittal, Pranav Rao, Manoj Tadepalli, Preetham Putha
Published 2026-03-02 10:14

The paper presents DiffusionXRay, an image restoration pipeline that enhances digitally reconstructed chest radiographs (DRRs) for lung cancer diagnosis. It combines denoising diffusion probabilistic models (DDPMs) and generative adversarial networks (GANs), and it works around the scarcity of high-quality annotated data by first synthesizing low-quality images as a style transfer problem. The authors report promising gains in image clarity, contrast, and diagnostic value, validated through quantitative metrics and expert radiological assessment.

Deep learning-based automated diagnosis of lung cancer has emerged as a crucial advancement that enables healthcare professionals to detect the disease and initiate treatment earlier. However, these models require extensive training datasets with diverse case-specific properties. High-quality annotated data is particularly challenging to obtain, especially for cases with subtle pulmonary nodules that are difficult to detect even for experienced radiologists. This scarcity of well-labeled datasets can limit model performance and generalization across different patient populations. Digitally reconstructed radiographs (DRRs), which use CT scans to generate synthetic frontal chest X-rays with artificially inserted lung nodules, offer one potential solution. However, this approach suffers from significant image quality degradation, particularly in the form of blurred anatomical features and loss of fine lung field structures. To overcome this, we introduce DiffusionXRay, a novel image restoration pipeline for chest X-ray (CXR) images that combines denoising diffusion probabilistic models (DDPMs) and generative adversarial networks (GANs). DiffusionXRay incorporates a two-stage training process. First, we investigate two independent approaches, DDPM-LQ and GAN-based MUNIT-LQ, to generate low-quality CXRs, addressing the challenge of training data scarcity by posing it as a style transfer problem. Subsequently, we train a DDPM-based model on paired low-quality and high-quality images, enabling it to learn the nuances of X-ray image restoration. Our method demonstrates promising results in enhancing image clarity, contrast, and overall diagnostic value of chest X-rays while preserving subtle yet clinically significant artifacts, validated by both quantitative metrics and expert radiological assessment.
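
For readers unfamiliar with DDPMs, the forward (noising) process that such models learn to reverse can be sketched as follows. This is the textbook closed-form DDPM noising step, not the DiffusionXRay code: the linear schedule, its endpoints, and the function names are assumptions chosen for illustration, and the abstract does not state which schedule the authors use.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (assumed schedule, for illustration)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alpha_bar(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def forward_noise(x0, t, alpha_bars, rng):
    """Sample x_t from N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I).

    x0 is a flat list of pixel values; t is a zero-based step index.
    Returns the noised signal and the Gaussian noise that was added.
    """
    a = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, noise)]
    return xt, noise


if __name__ == "__main__":
    alpha_bars = cumulative_alpha_bar(linear_beta_schedule(1000))
    pixels = [0.2, 0.5, 0.9]  # toy stand-in for an image
    noisy, eps = forward_noise(pixels, 500, alpha_bars, random.Random(0))
    print(noisy)
```

A restoration model of the kind described in the abstract would be trained to undo this corruption, conditioned on the low-quality input; that conditioning is not shown here.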

ARXIV Cancer: lung cancer Method: diffusion model

A Diffusion-Driven Fine-Grained Nodule Synthesis Framework for Enhanced Lung Nodule Detection from Chest Radiographs

Aryan Goyal, Shreshtha Singh, Ashish Mittal, Manoj Tadepalli, Piyush Kumar, Preetham Putha
Published 2026-03-02 09:43

This paper presents a novel diffusion-driven framework for synthesizing lung nodules to enhance their detection in chest radiographs. The proposed method utilizes low-rank adaptation (LoRA) to achieve fine-grained control over nodule characteristics, addressing challenges in data scarcity for training deep learning models. Experimental results indicate that the framework improves nodule detection performance compared to existing methods, with positive evaluations from radiologists on the generated nodules.

Early detection of lung cancer in chest radiographs (CXRs) is crucial for improving patient outcomes, yet nodule detection remains challenging because nodules are subtle in appearance and vary in radiological characteristics such as size, texture, and boundary. For robust analysis, this diversity must be well represented in training datasets for deep learning-based Computer-Assisted Diagnosis (CAD) systems. However, assembling such datasets is costly and often impractical, motivating the need for realistic synthetic data generation. Existing methods lack fine-grained control over synthetic nodule generation, limiting their utility in addressing data scarcity. This paper proposes a novel diffusion-based framework with low-rank adaptation (LoRA) adapters for characteristic-controlled nodule synthesis on CXRs. We begin by addressing size and shape control through nodule-mask-conditioned training of the base diffusion model. To achieve individual characteristic control, we train separate LoRA modules, each dedicated to a specific radiological feature. However, since nodules rarely exhibit isolated characteristics, effective multi-characteristic control requires a balanced integration of features. We address this by leveraging the dynamic composability of LoRAs and revisiting existing merging strategies. Building on this, we identify two key issues: overlapping attention regions and non-orthogonal parameter spaces. To overcome these limitations, we introduce a novel orthogonality loss term during LoRA composition training. Extensive experiments on both in-house and public datasets demonstrate improved downstream nodule detection. Radiologist evaluations confirm the fine-grained controllability of our generated nodules, and across multiple quantitative metrics, our method surpasses existing nodule generation approaches for CXRs.
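
The abstract does not give the form of the orthogonality loss, so the following is only one plausible reading, offered as a sketch. It assumes each LoRA adapter has a down-projection matrix whose rows span the input subspace it modifies, and it penalizes the overlap between two adapters as the squared Frobenius norm of the product of one matrix with the transpose of the other. The function names and the choice of matrices are hypothetical.

```python
def matmul_transpose(a, b):
    """Return a @ b^T for matrices stored as lists of rows."""
    return [
        [sum(x * y for x, y in zip(row_a, row_b)) for row_b in b]
        for row_a in a
    ]


def orthogonality_penalty(a1, a2):
    """Squared Frobenius norm of a1 @ a2^T.

    The value is zero exactly when every row of a1 is orthogonal to every
    row of a2, i.e. when the two adapters act on orthogonal subspaces.
    """
    cross = matmul_transpose(a1, a2)
    return sum(value * value for row in cross for value in row)


if __name__ == "__main__":
    # Hypothetical rank-1 adapters for two nodule characteristics.
    texture_adapter = [[1.0, 0.0, 0.0]]
    boundary_adapter = [[0.0, 1.0, 0.0]]
    print(orthogonality_penalty(texture_adapter, boundary_adapter))
    print(orthogonality_penalty(texture_adapter, texture_adapter))
```

In a training loop, a term like this would be added to the diffusion loss with a weighting coefficient, encouraging the composed adapters to interfere less with one another.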

ARXIV Cancer: unknown Method: vision-language model

Measuring What VLMs Don't Say: Validation Metrics Hide Clinical Terminology Erasure in Radiology Report Generation

Aditya Parikh, Aasa Feragen, Sneha Das, Stella Frank
Published 2026-03-02 08:59

This paper addresses the limitations of current validation metrics for Vision-Language Models (VLMs) in radiology, particularly their tendency to overlook clinical terminology. It introduces new metrics, Clinical Association Displacement (CAD) and Weighted Association Erasure (WAE), to evaluate the clinical specificity and demographic fairness of generated reports. The findings indicate that deterministic decoding leads to significant semantic erasure, while stochastic sampling, although more diverse, may introduce bias.

Reliable deployment of Vision-Language Models (VLMs) in radiology requires validation metrics that go beyond surface-level text similarity to ensure clinical fidelity and demographic fairness. This paper investigates a critical blind spot in current model evaluation: the use of decoding strategies that lead to high aggregate token-overlap scores despite succumbing to template collapse, in which models generate only repetitive, safe generic text and omit clinical terminology. Unaddressed, this blind spot can lead to metric gaming, where models that perform well on benchmarks prove clinically uninformative. Instead, we advocate for lexical diversity measures to check model generations for clinical specificity. We introduce Clinical Association Displacement (CAD), a vocabulary-level framework that quantifies shifts in demographic-based word associations in generated reports. Weighted Association Erasure (WAE) aggregates these shifts to measure the clinical signal loss across demographic groups. We show that deterministic decoding produces high levels of semantic erasure, while stochastic sampling generates diverse outputs but risks introducing new bias, motivating a fundamental rethink of how "optimal" reporting is defined.
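
The abstract does not name the specific lexical diversity measures it advocates. As one common example, a distinct-n ratio (unique n-grams divided by total n-grams across a set of generated reports) drops sharply under template collapse. The sketch below assumes whitespace tokenization and a hypothetical function name; it is not the paper's CAD or WAE metric.

```python
def distinct_n(texts, n=1):
    """Fraction of n-grams across all texts that are unique.

    Values near 1.0 indicate varied wording; values near 0.0 indicate
    that the same phrases are repeated across reports.
    """
    total = 0
    unique = set()
    for text in texts:
        tokens = text.lower().split()
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0


if __name__ == "__main__":
    collapsed = ["no acute findings"] * 3
    varied = [
        "no acute findings",
        "small left pleural effusion",
        "right upper lobe nodule",
    ]
    print(distinct_n(collapsed), distinct_n(varied))
```

A token-overlap score could rate the collapsed set highly if the references are also mostly normal studies, which is the blind spot the paper describes; the diversity ratio makes the repetition visible.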

ARXIV Cancer: pediatric brain tumor Method: mixture-of-experts

PathMoE: Interpretable Multimodal Interaction Experts for Pediatric Brain Tumor Classification

Jian Yu, Joakim Nguyen, Jinrui Fang, Awais Naeem, Zeyuan Cao, Sanjay Krishnan, Nicholas Konz, Tianlong Chen, Chandra Krishnan, Hairong Wang, Edward Castillo, Ying Ding, Ankita Shukla
Published 2026-03-02 07:17

This paper presents PathMoE, an interpretable multimodal framework designed for the classification of pediatric central nervous system tumors. The framework integrates whole-slide images, pathology reports, and nuclei-level cell graphs using an interaction-aware mixture-of-experts architecture. On an internal pediatric brain tumor dataset, combining all three modalities raises macro-F1 from 0.762 to 0.799; on TCGA, adding graph information to whole-slide images raises it from 0.668 to 0.709. Both gains are measured against image-only baselines, and the gating weights expose which modality interactions drive each prediction.

Accurate classification of pediatric central nervous system tumors remains challenging due to histological complexity and limited training data. While pathology foundation models have advanced whole-slide image (WSI) analysis, they often fail to leverage the rich, complementary information found in clinical text and tissue microarchitecture. To this end, we propose PathMoE, an interpretable multimodal framework that integrates H&E slides, pathology reports, and nuclei-level cell graphs via an interaction-aware mixture-of-experts architecture built on state-of-the-art foundation models for each modality. By training specialized experts to capture modality uniqueness, redundancy, and synergy, PathMoE employs an input-dependent gating mechanism that dynamically weights these interactions, providing sample-level interpretability. We evaluate our framework on two dataset-specific classification tasks on an internal pediatric brain tumor dataset (PBT) and external TCGA datasets. PathMoE improves macro-F1 from 0.762 to 0.799 (+0.037) on PBT when integrating WSI, text, and graph modalities; on TCGA, augmenting WSI with graph knowledge improves macro-F1 from 0.668 to 0.709 (+0.041). These results demonstrate significant performance gains over state-of-the-art image-only baselines while revealing the specific modality interactions driving individual predictions. This interpretability is particularly critical for rare tumor subtypes, where transparent model reasoning is essential for clinical trust and diagnostic validation.
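
The input-dependent gating described above can be illustrated with a generic mixture-of-experts combiner. This is a minimal sketch under assumptions, not PathMoE's architecture: the gate logits are taken as given (in practice a small network would compute them from the sample), each expert is assumed to output class scores, and all names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_experts(gate_logits, expert_outputs):
    """Weight each expert's class scores by its softmax gate weight.

    Returns the mixed class scores and the per-expert weights. The weights
    are what make the prediction inspectable for an individual sample.
    """
    weights = softmax(gate_logits)
    num_classes = len(expert_outputs[0])
    mixed = [
        sum(w * out[c] for w, out in zip(weights, expert_outputs))
        for c in range(num_classes)
    ]
    return mixed, weights


if __name__ == "__main__":
    # Hypothetical experts for uniqueness, redundancy, and synergy.
    outputs = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]
    scores, weights = mix_experts([2.0, 0.5, 0.1], outputs)
    print(scores, weights)
```

Because the weights sum to one and are computed per sample, reading them off gives a simple account of which interaction expert dominated a given prediction.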

ARXIV Cancer: pancreatic cancer Method: large language models

PanCanBench: A Comprehensive Benchmark for Evaluating Large Language Models in Pancreatic Oncology

Yimin Zhao, Sheela R. Damle, Simone E. Dekker, Scott Geng, Karly Williams Silva, Jesse J Hubbard, Manuel F Fernandez, Fatima Zelada-Arenas, Alejandra Alvarez, Brianne Flores, Alexis Rodriguez, Stephen Salerno, Carrie Wright, Zihao Wang, Pang Wei Koh, Jeffrey T. Leek
Published 2026-03-02 00:50

The paper presents PanCanBench, a benchmark designed to evaluate large language models (LLMs) in the context of pancreatic oncology. It highlights the limitations of existing evaluation frameworks and introduces a human-in-the-loop pipeline for creating expert rubrics based on real patient questions. The study assesses 22 LLMs with an LLM-as-a-judge framework: rubric-based completeness scores range from 46.5% to 82.3%, and hallucination rates range from 6.0% to 53.8%.

Large language models (LLMs) have achieved expert-level performance on standardized examinations, yet multiple-choice accuracy poorly reflects real-world clinical utility and safety. As patients and clinicians increasingly use LLMs for guidance on complex conditions such as pancreatic cancer, evaluation must extend beyond general medical knowledge. Existing frameworks, such as HealthBench, rely on simulated queries and lack disease-specific depth. Moreover, high rubric-based scores do not ensure factual correctness, underscoring the need to assess hallucinations. We developed a human-in-the-loop pipeline to create expert rubrics for de-identified patient questions from the Pancreatic Cancer Action Network (PanCAN). The resulting benchmark, PanCanBench, includes 3,130 question-specific criteria across 282 authentic patient questions. We evaluated 22 proprietary and open-source LLMs using an LLM-as-a-judge framework, measuring clinical completeness, factual accuracy, and web-search integration. Models showed substantial variation in rubric-based completeness, with scores ranging from 46.5% to 82.3%. Factual errors were common, with hallucination rates (the percentages of responses containing at least one factual error) ranging from 6.0% for Gemini-2.5 Pro and GPT-4o to 53.8% for Llama-3.1-8B. Importantly, newer reasoning-optimized models did not consistently improve factuality: although o3 achieved the highest rubric score, it produced inaccuracies more frequently than other GPT-family models. Web-search integration did not inherently guarantee better responses. The average score changed from 66.8% to 63.9% for Gemini-2.5 Pro and from 73.8% to 72.8% for GPT-5 when web search was enabled. Synthetic AI-generated rubrics inflated absolute scores by 17.9 points on average while generally maintaining similar relative ranking.
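
The hallucination rate defined in the abstract (the percentage of responses containing at least one factual error) is a per-response measure, not a per-claim one. A minimal sketch of that arithmetic, assuming a judge has already counted the factual errors in each response; the function name and input format are hypothetical.

```python
def hallucination_rate(error_counts):
    """Percentage of responses with at least one factual error.

    error_counts holds one non-negative integer per response: the number
    of factual errors a judge found in that response.
    """
    if not error_counts:
        raise ValueError("need at least one response")
    flagged = sum(1 for count in error_counts if count > 0)
    return 100.0 * flagged / len(error_counts)


if __name__ == "__main__":
    # Four responses: two are clean, two contain one or more errors.
    print(hallucination_rate([0, 2, 0, 1]))  # 50.0
```

Under this definition a response with five errors counts the same as a response with one, so the rate measures how often a model errs at all and says nothing about how badly.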

ARXIV Cancer: breast cancer Method: deep learning

The MAMA-MIA Challenge: Advancing Generalizability and Fairness in Breast MRI Tumor Segmentation and Treatment Response Prediction

Lidia Garrucho, Smriti Joshi, Kaisar Kushibar, Richard Osuala, Maciej Bobowicz, Xavier Bargalló, Paulius Jaruševičius, Kai Geissler, Raphael Schäfer, Muhammad Alberb, Tony Xu, Anne Martel, Daniel Sleiman, Navchetan Awasthi, Hadeel Awwad, Joan C. Vilanova, Robert Martí, Daan Schouten, Jeong Hoon Lee, Mirabela Rusu, Eleonora Poeta, Luisa Vargas, Eliana Pastor, Maria A. Zuluaga, Jessica Kächele, Dimitrios Bounias, Alexandra Ertl, Katarzyna Gwoździewicz, Maria-Laura Cosaka, Pasant M. Abo-Elhoda, Sara W. Tantawy, Shorouq S. Sakrana, Norhan O. Shawky-Abdelfatah, Amr Muhammad Abdo-Salem, Androniki Kozana, Eugen Divjak, Gordana Ivanac, Katerina Nikiforaki, Michail E. Klontzas, Rosa García-Dosdá, Meltem Gulsun-Akpinar, Oğuz Lafcı, Carlos Martín-Isla, Oliver Díaz, Laura Igual, Karim Lekadir
Published 2026-03-01 20:06

The MAMA-MIA Challenge aimed to enhance the generalizability and fairness of artificial intelligence models in breast MRI tumor segmentation and treatment response prediction. By utilizing a large-scale benchmark with data from multiple institutions, the challenge evaluated the performance of AI systems across diverse demographic subgroups. Results indicated significant variability in performance under external testing, highlighting the need for equitable AI systems in breast cancer imaging.

Breast cancer is the most frequently diagnosed malignancy among women worldwide and a leading cause of cancer-related mortality. Dynamic contrast-enhanced magnetic resonance imaging plays a central role in tumor characterization and treatment monitoring, particularly in patients receiving neoadjuvant chemotherapy. However, existing artificial intelligence models for breast magnetic resonance imaging are often developed using single-center data and evaluated using aggregate performance metrics, limiting their generalizability and obscuring potential performance disparities across demographic subgroups. The MAMA-MIA Challenge was designed to address these limitations by introducing a large-scale benchmark that jointly evaluates primary tumor segmentation and prediction of pathologic complete response using pre-treatment magnetic resonance imaging only. The training cohort comprised 1,506 patients from multiple institutions in the United States, while evaluation was conducted on an external test set of 574 patients from three independent European centers to assess cross-continental and cross-institutional generalization. A unified scoring framework combined predictive performance with subgroup consistency across age, menopausal status, and breast density. Twenty-six international teams participated in the final evaluation phase. Results demonstrate substantial performance variability under external testing and reveal trade-offs between overall accuracy and subgroup fairness. The challenge provides standardized datasets, evaluation protocols, and public resources to promote the development of robust and equitable artificial intelligence systems for breast cancer imaging.

ARXIV Cancer: colorectal cancer Method: neuro-symbolic framework

Contextual Invertible World Models: A Neuro-Symbolic Agentic Framework for Colorectal Cancer Drug Response

Christopher Baker, Karen Rafferty, Hui Wang
Published 2026-03-01 16:15

This paper presents a Neuro-Symbolic Agentic Framework aimed at improving precision oncology for colorectal cancer drug response. It integrates a quantitative machine learning World Model with an LLM-based reasoning layer to address the challenges posed by high-dimensional genomic data and sparse drug response samples. The framework achieves a predictive correlation of 0.504 and enhances clinical decision-making by modeling the clinical context, particularly Microsatellite Instability (MSI) status. The approach also includes in silico CRISPR perturbations to predict drug sensitivity alterations due to specific genomic edits.

Precision oncology is currently limited by the small-N, large-P paradox, where high-dimensional genomic data is abundant, but high-quality drug response samples are often sparse. While deep learning models achieve high predictive accuracy, they remain black boxes that fail to provide the causal mechanisms required for clinical decision-making. We present a Neuro-Symbolic Agentic Framework that bridges this gap by integrating a quantitative machine learning World Model with an LLM-based agentic reasoning layer. Our system utilises a forensic data pipeline built on the Sanger GDSC dataset (N=83), achieving a robust predictive correlation (r=0.504) and a significant performance gain through the explicit modelling of clinical context, specifically Microsatellite Instability (MSI) status. We introduce the concept of Inverse Reasoning, where the agentic layer performs in silico CRISPR perturbations to predict how specific genomic edits, such as APC or TP53 repair, alter drug sensitivity. By distinguishing between therapeutic opportunity and contextual resistance, and validating these findings against human clinical data (p=0.023), our framework provides a transparent, biologically grounded path towards explainable AI in cancer research.

ARXIV Cancer: unknown Method: conformal prediction

Conformal Prediction for Risk-Controlled Medical Entity Extraction Across Clinical Domains

Manil Shrestha, Edward Kim
Published 2026-03-01 05:12

This study introduces a conformal prediction framework that provides coverage guarantees for medical entity extraction with large language models (LLMs), whose confidence scores are often miscalibrated. Applied to structured entities from FDA drug labels, extraction reaches 97.7% accuracy; applied to radiological entities from MIMIC-CXR reports, it reaches entity F1 of 0.81 to 0.84. The direction of miscalibration reverses across domains (underconfident on FDA labels, overconfident on radiology reports), so safe clinical deployment calls for domain-specific calibration.

Large Language Models (LLMs) are increasingly used for medical entity extraction, yet their confidence scores are often miscalibrated, limiting safe deployment in clinical settings. We present a conformal prediction framework that provides finite-sample coverage guarantees for LLM-based extraction across two clinical domains. First, we extract structured entities from 1,000 FDA drug labels across eight sections using GPT-4.1, verified via FactScore-based atomic statement evaluation (97.7% accuracy over 128,906 entities). Second, we extract radiological entities from MIMIC-CXR reports using the RadGraph schema with GPT-4.1 and Llama-4-Maverick, evaluated against physician annotations (entity F1: 0.81 to 0.84). Our central finding is that miscalibration direction reverses across domains: on well-structured FDA labels, models are underconfident, requiring modest conformal thresholds (τ ≈ 0.06), while on free-text radiology reports, models are overconfident, demanding strict thresholds (τ up to 0.99). Despite this heterogeneity, conformal prediction achieves target coverage (≥ 90%) in both settings with manageable rejection rates (9-13%). These results demonstrate that calibration is not a global model property but depends on document structure, extraction category, and model architecture, motivating domain-specific conformal calibration for safe clinical deployment.
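
The finite-sample guarantee mentioned above comes from how the threshold is chosen on a held-out calibration set. The sketch below shows the standard split conformal quantile, not the paper's exact procedure: how the authors define their nonconformity score and map it to the threshold τ is not stated in the abstract, so the score here is left generic and the function name is hypothetical.

```python
import math


def conformal_quantile(calibration_scores, alpha):
    """Split conformal threshold at miscoverage level alpha.

    Returns the ceil((n + 1) * (1 - alpha))-th smallest calibration score.
    If the calibration set is too small for the requested level, returns
    infinity, meaning no finite threshold can give the guarantee.
    """
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")
    return sorted(calibration_scores)[rank - 1]


if __name__ == "__main__":
    # Toy nonconformity scores (higher means less conforming).
    scores = [0.4, 0.1, 0.3, 0.2]
    threshold = conformal_quantile(scores, alpha=0.5)
    print(threshold)  # 0.3
```

If new items are exchangeable with the calibration items, the chance that a new score falls at or below this threshold is at least 1 - alpha. Calibrating separately per domain, as the abstract recommends, amounts to calling this once per calibration set.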

ARXIV Cancer: colorectal cancer Method: machine learning

Identifying and Characterising Response in Clinical Trials: Development and Validation of a Machine Learning Approach in Colorectal Cancer

Adam Marcus, Paul Agapow
Published 2026-02-28 18:00

This study presents a machine learning approach aimed at identifying and characterizing patient responses in clinical trials for colorectal cancer. The method integrates partly conditional modeling with treatment effect estimation using the Virtual Twins method, and employs survLIME for analyzing time-specific treatment responses. In a 1,000-patient simulation, the approach reached an area under the receiver operating characteristic curve (AUC) of 0.77 for identifying fixed responders, and partly conditional modeling raised the AUC for dynamic responders from 0.597 to 0.685. Applied to panitumumab trials in metastatic colorectal cancer, it identified genetic mutations, sites of metastasis, and ethnicity as important factors in treatment response.

Precision medicine promises to transform health care by offering individualised treatments that dramatically improve clinical outcomes. A necessary prerequisite is to identify subgroups of patients who respond differently to different therapies. Current approaches are limited to static measures of treatment success, neglecting the repeated measures found in most clinical trials. Our approach combines the concept of partly conditional modelling with treatment effect estimation based on the Virtual Twins method. The resulting time-specific responses to treatment are characterised using survLIME, an extension of Local Interpretable Model-agnostic Explanations (LIME) to survival data. Performance was evaluated using synthetic data and applied to clinical trials examining the effectiveness of panitumumab to treat metastatic colorectal cancer. An area under the receiver operating characteristic curve (AUC) of 0.77 for identifying fixed responders was achieved in a 1000 patient simulation. When considering dynamic responders, partly conditional modelling increased the AUC from 0.597 to 0.685. Applying the approach to colorectal cancer trials found genetic mutations, sites of metastasis, and ethnicity as important factors for response to treatment. Our approach can accommodate a dynamic response to treatment while potentially providing better performance than existing methods in instances of a fixed response to treatment. When applied to clinical data we attain results consistent with the literature.

ARXIV Cancer: gastric cancer Method: hierarchical classification

Hierarchical Classification for Improved Histopathology Image Analysis

Keunho Byeon, Jinsol Song, Seong Min Hong, Yosep Chong, Jin Tae Kwak
Published 2026-02-28 07:01

This study presents HiClass, a hierarchical classification framework designed to enhance histopathology image analysis by addressing the limitations of flat classification methods. The framework employs a multiple instance learning approach with bidirectional feature integration to improve the learning of hierarchical features. Tailored loss functions are introduced to optimize the classification process, and the method is evaluated on a gastric biopsy dataset, demonstrating superior performance in both coarse-grained and fine-grained classification tasks.

Whole-slide image (WSI) analysis is essential for diagnostic tasks in pathology, yet existing deep learning methods primarily rely on flat classification, ignoring hierarchical relationships among class labels. In this study, we propose HiClass, a hierarchical classification framework for improved histopathology image analysis that enhances both coarse-grained and fine-grained WSI classification. Built upon a multiple instance learning approach, HiClass introduces bidirectional feature integration that facilitates information exchange between coarse-grained and fine-grained feature representations, effectively learning hierarchical features. Moreover, we introduce tailored loss functions, including hierarchical consistency loss, intra- and inter-class distance loss, and group-wise cross-entropy loss, to further optimize hierarchical learning. We assess the performance of HiClass on a gastric biopsy dataset with 4 coarse-grained and 14 fine-grained classes, achieving superior classification performance for both coarse-grained and fine-grained classification. These results demonstrate the effectiveness of HiClass in improving WSI classification by capturing coarse-grained and fine-grained histopathological characteristics.
