Research Papers

ARXIV Cancer: general cancer Method: vision-language model

Suppressing Prior-Comparison Hallucinations in Radiology Report Generation via Semantically Decoupled Latent Steering

Ao Li, Rui Liu, Mingjie Li, Sheng Liu, Lei Wang, Xiaodan Liang, Lina Yao, Xiaojun Chang, Lei Xing
Published 2026-02-27 04:49

This paper presents Semantically Decoupled Latent Steering (SDLS), a framework for suppressing prior-comparison hallucination in automated radiology report generation with vision-language models. The method is a training-free, inference-time control strategy that uses large language model-driven semantic decomposition and QR-based orthogonalization to remove clinical semantics from the steering vector. On the BiomedGPT foundation model, it lowers the FilBERT hallucination score on MIMIC-CXR from 0.2373 to 0.1889 and raises CheXpert macro-F1 from 0.2242 to 0.3208, with zero-shot transfer evaluated on CheXpert Plus and IU-Xray.

Automated radiology report generation using vision-language models (VLMs) is limited by the risk of prior-comparison hallucination, where the model generates historical findings unsupported by the current study. We address this challenge with a training-free, inference-time control framework termed Semantically Decoupled Latent Steering (SDLS). Unlike generic activation steering, which often suffers from semantic entanglement, our approach constructs a semantic-free intervention vector via large language model (LLM)-driven semantic decomposition followed by QR-based orthogonalization. This orthogonalization step is critical. It leverages geometric constraints to filter out the clinical semantics often entangled in standard principal component analysis (PCA) directions, ensuring that the steering vector targets only the "historical comparison" axis. We validate our method on the BiomedGPT foundation model, demonstrating that it overcomes the trade-off between hallucination suppression and clinical accuracy. Extensive experiments on MIMIC-CXR and zero-shot transfer evaluation on CheXpert Plus and IU-Xray demonstrate the robustness of our approach. Quantitative evaluations on MIMIC-CXR show that our approach significantly reduces the probability of historical hallucinations (FilBERT score decreases from 0.2373 to 0.1889) and improves clinical label fidelity (CheXpert macro-F1 increases from 0.2242 to 0.3208). Supplementary evaluations confirm that the structural integrity of the clinical narrative is maintained.
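
The orthogonalization idea can be illustrated with a minimal sketch. It assumes the clinical-semantic directions are already available as plain vectors, obtains the Q factor of a QR decomposition via modified Gram-Schmidt, and projects the steering vector onto the orthogonal complement of those directions. The function names and toy vectors are hypothetical; the abstract does not specify the paper's exact construction.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def orthonormal_basis(vectors, tol=1e-12):
    """Modified Gram-Schmidt: returns the orthonormal columns (Q of a QR factorization)."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            c = dot(w, q)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:  # skip directions already spanned by the basis
            basis.append([wi / norm for wi in w])
    return basis

def remove_semantics(steer, semantic_dirs):
    """Project `steer` onto the orthogonal complement of span(semantic_dirs)."""
    out = list(steer)
    for q in orthonormal_basis(semantic_dirs):
        c = dot(out, q)
        out = [oi - c * qi for oi, qi in zip(out, q)]
    return out

# Toy 4-dimensional example (hypothetical values, not from the paper).
semantic_dirs = [[1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
raw_steer = [0.5, -0.3, 2.0, 1.0]
clean_steer = remove_semantics(raw_steer, semantic_dirs)
```

After the projection, `clean_steer` has no component along either semantic direction, which is the geometric property the abstract attributes to the orthogonalization step.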

ARXIV Cancer: general cancer Method: multimodal learning

Multimodal Alignment Improves Generalizability of Genomic Biomarker Prediction in Computational Pathology

Ekaterina Redekop, Eric Zimmermann, Ava P Amini, Alex X Lu, Neil Tenenholtz, James Brian Hall, Lorin Crawford, Kristen A Severson
Published 2026-02-27 03:55

This study presents MARBLE, a multimodal contrastive pretraining strategy aimed at improving the generalizability of genomic biomarker prediction in computational pathology. MARBLE integrates structured biomarker knowledge into histopathology image representation learning by aligning histopathology-derived representations with biomarker representations generated by a large language model and a protein language model. The approach is evaluated on the MSK-IMPACT cohort of over 40,000 patients, with experiments targeting generalization to novel, out-of-distribution biomarkers.

Computational pathology models that use digitized histopathology whole-slide images have the potential to become a cost-effective and scalable alternative to molecular assays for the prediction of genomic biomarkers, a key task in precision oncology. However, as new genomic biomarkers are discovered or quantified, large, labeled datasets must be prospectively collected to train new models. To address this challenge, we developed MARBLE, a multimodal contrastive pretraining strategy that integrates structured biomarker knowledge into representation learning of histopathology images. MARBLE aligns histopathology-derived representations with representations of genomic biomarkers generated by a large language model (LLM) and a protein language model (PLM). This biologically informed alignment enables data-efficient generalization to novel, out-of-distribution biomarkers. Using the MSK-IMPACT cohort of over 40,000 patients across multiple biomarker panel versions, we design experiments grounded in real-world data to demonstrate the value of our proposed approach.
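
As a rough illustration of contrastive alignment between two embedding spaces, the sketch below implements a symmetric InfoNCE loss in plain Python. The abstract does not state MARBLE's exact objective, so the loss form, the one-to-one pairing of image and biomarker embeddings, the temperature, and all names here are assumptions.

```python
import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return num / (na * nb)

def info_nce(image_embs, biomarker_embs, temperature=0.1):
    """Symmetric InfoNCE; the matched pair for row k is assumed to be column k."""
    n = len(image_embs)
    logits = [[cosine(i, b) / temperature for b in biomarker_embs]
              for i in image_embs]

    def mean_cross_entropy(rows):
        total = 0.0
        for k, row in enumerate(rows):
            m = max(row)  # subtract the max for a numerically stable log-sum-exp
            log_z = m + math.log(sum(math.exp(v - m) for v in row))
            total += log_z - row[k]
        return total / n

    columns = [list(c) for c in zip(*logits)]
    return 0.5 * (mean_cross_entropy(logits) + mean_cross_entropy(columns))

# Toy embeddings (hypothetical): matched pairs point in the same direction.
aligned_loss = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled_loss = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is near zero when each image embedding is closest to its own biomarker embedding and grows when the pairs are mismatched, which is the signal a contrastive pretraining step minimizes.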

ARXIV Cancer: kidney cancer Method: Hierarchical Multi-scale Knowledge-aware Graph Network

Hierarchical Multi-Scale Graph Learning with Knowledge-Guided Attention for Whole-Slide Image Survival Analysis

Bin Xu, Yufei Zhou, Boling Song, Jingwen Sun, Yang Bian, Cheng Lu, Ye Wu, Jianfei Tu, Xiangxue Wang
Published 2026-02-26 23:47

This study introduces the Hierarchical Multi-scale Knowledge-aware Graph Network (HMKGN) for analyzing whole-slide images (WSIs) in cancer prognostication. The method enforces a hierarchical structure with spatial locality constraints to model multi-scale interactions. On four TCGA cohorts (KIRC, LGG, PAAD, and STAD), HMKGN outperforms existing MIL-based models on survival prediction, with concordance indices reported as 10.85% better and statistically significant stratification of patient survival risk (log-rank p < 0.05).

We propose a Hierarchical Multi-scale Knowledge-aware Graph Network (HMKGN) that models multi-scale interactions and spatially hierarchical relationships within whole-slide images (WSIs) for cancer prognostication. Unlike conventional attention-based MIL, which ignores spatial organization, or graph-based MIL, which relies on static handcrafted graphs, HMKGN enforces a hierarchical structure with spatial locality constraints, wherein local cellular-level dynamic graphs aggregate spatially proximate patches within each region of interest (ROI) and a global slide-level dynamic graph integrates ROI-level features into WSI-level representations. Moreover, multi-scale integration at the ROI level combines coarse contextual features from broader views with fine-grained structural representations from local patch-graph aggregation. We evaluate HMKGN on four TCGA cohorts (KIRC, LGG, PAAD, and STAD; N=513, 487, 138, and 370) for survival prediction. It consistently outperforms existing MIL-based models, yielding improved concordance indices (10.85% better) and statistically significant stratification of patient survival risk (log-rank p < 0.05).
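
For readers unfamiliar with the concordance index used in the evaluation, here is a minimal sketch of Harrell's C-index in plain Python. It is a generic definition, not the authors' evaluation code, and the toy data are hypothetical.

```python
def concordance_index(times, events, risks):
    """Harrell's C: fraction of comparable pairs ordered correctly by risk.

    times  : observed follow-up times
    events : 1 if the event (death) was observed, 0 if censored
    risks  : predicted risk scores (higher means shorter expected survival)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Toy cohort (hypothetical): the third patient is censored.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
perfect = concordance_index(times, events, [0.9, 0.7, 0.4, 0.1])
reversed_order = concordance_index(times, events, [0.1, 0.4, 0.7, 0.9])
```

A value of 1.0 means every comparable pair is ranked correctly, 0.5 corresponds to random ranking, and 0.0 to a fully inverted ranking.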

ARXIV Cancer: brain cancer Method: Low-Rank Adaptation

Few-Shot Continual Learning for 3D Brain MRI with Frozen Foundation Models

Chi-Sheng Chen, Xinyu Zhang, Guan-Ying Chen, Qiuzhe Xie, Fan Zhang, En-Jui Kuo
Published 2026-02-26 22:31

This study presents a few-shot continual learning approach for 3D brain MRI that combines a frozen pretrained backbone with task-specific Low-Rank Adaptation (LoRA) modules. The method targets sequential adaptation to tumor segmentation and brain age estimation while preventing catastrophic forgetting. The LoRA approach gives the most balanced performance across the two tasks (T1 Dice 0.62 ± 0.07, T2 MAE 0.16 ± 0.05) with zero forgetting and under 0.1% trainable parameters per task.

Foundation models pretrained on large-scale 3D medical imaging data face challenges when adapted to multiple downstream tasks under continual learning with limited labeled data. We address few-shot continual learning for 3D brain MRI by combining a frozen pretrained backbone with task-specific Low-Rank Adaptation (LoRA) modules. Tasks arrive sequentially (tumor segmentation on BraTS, then brain age estimation on IXI) with no replay of previous task data. Each task receives a dedicated LoRA adapter; only the adapter and task-specific head are trained while the backbone remains frozen, thereby eliminating catastrophic forgetting by design (BWT=0). In continual learning, sequential full fine-tuning suffers severe forgetting (T1 Dice drops from 0.80 to 0.16 after T2), while sequential linear probing achieves strong T1 (Dice 0.79) but fails on T2 (MAE 1.45). Our LoRA approach achieves the best balanced performance across both tasks: T1 Dice 0.62 ± 0.07, T2 MAE 0.16 ± 0.05, with zero forgetting and <0.1% trainable parameters per task, though with noted systematic age underestimation in T2 (Wilcoxon p < 0.001). Frozen foundation models with task-specific LoRA adapters thus offer a practical solution when both tasks must be maintained under few-shot continual learning.
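
The forgetting figures can be made concrete with a small worked example. The sketch below assumes BWT denotes backward transfer in the usual continual-learning sense (the average change in performance on earlier tasks after training on the final task) and plugs in the T1 Dice values quoted in the abstract. The function name and score-matrix layout are hypothetical.

```python
def backward_transfer(scores):
    """scores[t][i]: performance on task i after training through task t.

    Returns the mean of scores[T-1][i] - scores[i][i] over all earlier tasks i.
    Only entries for tasks before the last one are read.
    """
    num_tasks = len(scores)
    if num_tasks < 2:
        raise ValueError("backward transfer needs at least two tasks")
    last = num_tasks - 1
    return sum(scores[last][i] - scores[i][i] for i in range(last)) / last

# T1 Dice after T1 and after T2; T2 entries are not needed for BWT (None).
full_fine_tuning = [[0.80, None], [0.16, None]]  # values from the abstract
lora_adapters = [[0.62, None], [0.62, None]]     # frozen backbone: T1 unchanged

bwt_full = backward_transfer(full_fine_tuning)
bwt_lora = backward_transfer(lora_adapters)
```

Under this definition, sequential full fine-tuning yields a backward transfer of about -0.64 Dice on T1, while the per-task adapters yield exactly zero because nothing that T1 depends on is updated during T2.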

ARXIV Cancer: brain tumor Method: multimodal learning

MM-NeuroOnco: A Multimodal Benchmark and Instruction Dataset for MRI-Based Brain Tumor Diagnosis

Feng Guo, Jiaxiang Liu, Yang Li, Qianqian Shi, Mingkun Xu
Published 2026-02-26 12:50

The paper presents MM-NeuroOnco, a multimodal benchmark and instruction dataset for MRI-based brain tumor diagnosis. It includes 24,726 MRI slices from 20 data sources and approximately 200,000 semantically enriched instructions to support diagnostic reasoning. The study also introduces a multi-model collaborative pipeline for automated medical information completion and quality control, which generates diagnosis-related semantics beyond mask-only annotations. Across ten evaluated models, even the strongest baseline, Gemini 3 Flash, reaches only 41.88% accuracy on diagnosis-related questions; the proposed NeuroOnco-GPT achieves a 27% absolute accuracy improvement on diagnostic questions after fine-tuning.

Accurate brain tumor diagnosis requires models to not only detect lesions but also generate clinically interpretable reasoning grounded in imaging manifestations, yet existing public datasets remain limited in annotation richness and diagnostic semantics. To bridge this gap, we introduce MM-NeuroOnco, a large-scale multimodal benchmark and instruction-tuning dataset for brain tumor MRI understanding, consisting of 24,726 MRI slices from 20 data sources paired with approximately 200,000 semantically enriched multimodal instructions spanning diverse tumor subtypes and imaging modalities. To mitigate the scarcity and high cost of diagnostic semantic annotations, we develop a multi-model collaborative pipeline for automated medical information completion and quality control, enabling the generation of diagnosis-related semantics beyond mask-only annotations. Building upon this dataset, we further construct MM-NeuroOnco-Bench, a manually annotated evaluation benchmark with a rejection-aware setting to reduce biases inherent in closed-ended question formats. Evaluation across ten representative models shows that even the strongest baseline, Gemini 3 Flash, achieves only 41.88% accuracy on diagnosis-related questions, highlighting the substantial challenges of multimodal brain tumor diagnostic understanding. Leveraging MM-NeuroOnco, we further propose NeuroOnco-GPT, which achieves a 27% absolute accuracy improvement on diagnostic questions following fine-tuning. This result demonstrates the effectiveness of our dataset and benchmark in advancing clinically grounded multimodal diagnostic reasoning. Code and dataset are publicly available at: https://github.com/gfnnnb/MM-NeuroOnco

ARXIV Cancer: unknown Method: foundation model

A data- and compute-efficient chest X-ray foundation model beyond aggressive scaling

Chong Wang, Yabin Zhang, Yunhe Gao, Maya Varma, Clemence Mottez, Faidra Patsatzi, Jiaming Liu, Jin Long, Jean-Benoit Delbrouck, Sergios Gatidis, Akshay S. Chaudhari, Curtis P. Langlotz
Published 2026-02-26 10:32

This study presents CheXficient, a chest X-ray foundation model that improves pretraining efficiency through active data curation. By selectively prioritizing informative training samples, CheXficient is pretrained on only 22.7% of 1,235,004 paired CXR images and reports, using under 27.3% of the total compute budget, while achieving comparable or superior performance to its full-data counterpart and other large-scale pretrained models. Evaluated on 20 benchmarks spanning 5 task types, the model shows improved generalizability on long-tailed or rare conditions.

Foundation models for medical imaging are typically pretrained on increasingly large datasets, following a "scale-at-all-costs" paradigm. However, this strategy faces two critical challenges: large-scale medical datasets often contain substantial redundancy and severe class imbalance that bias representation learning toward over-represented patterns, and indiscriminate training regardless of heterogeneity in data quality incurs considerable computational inefficiency. Here we demonstrate that active, principled data curation during pretraining can serve as a viable, cost-effective alternative to brute-force dataset enlargement. We introduce CheXficient, a chest X-ray (CXR) foundation model that selectively prioritizes informative training samples. CheXficient is pretrained on only 22.7% of 1,235,004 paired CXR images and reports while consuming under 27.3% of the total compute budget, yet achieving comparable or superior performance to its full-data counterpart and other large-scale pretrained models. We assess CheXficient across 20 individual benchmarks spanning 5 task types, including non-adapted off-the-shelf evaluations (zero-shot findings classification and crossmodal retrieval) and adapted downstream tasks (disease prediction, semantic segmentation, and radiology report generation). Further analyses show that CheXficient systematically prioritizes under-represented training samples, improving generalizability on long-tailed or rare conditions. Overall, our work offers practical insights into the data and computation demands for efficient pretraining and downstream adaptation of medical vision-language foundation models.

ARXIV Cancer: general cancer Method: vision transformer

GazeXPErT: An Expert Eye-tracking Dataset for Interpretable and Explainable AI in Oncologic FDG-PET/CT Scans

Joy T Wu, Daniel Beckmann, Sarah Miller, Alexander Lee, Elizabeth Theng, Stephan Altmayer, Ken Chang, David Kersting, Tomoaki Otani, Brittany Z Dashevsky, Hye Lim Park, Matteo Novello, Kip Guja, Curtis Langlotz, Ismini Lourentzou, Daniel Gruhl, Benjamin Risse, Guido A Davidzon
Published 2026-02-26 04:39

The study introduces GazeXPErT, an eye-tracking dataset aimed at improving interpretability and explainability in AI applications for oncologic FDG-PET/CT scans. It captures expert gaze patterns during tumor detection and measurement on 346 scans. In baseline experiments, incorporating expert gaze patterns raises the Dice score of a 3D nnUNet tumor segmentation model from 0.6008 to 0.6819, and vision transformers trained on sequential gaze and PET/CT images improve dynamic lesion localization.

[18F]FDG-PET/CT is a cornerstone imaging modality for tumor staging and treatment response assessment across many cancer types, yet expert reader shortages necessitate more efficient diagnostic aids. While standalone AI models for automatic lesion segmentation exist, clinical translation remains hindered by concerns about interpretability, explainability, reliability, and workflow integration. We present GazeXPErT, a 4D eye-tracking dataset capturing expert search patterns during tumor detection and measurement on 346 FDG-PET/CT scans. Each study was read by a trainee and a board-certified nuclear medicine or radiology specialist using an eye-tracking-enabled annotation platform that simulates routine clinical reads. From 3,948 minutes of raw 60 Hz eye-tracking data, 9,030 unique gaze-to-lesion trajectories were extracted, synchronized with PET/CT image slices, and rendered in COCO-style format for multiple machine learning applications. Baseline validation experiments demonstrate that a 3D nnUNet tumor segmentation model achieved superior performance when incorporating expert gaze patterns versus without (Dice score 0.6819 versus 0.6008), and that vision transformers trained on sequential gaze and PET/CT images can improve dynamic lesion localization (74.95% predicted gaze point closer to tumor) and expert intention prediction (Accuracy 67.53% and AUROC 0.747). GazeXPErT is a valuable resource designed to explore multiple machine learning problems beyond these baseline experiments, including, but not limited to, visual grounding or causal reasoning, clinically explainable feature augmentation, human-computer interaction, human intention prediction or understanding, and expert gaze-rewarded modeling approaches to AI in oncologic FDG-PET/CT imaging.
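
The Dice score reported for the segmentation baseline measures overlap between a predicted and a reference mask. A minimal sketch over flattened binary masks follows; the function name and toy masks are hypothetical, and real evaluations operate on 3D volumes.

```python
def dice_score(pred, target):
    """Dice = 2 * |pred AND target| / (|pred| + |target|) for binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # both masks empty: treated here as perfect agreement
    return 2.0 * intersection / total

# Toy flattened masks (hypothetical): one of two predicted voxels is correct.
half_overlap = dice_score([1, 1, 0, 0], [1, 0, 1, 0])
full_overlap = dice_score([0, 1, 1, 0], [0, 1, 1, 0])
```

Treating two empty masks as a score of 1.0 is one common convention; some evaluation pipelines exclude such cases instead.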

ARXIV Cancer: renal cancer Method: deep learning

Enhancing Renal Tumor Malignancy Prediction: Deep Learning with Automatic 3D CT Organ Focused Attention

Zhengkang Fan, Chengkun Sun, Russell Terry, Jie Xu, Longin Jan Latecki
Published 2026-02-25 20:26

This study presents a deep learning framework that predicts malignancy in renal tumors from 3D CT images without requiring segmentation at deployment time. The framework uses an Organ Focused Attention (OFA) loss function that modifies patch attention so that organ patches attend only to other organ patches. The method outperforms conventional models that rely on segmentation-based cropping, achieving an AUC of 0.685 and an F1-score of 0.872 on a private dataset, and an AUC of 0.760 and an F1-score of 0.852 on the KiTS21 dataset.

Accurate prediction of malignancy in renal tumors is crucial for informing clinical decisions and optimizing treatment strategies. However, existing imaging modalities lack the necessary accuracy to reliably predict malignancy before surgical intervention. While deep learning has shown promise in malignancy prediction using 3D CT images, traditional approaches often rely on manual segmentation to isolate the tumor region and reduce noise, which enhances predictive performance. Manual segmentation, however, is labor-intensive, costly, and dependent on expert knowledge. In this study, a deep learning framework was developed utilizing an Organ Focused Attention (OFA) loss function to modify the attention of image patches so that organ patches attend only to other organ patches. Hence, no segmentation of 3D renal CT images is required at deployment time for malignancy prediction. The proposed framework achieved an AUC of 0.685 and an F1-score of 0.872 on a private dataset from the UF Integrated Data Repository (IDR), and an AUC of 0.760 and an F1-score of 0.852 on the publicly available KiTS21 dataset. These results surpass the performance of conventional models that rely on segmentation-based cropping for noise reduction, demonstrating the framework's ability to enhance predictive accuracy without explicit segmentation input. The findings suggest that this approach offers a more efficient and reliable method for malignancy prediction, thereby enhancing clinical decision-making in renal cancer diagnosis.
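
One way to picture an attention-focusing penalty of this kind is sketched below. It assumes a row-normalized patch-to-patch attention matrix and a per-patch organ mask that is available only during training, and it penalizes the attention mass that organ patches place on non-organ patches. This is a hypothetical formulation for illustration; the abstract does not give the actual OFA loss.

```python
def organ_focus_penalty(attention, organ_mask):
    """Mean attention mass that organ patches place on non-organ patches.

    attention  : square matrix, attention[i][j] = weight patch i gives patch j
                 (each row is assumed to sum to 1)
    organ_mask : organ_mask[i] is truthy if patch i lies inside the organ
    """
    organ_rows = [i for i, inside in enumerate(organ_mask) if inside]
    if not organ_rows:
        return 0.0  # nothing to constrain when no patch is labeled as organ
    leaked = 0.0
    for i in organ_rows:
        leaked += sum(a for a, inside in zip(attention[i], organ_mask)
                      if not inside)
    return leaked / len(organ_rows)

# Toy example (hypothetical): patches 0 and 1 are organ, patch 2 is background.
attention = [
    [0.5, 0.5, 0.0],  # organ patch attending only to organ patches
    [0.2, 0.3, 0.5],  # organ patch leaking half its attention to background
    [0.1, 0.1, 0.8],  # background patch: not constrained
]
penalty = organ_focus_penalty(attention, [1, 1, 0])
```

Driving such a penalty to zero during training would leave organ patches attending only to each other, so that no mask is needed at inference time.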

ARXIV Cancer: general cancer Method: deep learning

Enabling clinical use of foundation models in histopathology

Audun L. Henriksen, Ole-Johan Skrede, Lisa van der Schee, Enric Domingo, Sepp De Raedt, Ilyá Kostolomov, Jennifer Hay, Karolina Cyll, Wanja Kildal, Joakim Kalsnes, Robert W. Williams, Manohar Pradhan, John Arne Nesheim, Hanne A. Askautrud, Maria X. Isaksen, Karmele Saez de Gordoa, Miriam Cuatrecasas, Joanne Edwards, TransSCOT group, Arild Nesbakken, Neil A. Shepherd, Ian Tomlinson, Daniel-Christoph Wagner, Rachel S. Kerr, Tarjei Sveinsgjerd Hveem, Knut Liestøl, Yoshiaki Nakamura, Marco Novelli, Masaaki Miyo, Sebastian Foersch, David N. Church, Miangela M. Lacle, David J. Kerr, Andreas Kleppe
Published 2026-02-25 19:13

This study investigates how to make foundation-model-based deep learning systems in histopathology robust enough for clinical use. By introducing novel robustness losses during the training of downstream task-specific models, the authors reduce sensitivity to pre-analytic and scanner-specific variability without retraining the foundation models. Experiments use 27,042 whole slide images from 6,155 patients and features from eight foundation models, and show that focusing on biologically relevant features also improves prediction accuracy.

Foundation models in histopathology are expected to facilitate the development of high-performing and generalisable deep learning systems. However, current models capture not only biologically relevant features, but also pre-analytic and scanner-specific variation that biases the predictions of task-specific models trained from the foundation model features. Here we show that introducing novel robustness losses during training of downstream task-specific models reduces sensitivity to technical variability. A purpose-designed comprehensive experimentation setup with 27,042 WSIs from 6,155 patients is used to train thousands of models from the features of eight popular foundation models for computational pathology. In addition to a substantial improvement in robustness, we observe that prediction accuracy improves by focusing on biologically relevant features. Our approach successfully mitigates robustness issues of foundation models for computational pathology without retraining the foundation models themselves, enabling development of robust computational pathology models applicable to real-world data in routine clinical practice.

ARXIV Cancer: general cancer Method: mixed magnification aggregation

Mixed Magnification Aggregation for Generalizable Region-Level Representations in Computational Pathology

Eric Zimmermann, Julian Viret, Michal Zelechowski, James Brian Hall, Neil Tenenholtz, Adam Casson, George Shaikovski, Eugene Vorontsov, Siqi Liu, Kristen A Severson
Published 2026-02-25 18:23

This study presents a mixed magnification aggregation method aimed at improving region-level representations in computational pathology. The proposed region-level mixing encoder fuses image tile representations from a mixed magnification foundation model, pretrained with masked embedding modeling, and is evaluated on transfer to biomarker prediction tasks across various cancer types. Results show cancer-dependent improvements in predictive performance, highlighting the importance of spatial context.

In recent years, a standard computational pathology workflow has emerged where whole slide images are cropped into tiles, these tiles are processed using a foundation model, and task-specific models are built using the resulting representations. At least 15 different foundation models have been proposed, and the vast majority are trained exclusively with tiles using the 20× magnification. However, it is well known that certain histologic features can only be discerned with larger context windows and require a pathologist to zoom in and out when analyzing a whole slide image. Furthermore, creating 224×224 pixel crops at 20× leads to a large number of tiles per slide, which can be gigapixel in size. To more accurately capture multi-resolution features and investigate the possibility of reducing the number of representations per slide, we propose a region-level mixing encoder. Our approach jointly fuses image tile representations of a mixed magnification foundation model using a masked embedding modeling pretraining step. We explore a design space for pretraining the proposed mixed-magnification region aggregators and evaluate our models on transfer to biomarker prediction tasks representing various cancer types. Results demonstrate cancer-dependent improvements in predictive performance, highlighting the importance of spatial context and understanding.
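
To give a sense of scale for the tile counts mentioned above, the following sketch counts non-overlapping 224×224 tiles on a slide. The slide dimensions are hypothetical, and the count ignores background filtering, which in practice discards many tiles.

```python
def tiles_per_slide(width_px, height_px, tile_px=224):
    """Number of non-overlapping tiles needed to cover the slide (ceil per axis)."""
    cols = -(-width_px // tile_px)   # integer ceiling division
    rows = -(-height_px // tile_px)
    return cols * rows

# Hypothetical 100,000 x 80,000 pixel slide (8 gigapixels) at 20x.
tile_count = tiles_per_slide(100_000, 80_000)
```

For this assumed slide size the grid is 447 by 358 tiles, or 160,026 tiles in total, each of which yields one representation under the standard tile-level workflow.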
