Research Papers

ARXIV Cancer: brain tumor Method: lightweight network

M⁴Fuse: Lightweight State-Space MoE with a Cross-Scale Gating Bridge for Brain Tumor Segmentation

Meihua Zhou, Xinyu Tong, Li Yang
Published 2026-05-04 10:45

The paper presents M⁴Fuse, a lightweight network designed for brain tumor segmentation that addresses the limitations of existing 3D models. By balancing encoder and decoder capacities and utilizing a cross-scale dual-stage gating bridge, the method enhances performance while significantly reducing parameter count. Experimental results on the BraTS2019 and BraTS2021 benchmarks demonstrate that M⁴Fuse outperforms other lightweight models in both efficiency and accuracy.

Encoder-decoder imbalance and the reliance on large input volumes make many 3D brain tumor segmentation models both compute-heavy and brittle. We present M⁴Fuse, a lightweight network that prioritizes discriminative brain tumor cues over exhaustive appearance reconstruction. Our method balances encoder and decoder capacity and replaces depth expansion with a synergistic design: it propagates long-range context with linear complexity via a grouped state space mixer, denoises and aligns skip features using a cross-scale dual-stage gating bridge, and absorbs cross-site acquisition shifts with a sample-level mixture-of-experts. On the BraTS2019 and BraTS2021 benchmarks, M⁴Fuse outperforms other leading lightweight methods in both parameter count and performance. Even at a challenging input resolution of 64×128×128 (half that of leading existing models), M⁴Fuse reduces parameters by 62.63% and improves average performance by 0.09%. Ablations of key components validate the method's exceptional parameter-to-accuracy efficiency and robustness across diverse data centers.

ARXIV Cancer: unknown Method: deep learning

Multi-Rater Calibrated Segmentation Models

Meritxell Riera-Marín, Javier García López, Júlia Rodríguez-Comas, Miguel A. González Ballester, Adrian Galdran
Published 2026-05-04 10:35

This paper addresses the challenge of accurate probability estimates in medical image segmentation models, particularly in the context of multiple expert annotations with significant disagreement. The authors propose a novel approach that reformulates multi-rater supervision as an ordinal learning problem, linking predictive confidence to the variability in training data. Their method, which incorporates ordinal-aware scoring rules, demonstrates improved calibration of segmentation models without compromising accuracy across various medical imaging benchmarks.

Objective: Accurate probability estimates are essential for the safe deployment of medical image segmentation models in clinical decision-making. However, modern deep segmentation networks are often poorly calibrated, a problem exacerbated when multiple expert annotations exhibit substantial disagreement. While inter-rater variability is typically treated as noise, it provides valuable information about intrinsic annotation ambiguity that must be reflected in model confidence. Methods: We improve the probabilistic calibration of medical image segmentation models by reformulating multi-rater supervision as an ordinal learning problem. Voxel-wise annotator agreement is treated as an ordered target, linking predictive confidence to the empirical variability in training data. This formulation allows the use of ordinal-aware scoring rules, such as the Ranked Probability Score ordinal loss, combined with a standard binary objective to preserve discriminative performance. Results: We evaluated the proposed approach across four public segmentation benchmarks spanning ophthalmology, histopathology, and thoracic imaging. Calibration was assessed using a multi-rater extension of expected calibration error. Results consistently show that ordinal-aware training yields substantially improved calibration with respect to inter-rater agreement without degrading segmentation accuracy. Conclusions: Treating multi-rater annotations as ordered information provides a principled and architecture-agnostic route to more reliable probabilistic segmentation models.
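As a rough illustration of the ordinal scoring rule named in the abstract, the sketch below computes the standard (unnormalized) Ranked Probability Score for a single voxel. It assumes the ordered target is the number of raters who marked the voxel as foreground; the function name and the three-rater setup are hypothetical, and the paper's exact loss formulation may differ.

```python
def ranked_probability_score(probs, target):
    """Unnormalized RPS between a predicted distribution over ordered
    classes and an observed class index (lower is better)."""
    if abs(sum(probs) - 1.0) > 1e-6:
        raise ValueError("probs must sum to 1")
    if not 0 <= target < len(probs):
        raise ValueError("target must be a valid class index")
    cum_pred = 0.0
    score = 0.0
    for k, p in enumerate(probs):
        cum_pred += p                           # predicted CDF at level k
        cum_true = 1.0 if k >= target else 0.0  # observed CDF at level k
        score += (cum_pred - cum_true) ** 2
    return score


# Hypothetical voxel: 3 raters, so agreement levels are 0, 1, 2, 3.
# Two of the three raters marked this voxel as foreground.
target = 2
exact = ranked_probability_score([0.0, 0.0, 1.0, 0.0], target)
near_miss = ranked_probability_score([0.0, 0.0, 0.0, 1.0], target)
far_miss = ranked_probability_score([1.0, 0.0, 0.0, 0.0], target)
print(exact, near_miss, far_miss)
```

Unlike plain cross-entropy, the score grows with ordinal distance: a confident prediction one level off scores 1.0, while one two levels off scores 2.0.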

ARXIV Cancer: oncology Method: large language model

PhysicianBench: Evaluating LLM Agents in Real-World EHR Environments

Ruoqi Liu, Imran Q. Mohiuddin, Austin J. Schoeffler, Kavita Renduchintala, Ashwin Nayak, Prasantha L. Vemu, Shivam C. Vedak, Kameron C. Black, John L. Havlik, Isaac Ogunmola, Stephen P. Ma, Roopa Dhatt, Jonathan H. Chen
Published 2026-05-04 05:32

PhysicianBench is a benchmark designed to evaluate large language model (LLM) agents on physician tasks within electronic health record (EHR) environments. It includes 100 long-horizon tasks based on real clinical consultations, requiring complex workflows and interactions with patient data. The benchmark reveals a significant gap in the performance of current LLM agents, with the best model achieving only a 46% success rate in completing tasks. This tool aims to enhance the development of autonomous clinical agents by providing a realistic evaluation framework.

We introduce PhysicianBench, a benchmark for evaluating LLM agents on physician tasks grounded in real clinical settings within electronic health record (EHR) environments. Existing medical agent benchmarks primarily focus on static knowledge recall, single-step atomic actions, or action intent without verifiable execution against the environment. As a result, they fail to capture the long-horizon, composite workflows that characterize real clinical systems. PhysicianBench comprises 100 long-horizon tasks adapted from real consultation cases between primary care and subspecialty physicians, with each task independently reviewed by a separate panel of physicians. Tasks are instantiated in an EHR environment with real patient records and accessed through the same standard APIs used by commercial EHR vendors. Tasks span 21 specialties (e.g., cardiology, endocrinology, oncology, psychiatry) and diverse workflow types (e.g., diagnosis interpretation, medication prescribing, treatment planning), requiring an average of 27 tool calls per task. Solving each task requires retrieving data across encounters, reasoning over heterogeneous clinical information, executing consequential clinical actions, and producing clinical documentation. Each task is decomposed into structured checkpoints (670 in total across the benchmark) capturing distinct stages of completion, graded by task-specific scripts with execution-grounded verification. Across 13 proprietary and open-source LLM agents, the best-performing model achieves only a 46% success rate (pass@1), while open-source models reach at most 19%, revealing a substantial gap between current agent capabilities and the demands of real-world clinical workflows. PhysicianBench provides a realistic and execution-grounded benchmark for measuring progress toward autonomous clinical agents.

ARXIV Cancer: glioma Method: dual-branch CNN-Transformer

InfiltrNet: Dual-Branch CNN-Transformer Architecture for Brain Tumor Infiltration Risk Prediction

S M Asif Hossain, Shruti Kshirsagar
Published 2026-05-04 05:02

This paper introduces InfiltrNet, a dual-branch architecture that integrates a convolutional neural network with a Swin Transformer to predict brain tumor infiltration risk from multimodal MRI data. The method focuses on estimating the spatial extent of infiltration beyond visible tumor margins, which is crucial for surgical planning and radiation therapy. Extensive experiments show that InfiltrNet outperforms existing models, and explainability analyses confirm its focus on clinically relevant areas.

Gliomas are aggressive brain tumors that infiltrate surrounding tissue beyond the visible tumor margins observed on Magnetic Resonance Imaging (MRI). Predicting the spatial extent of this infiltration is essential for surgical planning and radiation therapy, yet existing deep learning approaches focus on segmenting the visible tumor rather than estimating infiltration risk in the surrounding tissue. This paper presents InfiltrNet, a novel dual-branch architecture that combines a convolutional neural network (CNN) encoder with a Swin Transformer encoder through cross-attention fusion modules to predict three-zone infiltration risk maps from multimodal MRI. A label generation strategy based on distance transforms is proposed to derive reproducible infiltration risk zones from standard Brain Tumor Segmentation (BraTS) annotations. InfiltrNet is trained with a combined Dice-CrossEntropy and boundary-aware loss augmented by auxiliary supervision heads at intermediate decoder levels. Extensive experiments on BraTS 2020 and BraTS 2025 demonstrate that InfiltrNet outperforms five established baselines. Explainability analysis using GradCAM++ and Occlusion sensitivity confirms that the model attends to clinically relevant peritumoral regions.
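To make the idea of distance-transform label generation concrete, here is a minimal sketch on a tiny 2D grid. Everything specific in it is an assumption for illustration: the function name, the brute-force Euclidean distance, the two thresholds, and the choice to give visible tumor its own label 0 alongside three peritumoral zones. The paper's actual zone definitions are not given in the abstract.

```python
import math


def risk_zones(mask, near=1.5, far=3.0):
    """Label each pixel by Euclidean distance to the nearest tumor pixel.

    0 = visible tumor, 1 = high risk (d <= near),
    2 = moderate risk (near < d <= far), 3 = low risk (d > far).
    Thresholds are in pixel units and purely illustrative.
    """
    tumor = [(r, c) for r, row in enumerate(mask)
             for c, v in enumerate(row) if v]
    zones = []
    for r, row in enumerate(mask):
        out = []
        for c, v in enumerate(row):
            if v:
                out.append(0)
                continue
            # Brute-force distance transform; fine for a toy grid only.
            d = min((math.hypot(r - tr, c - tc) for tr, tc in tumor),
                    default=math.inf)
            out.append(1 if d <= near else 2 if d <= far else 3)
        zones.append(out)
    return zones


# Toy 2D "segmentation mask" with a single tumor pixel in the corner.
mask = [
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
for row in risk_zones(mask):
    print(row)
```

On real 3D volumes a library distance transform would replace the brute-force loop, and distances would normally be scaled by voxel spacing.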

ARXIV Cancer: brain tumor Method: self-supervised learning

TumorXAI: Self-Supervised Deep Learning Framework for Explainable Brain MRI Tumor Classification

Abrar Hossain Zahin, Amit Kumar Saha, Tanvir Mridha, Saifur Rahman, Jannatul Ferdous Prome, Raima Husna, Israt Jahan, Ahmed Wasif Reza
Published 2026-05-03 18:02

This study presents a self-supervised learning framework for the classification of brain tumors using MRI scans. The authors evaluate multiple self-supervised learning methods, particularly focusing on SimCLR, which achieved high performance metrics on a dataset of 4,448 MRIs. The findings indicate that self-supervised models can outperform traditional supervised approaches, especially when labeled data is scarce, while also enhancing interpretability through Explainable AI techniques.

Classifying brain tumors using magnetic resonance imaging (MRI) is crucial for early diagnosis and treatment; however, tumor heterogeneity and a dearth of annotated datasets restrict the use of supervised deep learning approaches. In this work, we use self-supervised learning (SSL) to study multi-class brain tumor classification. Using a ResNet-50 backbone, we evaluate four SSL frameworks (SimCLR, BYOL, DINO, and MoCo v3) on a publicly available dataset of 4,448 MRIs with 17 distinct tumor types. On the dataset, SimCLR achieved 99.64% accuracy, 99.64% precision, 99.64% recall, and 99.64% F1-score. The workflow includes preprocessing, fine-tuning, linear evaluation, and SSL pretraining with data augmentations. Results show that, when labels are limited, SSL-pretrained models outperform supervised baselines in terms of F1-score, recall, accuracy, and precision. Additionally, by providing visual insights into model decisions, Explainable AI techniques (Grad-CAM, Grad-CAM++, EigenCAM) enhance interpretability. These results demonstrate SSL's scalability and dependability in diagnosing brain tumors from unlabeled medical data.

ARXIV Cancer: unknown Method: vision-language model

MedScribe: Clinically Grounded CT Reporting through Agentic Workflows

Giuseppe A. Orlando, Paolo Papotti, Maria A. Zuluaga, Olivier Humbert, Marco Lorenzi
Published 2026-05-03 08:32

The paper presents MedScribe, a novel framework for automated radiology report generation that addresses limitations in existing vision-language models. By reformulating report generation as an iterative evidence acquisition process, MedScribe enhances clinical accuracy and interpretability by dynamically invoking pathology-specific diagnostic tools to extract localized features. The framework shows improved performance on CT-RATE and RadChestCT datasets without requiring task-specific fine-tuning.

Vision-language models (VLMs) have shown potential for automated radiology report generation, yet existing approaches rely on global embedding compression of volumetric data, often leading to hallucinated findings and limited anatomical grounding in 3D CT imaging. We introduce MedScribe, a hypothesis-driven framework that reformulates report generation as an iterative evidence acquisition process rather than a single-pass encoding task. MedScribe models reporting as a sequential decision process in which a large language model dynamically invokes pathology-specific diagnostic tools to extract localized volumetric features. These structured features are used to query a multidimensional retrieval space aligned with pathology-specific textual evidence. By explicitly accumulating quantitative evidence prior to synthesis, the framework enforces fine-grained grounding and reduces unsupported claims. Without task-specific fine-tuning, MedScribe improves clinical accuracy, factual consistency, and interpretability on CT-RATE and RadChestCT compared to state-of-the-art 2D and 3D VLMs, demonstrating the value of hypothesis-driven reasoning for reliable medical image reporting.

ARXIV Cancer: non-small cell lung cancer Method: zero-shot vision-language model

Exploring Prompt Alignment with Clinical Factors in Zero-Shot Segmentation VLMs for NSCLC Tumor Segmentation

Suraj Pai, Thibault Heintz, Cosmin Ciausu, Marion Tonneau, Hugo Aerts, Raymond Mak
Published 2026-05-02 05:49

This study investigates the effectiveness of zero-shot vision-language models (VLMs) for gross tumor volume delineation in non-small-cell lung cancer (NSCLC). By analyzing various prompt dimensions and their impact on spatial attention, the research demonstrates that anatomical location significantly influences model performance. The VLM, VoxTell, achieved a mean Dice Similarity Coefficient of 0.613, comparable to fine-tuned models, highlighting its potential in tumor segmentation tasks.

Zero-shot vision-language models (VLMs) offer a promptable alternative to task-specific training for gross tumor volume (GTV) delineation in non-small-cell lung cancer (NSCLC), but the prompt dimensions that govern their spatial behavior remain poorly understood. We study this question by probing alignment directions in VoxTell on a held-out internal NSCLC tumor dataset through sub-prompt decomposition into diagnosis, demographic, staging, anatomical, generic, and irrelevant controls; attribute-wise perturbation robustness; specificity ladders; and cross-case prompt swaps, while benchmarking against fine-tuned and zero-shot baselines using the Dice Similarity Coefficient (DSC) with Wilcoxon signed-rank tests and Benjamini-Hochberg correction. Alignment analyses revealed that anatomical location is the dominant driver of VoxTell's spatial attention: 63.4 percent of location perturbations caused catastrophic drops, prompt specificity improved from generic to full descriptions except for diagnosis-only prompts, irrelevant prompts correctly yielded zero segmentation, and cross-case prompt swaps confirmed patient-specific conditioning (matched DSC 0.906 vs. mismatched 0.406). Histology and stage substitutions had minimal effect, indicating that the model prioritizes "where to look" over "what to look for." In this context, VoxTell, operating fully zero-shot, achieved a mean DSC of 0.613, statistically indistinguishable from nnUNet (0.690, adjusted p = 0.156) and Ahmed et al. (0.675, adjusted p = 0.679), while significantly outperforming all other zero-shot models. Together, these findings argue that segmentation VLMs should be evaluated not only by Dice, but also by the prompt dimensions to which they align.
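The "adjusted p" values quoted above come from a Benjamini-Hochberg correction. The sketch below shows the standard step-up adjustment in plain Python; the function name and the example p-values are made up for illustration and are not taken from the paper.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values from four pairwise model comparisons.
raw = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(raw))
```

A comparison is then called significant when its adjusted value falls below the chosen false discovery rate, for example 0.05.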

ARXIV Cancer: breast cancer Method: machine learning

To Use AI as Dice of Possibilities with Timing Computation

Jia Li, Vipin Kumar, Rui Zhang
Published 2026-05-01 22:25

This paper presents a novel verb-based modeling paradigm for AI that enhances the representation of future possibilities through timing computation. It applies this framework to longitudinal electronic health record (EHR) data from 3,276 breast cancer patients, achieving automatic discovery of significant patient trajectories and counterfactual timing deduction. These results are data-driven and do not require prior domain knowledge, marking a significant advancement in machine learning applications in healthcare.

The dominant noun-based modeling paradigm has fundamentally constrained AI development, precluding any adequate representation of the future as an open temporal dimension. This paper introduces a verb-based paradigm, together with precise definitions of timing computation and possibility, that enables AI to function as an effective instrument for realizing the grammar of our thought. Applied to longitudinal EHR data from 3,276 breast cancer patients, the framework empirically demonstrates: (1) automatic discovery of clinically significant patient trajectories, and (2) counterfactual timing deduction. Both results are purely data-driven, require no prior domain knowledge, and, to our knowledge, represent the first such demonstrations in the machine learning literature.

ARXIV Cancer: general cancer Method: LoRA fine-tuning

RadLite: Multi-Task LoRA Fine-Tuning of Small Language Models for CPU-Deployable Radiology AI

Pankaj Gupta, Kartik Bose
Published 2026-05-01 05:39

This study explores the use of small language models (SLMs) fine-tuned with LoRA for multi-task radiology applications, aiming to enable deployment in resource-constrained clinical settings. The models were trained on a diverse dataset covering various radiology tasks and demonstrated significant performance improvements over zero-shot baselines. Key findings include the complementary strengths of the models and the effectiveness of LoRA adaptation compared to few-shot prompting.

Large language models (LLMs) show promise in radiology but their deployment is limited by computational requirements that preclude use in resource-constrained clinical environments. We investigate whether small language models (SLMs) of 3-4 billion parameters can achieve strong multi-task radiology performance through LoRA fine-tuning, enabling deployment on consumer-grade CPUs. We train Qwen2.5-3B-Instruct and Qwen3-4B on 162K samples spanning 9 radiology tasks - RADS classification across 10 systems, impression generation, temporal comparison, radiology NLI, NER, abnormality detection, N/M staging, and radiology Q&A - compiled from 12 public datasets. Both models are evaluated on up to 500 held-out test samples per task with standardized metrics. Our key findings are: (1) LoRA fine-tuning dramatically improves performance over zero-shot baselines (RADS accuracy +53%, NLI +60%, N-staging +89%); (2) the two models exhibit complementary strengths - Qwen2.5 excels at structured generation tasks while Qwen3 dominates extractive tasks; (3) a task-routed oracle ensemble combining both models achieves the best performance across all tasks; (4) few-shot prompting with fine-tuned models hurts performance, demonstrating that LoRA adaptation is more effective than in-context learning for specialized domains; and (5) models can be quantized to GGUF format (~1.8-2.4GB) for CPU deployment at 4-8 tokens/second on consumer hardware. Our work demonstrates that small, efficiently fine-tuned models - which we collectively call RadLite - can serve as practical multi-task radiology AI assistants deployable entirely on consumer hardware without GPU requirements. Code and models are available at https://github.com/RadioX-Labs/RadLite
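For readers unfamiliar with why LoRA is cheap to train, the sketch below counts trainable parameters for one adapted weight matrix. LoRA freezes a d_out × d_in weight and learns two low-rank factors of shapes d_out × r and r × d_in. The layer size and rank used here are illustrative assumptions, not the configuration reported for RadLite.

```python
def full_params(d_in, d_out):
    """Parameters in a dense d_out x d_in weight matrix."""
    return d_in * d_out


def lora_trainable_params(d_in, d_out, rank):
    """Parameters in the two LoRA factors: B (d_out x r) and A (r x d_in)."""
    return rank * (d_in + d_out)


# Hypothetical projection layer and rank, chosen only for illustration.
d_in, d_out, rank = 2048, 2048, 16
dense = full_params(d_in, d_out)
lora = lora_trainable_params(d_in, d_out, rank)
print(f"dense: {dense:,}  lora: {lora:,}  ratio: {lora / dense:.4%}")
```

Because the adapter size grows linearly in the layer width rather than quadratically, only a small fraction of each adapted matrix needs gradients and optimizer state.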

ARXIV Cancer: general cancer Method: unknown

CURE-OOD: Benchmarking Out-of-Distribution Detection for Survival Prediction

Wenjie Zhao, Jia Li, Mingrui Liu, Jing Wang, Yunhui Guo
Published 2026-05-01 02:17

The paper introduces CURE-OOD, a benchmark designed to evaluate out-of-distribution (OOD) detection in cancer survival prediction. It addresses the challenge posed by variations in imaging acquisition that lead to OOD samples, which can undermine model reliability. The study demonstrates that covariate shifts significantly impact survival prediction performance and highlights the inadequacy of traditional OOD detectors in this context. The benchmark facilitates a systematic analysis of the effects of distribution shifts on survival prediction and OOD detectability.

"How long can I live and remain free of cancer?" is often the first question a patient asks after receiving a cancer diagnosis and treatment. Accurate survival prediction helps alleviate psychological distress and supports risk stratification and personalized treatment planning. Recent survival prediction frameworks have shown strong performance using computed tomography (CT) images. However, variations in imaging acquisition introduce out-of-distribution (OOD) samples caused by covariate shifts that undermine model reliability. Despite this challenge, to our knowledge, no existing benchmark systematically studies OOD detection in cancer survival prediction. To address this gap, we introduce the Cancer sURvival bEnchmark for OOD Detection (CURE-OOD), the first benchmark for systematically evaluating OOD detection in survival prediction under controlled acquisition-induced distribution shifts. CURE-OOD defines scanner-parameter-based training, in-distribution (ID), and OOD test splits across four survival prediction tasks. Our experiments show that covariate shifts notably reduce survival prediction performance. They also show that mainstream classification-oriented OOD detectors can fail in survival prediction. Finally, we include HazardDev as a simple survival-aware reference baseline for OOD detection. CURE-OOD enables systematic analysis of how distribution shifts affect both downstream survival performance and OOD detectability.
