Research Papers

ARXIV Cancer: general cancer Method: large language model

AgentRx: A Benchmark Study of LLM Agents for Multimodal Clinical Prediction Tasks

Baraa Al Jorf, Farah E. Shamout
Published 2026-05-11 09:46

This study evaluates large language model (LLM)-based agents on multimodal clinical prediction tasks using large-scale real-world data. Single-agent frameworks outperform naive multi-agent systems, handle multimodal data better, and are better calibrated. The findings suggest that multi-agent collaboration must improve before such systems can handle heterogeneous inputs in clinical risk prediction. The authors release their code and evaluation framework as open source to support future research in this area.

Building effective clinical decision support systems requires the synthesis of complex heterogeneous multimodal data. Such modalities include temporal electronic health records data, medical images, radiology reports, and clinical notes. Large language model (LLM)-based agents have shown impressive performance in various healthcare tasks, especially those involving textual modalities. Considering the fragmentation of healthcare data across hospital systems, collaborative agent frameworks present a promising direction to mitigate data sharing challenges. However, the effectiveness of LLM agents for multimodal clinical risk prediction remains largely unexamined. In this work, we conduct a systematic evaluation of LLM-based agents for clinical prediction tasks using large-scale real-world data. We assess performance in unimodal and multimodal settings and quantify performance gaps between single agent and multi-agent systems. Our findings highlight that single agent frameworks outperform naive multi-agent systems, are better at handling multimodal data, and are better calibrated. This underscores a critical need for improving multi-agent collaboration to better handle heterogeneous inputs. By open-sourcing our code and evaluation framework, this work offers a new benchmark to support future developments relating to agentic systems in healthcare.

ARXIV Cancer: oncological Method: vision-language models

Med-StepBench: A Hierarchical Reasoning Framework for Evaluating Hallucinations in Medical Vision-Language Models

Minh Khoi Nguyen, Dai Lam Le, Amir Reza Jafari, Tuan Dung Nguyen, Mai Hong Son, Mai Huy Thong, Quang Huy Nguyen, Thanh Trung Nguyen, Reza Farahbakhsh, Noel Crespi, Phi Le Nguyen
Published 2026-05-11 05:26

This paper presents Med-StepBench, a novel benchmark designed to evaluate hallucinations in vision-language models (VLMs) used for medical image understanding, specifically in 3D oncological PET/CT. The benchmark includes over 12,000 images and more than 1,000,000 image-statement pairs, allowing for a detailed assessment of clinical reasoning across four diagnostic stages. The study reveals significant limitations in current VLMs, particularly their susceptibility to generating clinically plausible yet incorrect statements, which can obscure critical reasoning errors.

Large vision-language models (VLMs) demonstrate strong performance in medical image understanding, but frequently generate clinically plausible yet incorrect statements, raising significant safety concerns. Existing medical hallucination benchmarks primarily focus on 2D imaging with one-shot diagnostic questions, offering limited insight into whether predictions are grounded in correct localization and abnormality identification, allowing critical reasoning errors to remain hidden behind seemingly correct diagnoses. We introduce Med-StepBench, the first large-scale benchmark for step-wise hallucination detection in 3D oncological PET/CT, comprising over 12,000 images and more than 1,000,000 image-statement pairs across volumetric and multi-view 2D data, which decomposes clinical reasoning into four expert-designed diagnostic stages. Using clinician-verified annotations, we perform the first step-level evaluation of general-purpose and medical VLMs, revealing systematic failure modes obscured by aggregate accuracy metrics. Furthermore, we show that current VLMs are highly susceptible to adversarial yet clinically plausible intermediate explanations, which significantly amplify hallucinations despite contradictory visual evidence. Together, our findings highlight fundamental limitations in grounding multi-step clinical reasoning and establish Med-StepBench as a rigorous benchmark for developing safer and more reliable medical VLMs.

ARXIV Cancer: breast cancer Method: conditional energy model

Free Energy Manifold: Score-Based Inference for Hybrid Bayesian Networks

Cheol Young Park, Shou Matsumoto
Published 2026-05-11 00:43

The paper presents the Free Energy Manifold (FEM), a score-trained conditional energy model designed for inference in hybrid Bayesian networks that combine discrete and continuous variables. The study identifies a mode-bridge artifact in standard conditional energy models and introduces valley regularization, which restores near-uniform posteriors in off-data regions while preserving in-data fit. On synthetic multimodal benchmarks, FEM substantially reduces KL divergence relative to classical baselines and a vanilla conditional EBM, particularly at mode-bridge midpoint queries and in compositional inference. A UCI Breast Cancer sanity check indicates that FEM is most useful for multimodal or compositional inference, while discriminative classifiers remain preferable for closed-world classification.

We introduce the Free Energy Manifold (FEM), a score-trained conditional energy model specialized for inference in hybrid Bayesian networks with discrete and continuous variables. FEM represents each conditional factor as an energy landscape over learned discrete-parent embeddings and continuous observations, enabling posterior evaluation, generative sampling, and compositional inference across multiple continuous leaves by energy addition under conditional independence. A central finding is the mode-bridge artifact: standard conditional energy models can create low-energy ridges between separated modes of the same class, producing overconfident posteriors at off-data interior points. We analyze this failure and propose valley regularization, an off-data calibration term that restores near-uniform posteriors in such regions while preserving in-data fit. Across synthetic multimodal hybrid-BN benchmarks, FEM substantially reduces KL divergence relative to classical baselines and a vanilla conditional EBM, including large gains at mode-bridge midpoint queries and in multi-leaf evidence composition. We also evaluate high-cardinality discrete-parent settings and a UCI Breast Cancer sanity check, showing that FEM is most useful when multimodal or compositional Bayesian-network inference is required, while discriminative classifiers remain preferable for closed-world classification tasks.
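The "energy addition under conditional independence" step can be illustrated with a minimal sketch. This is not the authors' implementation: the Gaussian energies below are a hypothetical stand-in for FEM's learned energy landscapes, and the names `gaussian_energy` and `posterior` are assumptions made for illustration. Each energy is taken to be a fully normalized negative log-density, so the posterior over the discrete parent is a softmax of the log-prior minus the summed per-leaf energies.

```python
import math


def gaussian_energy(x, mean, std):
    """Toy energy: negative log-density of N(mean, std^2) evaluated at x."""
    return 0.5 * ((x - mean) / std) ** 2 + math.log(std * math.sqrt(2 * math.pi))


def posterior(prior, leaf_params, observations):
    """Posterior over a discrete parent c given several continuous leaves.

    prior[c] is p(c); leaf_params[k][c] is the (mean, std) of leaf k under
    class c; observations[k] is the observed value of leaf k. Leaves are
    assumed conditionally independent given c, so their energies add.
    """
    logits = []
    for c, p_c in enumerate(prior):
        total_energy = sum(
            gaussian_energy(x, *leaf_params[k][c])
            for k, x in enumerate(observations)
        )
        logits.append(math.log(p_c) - total_energy)
    # Softmax with a max-shift for numerical stability.
    shift = max(logits)
    weights = [math.exp(value - shift) for value in logits]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical two-class, two-leaf network.
PRIOR = [0.5, 0.5]
LEAVES = [
    [(0.0, 1.0), (2.0, 1.0)],  # leaf 0: (mean, std) for class 0 and class 1
    [(0.0, 1.0), (2.0, 1.0)],  # leaf 1
]

one_leaf = posterior(PRIOR, LEAVES[:1], [0.0])
two_leaves = posterior(PRIOR, LEAVES, [0.0, 0.0])
print(one_leaf, two_leaves)
```

In this toy setting, adding a second agreeing leaf makes the posterior for class 0 more confident than a single leaf does, which is the behavior that evidence composition is meant to provide.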

ARXIV Cancer: unknown Method: medical vision-language models

DeepTumorVQA: A Hierarchical 3D CT Benchmark for Stage-Wise Evaluation of Medical VLMs and Tool-Augmented Agents

Yixiong Chen, Wenjie Xiao, Pedro R. A. S. Bassi, Boyan Wang, Liang He, Xinze Zhou, Sezgin Er, Ibrahim Ethem Hamamci, Zongwei Zhou, Alan Yuille
Published 2026-05-10 17:57

The paper introduces DeepTumorVQA, a hierarchical benchmark for evaluating medical vision-language models (VLMs) and AI agents on tumor diagnosis from 3D CT images. It decomposes reasoning into four stages: recognition, measurement, visual reasoning, and medical reasoning, while keeping higher-level questions independently scorable. Across more than 30 model configurations, reliable quantitative measurement emerges as the primary bottleneck, and tool augmentation substantially mitigates it. The benchmark contains 476K questions across 42 clinical subtypes on 9,262 3D CT volumes.

Medical vision-language models (VLMs) and AI agents have made significant progress in learning to analyze and reason about clinical images. However, existing medical visual question answering (VQA) benchmarks collapse model capabilities into a single accuracy score, obscuring where and why models fail. We propose DeepTumorVQA, a hierarchical benchmark that follows the multi-stage evidence chain in tumor diagnosis and decomposes 3D CT reasoning into four stages: recognition, measurement, visual reasoning, and medical reasoning. Higher-level questions remain independently scorable, while their ground-truth evidence chains are defined over lower-level primitives. The benchmark contains 476K questions across 42 clinical subtypes on 9,262 3D CT volumes. In addition to a direct reasoning mode for VLMs, DeepTumorVQA provides tool-interaction environments for agent evaluation, where a model can call external tools, including segmentation models, measurement programs, and medical knowledge modules, before answering the question. Evaluating over 30 model configurations, we find that reliable quantitative measurement is the primary bottleneck, making later-stage visual and medical reasoning harder for VLMs, while tool augmentation substantially mitigates this issue. When tools are available, leveraging medical knowledge and tools to reason about medical images becomes a new challenge. We further show that ground-truth step-by-step tool-use traces from DeepTumorVQA can supervise agents and reduce tool-use and reasoning failures. This stage-wise progression from recognition to measurement to visual and medical reasoning provides a concrete roadmap for future medical VLM and AI agent studies. All data and code are released at https://github.com/Schuture/DeepTumorVQA.

ARXIV Cancer: unknown Method: LoRA-based fine-tuning

LiteMedCoT-VL: Parameter-Efficient Adaptation for Medical Visual Question Answering

Runze Ma, Shunbo Jia, Haonan Lyu, Guo Liu, Caizhi Liao
Published 2026-05-10 07:21

The paper presents LiteMedCoT-VL, a pipeline that strengthens the reasoning of compact vision-language models (VLMs) for medical visual question answering (VQA). It transfers chain-of-thought reasoning from a 235B teacher model to 2B student models through LoRA-based fine-tuning, aiming to close the reasoning gap on resource-constrained hardware. On the PMC-VQA benchmark, LiteMedCoT-VL reaches 64.9% accuracy, 11.0 percentage points above the zero-shot Qwen3-VL-4B baseline of 53.9%.

The reasoning gap between large and compact vision-language models (VLMs) limits the deployment of medical AI on portable clinical devices. Compact VLMs of 2-4B parameters can run on resource-constrained hardware but lack the multi-step reasoning capacity needed for interpretable clinical decision support. Existing knowledge distillation methods transfer answers without the reasoning process behind them. Medical visual question answering (VQA) serves as a testbed for this problem, as it requires models to integrate visual evidence with clinical knowledge through structured reasoning chains. We introduce LiteMedCoT-VL, a pipeline that transfers chain-of-thought reasoning from a 235B teacher model to 2B student models through LoRA-based fine-tuning on explanation-enriched training data. All inference is conducted without image captions by default, simulating the clinical scenario in which a physician interprets a medical image directly without an accompanying radiology report. On the PMC-VQA benchmark, LiteMedCoT-VL achieves 64.9% accuracy, exceeding the zero-shot Qwen3-VL-4B baseline of 53.9% by 11.0 percentage points and outperforming all published baselines. This result indicates that a 2B model with reasoning distillation can match or exceed models with twice the parameters. Visual grounding analysis shows that the model relies on image content rather than exploiting textual priors. Our code is publicly available at https://anonymous.4open.science/r/LiteMedCoT-VL.

ARXIV Cancer: general cancer Method: vision-language model

KEPIL: Knowledge-Enhanced Prompt-Image Learning for Prompt-Robust Disease Detection

Haozhe Luo, Shelley Zixin Shu, Ziyu Zhou, Robert Berke, Mauricio Reyes
Published 2026-05-09 19:29

The paper presents KEPIL, a framework designed to enhance the robustness of vision-language models (VLMs) for disease detection in radiology. It integrates curated medical knowledge to improve zero-shot generalization and addresses the sensitivity of current models to prompt variations. KEPIL employs dynamic prompt enrichment, a semantic-aware contrastive loss, and entity-centric report standardization, achieving state-of-the-art performance across multiple benchmarks. The results indicate that structured knowledge and effective prompt design are crucial for reliable clinical applications.

Vision-language models (VLMs) show promise for clinical decision support in radiology because they enable joint reasoning over radiological images and clinical text, thereby leveraging complementary clinical information. However, radiological findings are long-tailed in practice, leaving some conditions underrepresented and making zero-shot inference essential. Yet current CLIP-style medical VLMs are sensitive to prompt variations and often lack trustworthy external knowledge at inference time, which hinders reliable clinical deployment. We present KEPIL, a prompt-robust framework that integrates curated medical knowledge to stabilize zero-shot generalization. KEPIL comprises: (i) dynamic prompt enrichment using ontologies with LLM assistance, (ii) a semantic-aware contrastive loss aligning embeddings of equivalent prompt variants via a dual-embedding objective, and (iii) entity-centric report standardization to yield ontology-aligned representations. Across seven benchmarks, KEPIL achieves state-of-the-art zero-shot inference performance; under prompt-variation tests, it improves AUC by 6.37% on CheXpert and by 4.11% on average. These results suggest that structured knowledge and robust prompt design are key to clinically reliable radiology-facing VLMs. Code will be released at https://github.com/Roypic/KEPIL.

ARXIV Cancer: brain tumor Method: federated learning

MedFL-Stress: A Systematic Robustness Evaluation of Federated Brain Tumor Segmentation under Cross-Hospital MRI Appearance Shift

Kiran Naseer, Naveed Anwer Butt
Published 2026-05-09 16:04

This paper presents MedFL-Stress, a stress-testing framework for evaluating the robustness of federated brain tumor segmentation models across hospitals. The study argues for assessing performance at individual sites rather than relying solely on average metrics. By applying graded MRI appearance shifts to 2D axial slices from BraTS 2020 distributed across four simulated hospital clients, the authors show that mean-only evaluation can hide substantial performance disparities. FedBN narrows the gap between the best- and worst-performing hospitals by 41% (0.0850 to 0.0503 Dice) while lowering mean Dice only from 0.8159 to 0.8109.

Federated learning enables hospitals to collaboratively train segmentation models without sharing patient data. However, current evaluation protocols report only average performance across clients, masking failures at individual sites. In clinical deployment, a model that fails consistently at one hospital is a real safety risk that a good mean score can hide entirely. We introduce MedFL-Stress, a controlled stress-testing framework that exposes exactly this failure mode. Using 2D axial slices from BraTS 2020 distributed across four simulated hospital clients, we apply graded MRI appearance shifts (gamma contrast, scale-shift, and noise-plus-blur) reflecting scanner and acquisition variability in real multi-site deployments. Three federated baselines are evaluated: FedAvg, FedProx, and FedBN. Worst-hospital Dice and inter-hospital disparity are treated as primary metrics, not supplementary observations. FedAvg achieves the highest global mean Dice (0.8159) but conceals a 0.0850 gap between its best and worst-performing hospital. FedBN closes that gap by 41% (0.0850 to 0.0503) while sacrificing less than half a Dice point in mean accuracy (0.8159 to 0.8109), and the weakest hospital gains 3.5 Dice points outright (0.7309 to 0.7656). These findings demonstrate that robustness-oriented evaluation protocols are essential for reliable federated medical imaging deployment.
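The arithmetic behind the headline figures can be checked with a short sketch. The `FEDAVG` and `FEDBN` values are the numbers quoted in the abstract; the helper names and the per-hospital scores passed to `site_metrics` are hypothetical and serve only to show how worst-hospital Dice and inter-hospital disparity might be computed from per-site results.

```python
def site_metrics(dice_by_hospital):
    """Mean Dice, worst-hospital Dice, and best-minus-worst gap for one method."""
    scores = list(dice_by_hospital.values())
    return {
        "mean": sum(scores) / len(scores),
        "worst": min(scores),
        "gap": max(scores) - min(scores),
    }


def relative_gap_reduction(gap_before, gap_after):
    """Fraction of the inter-hospital gap that was closed."""
    return (gap_before - gap_after) / gap_before


# Figures quoted in the abstract.
FEDAVG = {"mean": 0.8159, "gap": 0.0850, "worst": 0.7309}
FEDBN = {"mean": 0.8109, "gap": 0.0503, "worst": 0.7656}

reduction = relative_gap_reduction(FEDAVG["gap"], FEDBN["gap"])
mean_cost = FEDAVG["mean"] - FEDBN["mean"]
worst_gain = FEDBN["worst"] - FEDAVG["worst"]

print(f"gap closed: {reduction:.1%}")
print(f"mean Dice given up: {mean_cost:.4f}")
print(f"worst-hospital Dice gained: {worst_gain:.4f}")

# Hypothetical per-hospital scores, only to show site_metrics in use.
example = site_metrics({"A": 0.80, "B": 0.75, "C": 0.85, "D": 0.70})
print(example)
```

Reporting `worst` and `gap` next to `mean` is what lets a consistently failing site show up in the results instead of being averaged away.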

ARXIV Cancer: unknown Method: 3D medical vision-language models

Lost in Volume: The CT-SpatialVQA Benchmark for Evaluating Semantic-Spatial Understanding of 3D Medical Vision-Language Models

Mashrafi Monon, Umaima Rahman, Asif Hanif, Numan Saeed, Mohammad Yaqub
Published 2026-05-09 08:16

This paper introduces CT-SpatialVQA, a benchmark for evaluating the semantic-spatial reasoning of 3D medical vision-language models on CT volumes. It consists of 9077 clinically grounded question-answer pairs derived from 1601 radiology reports and CT volumes, validated through an LLM-assisted pipeline with a 95% human consensus agreement rate. Eight 3D medical VLMs average 34% accuracy on these tasks, often below random, indicating a need for deeper integration of volumetric evidence in clinical applications.

Recent advances in 3D medical vision-language models have enabled joint reasoning over volumetric images and text, showing strong performance in medical visual question-answering (VQA) and report generation. Despite this progress, it remains unclear whether these models learn spatially grounded anatomy from 3D volumes or rely primarily on learned priors and language correlations. This uncertainty stems from the lack of systematic evaluation of semantic-spatial reasoning in volumetric medical VLMs for clinically reliable decision support. To address this gap, we introduce CT-SpatialVQA, a benchmark designed to evaluate semantic-spatial reasoning in 3D CT data. The benchmark comprises 9077 clinically grounded question-answer (QA) pairs derived directly from 1601 radiology reports and CT volumes, which are validated via a robust LLM-assisted pipeline with a 95% human consensus agreement rate. Our dataset requires explicit anatomical localization, laterality awareness, structural comparison, and 3D inter-structure relational reasoning. We also introduce a standardized evaluation protocol and benchmark eight 3D medical VLMs, finding severe degradation on semantic-spatial reasoning tasks, averaging 34% accuracy and often below random, highlighting the need for deeper integration of volumetric evidence for trustworthy clinical use.

ARXIV Cancer: unknown Method: automated pipeline

Automated Optical Density Normalization for Myelin Quantification: Cross-Modal Validation with 7T Ex Vivo MRI

Zahra Khodakarami, Sheina Emrani, Pulkit Khandelwal, Chinmayee Athalye, Amanda Denning, Winifred Trotman, Lisa M Levorse, Eric Teunissen-Bermeo, Hamsanandini Radhakrishnan, Daniel Ohm, Christophe Olm, Noah Capp, Ranjit Ittyerah, Karthik Prabhakaran, John A. Detre, Sandhitsu R. Das, David A. Wolk, Corey T McMillan, Gabor Mizsei, M. Dylan Tisdall, David J Irwin, John L. Robinson, Edward B Lee, Paul A. Yushkevich
Published 2026-05-09 05:46

This study presents an automated pipeline for quantifying myelin concentration by normalizing Optical Density in histological images. The method addresses staining variability that complicates comparisons across histology runs. Validation against expert ratings and 7T ex vivo MRI shows strong correlation, indicating the pipeline effectively captures myelin pathology. This framework lays the groundwork for future in vivo myelin mapping and biomarker discovery.

White matter hyperintensities (WMH) are bright regions on T2-weighted magnetic resonance imaging (MRI) scans and are associated with cerebrovascular pathology and neurodegeneration, including myelin loss. While Luxol Fast Blue histopathology provides visualization of myelin integrity, quantitative analysis requires measuring Optical Density as a proxy for myelin concentration. However, differences in laboratory protocols and tissue processing introduce staining variability that acts as systematic noise, obscuring the biological signal and preventing consistent comparison across histology runs. To address this, we developed an automated pipeline that identifies reference (non-pathologic) regions in whole-slide images to compute normalized Optical Density heatmaps. We validated this approach through two complementary evaluations: (1) comparison against expert ratings of myelin loss severity, and (2) cross-modal spatial comparison with co-registered 7T ex vivo MRI for voxel-wise evaluation within white matter regions. The pipeline's reference selection showed strong concordance with expert-identified reference regions, and normalized Optical Density demonstrated a substantially stronger correlation with MRI signal intensity than raw measurements. This correlation persisted within WMH, confirming that the pipeline captures continuous myelin pathology rather than merely the presence or absence of myelin loss contrast. By mitigating staining artifacts, this pipeline provides a robust, validated framework for quantitative cross-modal comparison, establishing a critical methodological foundation for future translation to in vivo myelin mapping and biomarker discovery.

ARXIV Cancer: unknown Method: convolutional neural network

TimeLesSeg: Unified Contrast-Agnostic Cross-Sectional and Longitudinal MS Lesion Segmentation via a Stochastic Generative Model

Vicent Caselles-Ballester, Eloy Martínez-Heras, Giuseppe Pontillo, Zoe Mendelsohn, Elena M. Marrón, Juan Luis García Fernández, Laia Subirats, Jon Stutters, Jeremy Chataway, Frederik Barkhof, Sara Llufriu, Ferran Prados
Published 2026-05-08 16:19

This paper presents TimeLesSeg, a unified framework that segments multiple sclerosis (MS) lesions with a single convolutional neural network. The method is contrast-agnostic and processes both cross-sectional and longitudinal inputs. On three public and two in-house datasets, TimeLesSeg outperforms the contrast-agnostic state of the art on single-modality inputs across overlap- and distance-based metrics; in longitudinal processing it outperforms SAMSEG.

Multiple sclerosis (MS) exhibits substantial clinical and radiological heterogeneity, which poses significant challenges for automatic lesion segmentation. The current deep learning-based state of the art (SOTA) is highly susceptible to changes both in input distribution (e.g., a change of scanner) and in input structure, as is evident in the current divide between cross-sectional and longitudinal approaches. We introduce TimeLesSeg, a unified contrast-agnostic framework designed to segment MS lesions regardless of the presence of a temporal dimension in its inputs, with a single convolutional neural network. Our approach models pathological priors through lesion masks, which are processed together with the current scan. Cross-sectional processing is enabled by exposing the model to training cases where no prior information is available, which are modeled with an empty mask, allowing it to operate seamlessly in both scenarios. To overcome the scarcity and inconsistency of longitudinal datasets, we propose a novel generative pipeline in which patterns of lesion evolution are simulated by stochastically deforming each individual lesion with morphological operations, producing realistic prior timepoints. In parallel, we achieve contrast agnosticism through Gaussian mixture model-based domain randomization, enabling the network to experience a wide spectrum of intensity profiles. Results on three publicly available and two in-house datasets show that TimeLesSeg outperforms the contrast-agnostic state of the art on single-modality inputs across overlap- and distance-based metrics. In longitudinal processing, our method outperforms SAMSEG, and captures lesion load dynamics more accurately than both the former and LST-AI. All source code related to the development of TimeLesSeg is available at https://github.com/NeuroADaS-Lab/TimeLesSeg.
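The idea of simulating a prior timepoint by stochastically deforming each lesion with morphological operations might look roughly like the sketch below. It is an illustration under stated assumptions, not the TimeLesSeg pipeline: the 2D list-of-lists mask, 4-connectivity, the single-step erosion and dilation, the uniform choice among four operations, and all function names are hypothetical.

```python
import random
from collections import deque

NEIGHBORS = ((1, 0), (-1, 0), (0, 1), (0, -1))


def find_lesions(mask):
    """Split a binary 2D mask into 4-connected lesions (sets of (row, col))."""
    rows, cols = len(mask), len(mask[0])
    seen, lesions = set(), []
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and (r, c) not in seen:
                lesion, queue = set(), deque([(r, c)])
                seen.add((r, c))
                while queue:
                    y, x = queue.popleft()
                    lesion.add((y, x))
                    for dy, dx in NEIGHBORS:
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and (ny, nx) not in seen):
                            seen.add((ny, nx))
                            queue.append((ny, nx))
                lesions.append(lesion)
    return lesions


def dilate(lesion, rows, cols):
    """Grow a lesion by one pixel in each 4-neighbor direction, within bounds."""
    grown = set(lesion)
    for y, x in lesion:
        for dy, dx in NEIGHBORS:
            ny, nx = y + dy, x + dx
            if 0 <= ny < rows and 0 <= nx < cols:
                grown.add((ny, nx))
    return grown


def erode(lesion):
    """Keep only pixels whose four neighbors all belong to the lesion."""
    return {(y, x) for y, x in lesion
            if all((y + dy, x + dx) in lesion for dy, dx in NEIGHBORS)}


def simulate_prior_mask(mask, rng):
    """Build a synthetic prior-timepoint mask by deforming each lesion at random."""
    rows, cols = len(mask), len(mask[0])
    prior = [[0] * cols for _ in range(rows)]
    for lesion in find_lesions(mask):
        operation = rng.choice(("erode", "dilate", "keep", "drop"))
        if operation == "erode":
            lesion = erode(lesion)
        elif operation == "dilate":
            lesion = dilate(lesion, rows, cols)
        elif operation == "drop":
            lesion = set()  # lesion absent at the prior timepoint
        for y, x in lesion:
            prior[y][x] = 1
    return prior


# Toy current-timepoint mask: one 3x3 lesion and one single-pixel lesion.
CURRENT = [[0] * 7 for _ in range(7)]
for row in range(1, 4):
    for col in range(1, 4):
        CURRENT[row][col] = 1
CURRENT[5][5] = 1

prior_mask = simulate_prior_mask(CURRENT, random.Random(0))
```

An all-zero prior mask would correspond to the cross-sectional case the abstract describes, where no prior information is available.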
