Research Papers

ARXIV Cancer: unknown Method: Trajectory-Integral Feedback GRPO

Regulating Anatomy-Aware Rewards via Trajectory-Integral Feedback for Volumetric Computed Tomography Analysis

Tianwei Lin, Zhongwei Qiu, Jie Cao, Jiang Liu, Wenjie Yan, Bo Zhang, Yu Zhong, Wenqiao Zhang, Yingda Xia, Ling Zhang
Published 2026-05-19 04:33

This paper addresses the limitations of current medical vision-language models (VLMs) in 3D Computed Tomography (CT) analysis, particularly the mismatch between optimization objectives and clinical accuracy. The authors introduce the Clinical Abnormality Benchmarking Substrate (CABS) to improve the reliability of radiology reports and propose a novel framework called Trajectory-Integral Feedback GRPO (TIF-GRPO) that integrates control-theoretic principles into policy optimization. Experimental results indicate that TIF-GRPO significantly enhances abnormality detection and clinical accuracy in CT analysis.

Medical vision-language models (VLMs) have rapidly advanced as general-purpose multimodal assistants, yet their deployment in 3D Computed Tomography (CT) analysis remains constrained by a persistent mismatch between optimization objectives and clinical rigor. Current Reinforcement Learning (RL) paradigms still rely on lexical proxy signals that induce "Evaluation Hallucinations", where models optimize linguistic fluency rather than factual clinical correctness, leading to diagnostically critical errors. To bridge this gap, we introduce the Clinical Abnormality Benchmarking Substrate (CABS), a structured system that decomposes radiology reports into verifiable clinical semantic units. Using CABS, we identify a "Mechanistic Divergence" in standard RL, where surface-similarity rewards drive policy gradients to bypass medical facts. We therefore propose Trajectory-Integral Feedback GRPO (TIF-GRPO), a novel framework integrating control-theoretic principles into policy optimization. By formulating clinical reasoning as a pseudo-temporal trajectory for anomaly discovery, TIF-GRPO regulates anatomy-aware rewards via an integral feedback loop that penalizes persistent omissions as cumulative state errors and suppresses hallucinations as excessive control effort. Experiments on 3D CT benchmarks demonstrate that our approach significantly enhances abnormality detection and clinical faithfulness, establishing a new paradigm for fine-grained regulation in medical VLMs. Our project is available on GitHub at https://github.com/ZJU4HealthCare/TIF-GRPO.

ARXIV Cancer: general cancer Method: robust framework

Robust Mitigation of Age-Dependent Confounding Effects via Sample-Difficulty Decorrelation

Nikhil Cherian Kurian, Victor Caquilpan Parra, Abin Shoby, Luke Whitbread, Lyle J. Palmer
Published 2026-05-19 01:01

This paper addresses age-dependent performance disparities in medical image classification, which arise due to age acting as a confounder. The authors propose a framework that mitigates these confounding effects by targeting spurious age-linked trends instead of enforcing strict age invariance. Their method utilizes sample difficulty characterization and robust decorrelation techniques to preserve clinically meaningful age information while reducing diagnostic disparities. The approach demonstrates effectiveness across two radiology datasets, maintaining performance stability under varying age distributions.

Age-dependent performance disparities in medical image classification often arise because age acts as a confounder, linking imaging morphology with disease prevalence. In practice, disparities can manifest as overdiagnosis at ages where disease prevalence is higher and underdiagnosis at ages where prevalence is lower, and can worsen under train-test shifts in the age distribution. Conventional mitigation approaches that enforce strict age invariance may suppress diagnostically meaningful information encoded in age. We therefore propose a robust framework that mitigates the effects of age-dependent confounding by targeting spurious age-linked trends rather than enforcing invariance. Following a warm-up phase, we characterize sample difficulty and model its age-dependent trends in a label-conditioned manner. We decorrelate age from dominant age-difficulty trends using robust, Huber-weighted affinity weights, attenuating confounding-driven shortcuts while preserving clinically meaningful, nonlinear age information. We further introduce an Age Coverage Score that scales the decorrelation penalty by minibatch age variance to ensure stable optimization under limited age diversity. Across two radiology datasets, our approach reduces age-dependent true and false positive disparities with minimal AUC impact and remains robust to increasing train-test age distribution shifts.

ARXIV Cancer: brain tumor Method: quantization-aware training

Quantized Machine Learning Models for Medical Imaging in Low-Resource Healthcare Settings

Sumanth Meenan Kanneti, Aryan Shah
Published 2026-05-19 00:17

This paper presents a multi-strategy compression framework for brain tumor classification from MRI in low-resource healthcare settings. The framework covers quantization-aware training, knowledge distillation from a DenseNet-101 teacher to a compact DenseNet-32 student, and Float16 post-training quantization on a MobileNetV2 backbone; only the MobileNetV2 pipeline is validated experimentally. On that pipeline, the quantized model reaches 82.37% validation accuracy (versus 82.20% at full precision) while shrinking from 35.34 MB to 5.76 MB, which points to its potential for clinical use in resource-constrained environments.

Deep learning models have shown strong performance in medical image analysis, but deploying them in low-resource clinical environments remains difficult due to computational, memory, and power constraints. This paper presents a multi-strategy compression framework for brain tumor classification from MRI, encompassing quantization-aware training, knowledge distillation from a DenseNet-101 teacher to a compact DenseNet-32 student with low-bit post-training quantization, and Float16 post-training quantization on a lightweight MobileNetV2 backbone. Using a multi-class brain tumor MRI dataset containing glioma, meningioma, pituitary tumors, and healthy controls, we provide full experimental validation of the MobileNetV2-based pipeline, training the classifier through a three-stage transfer learning process and applying Float16 quantization via TensorFlow Lite. The DenseNet-based distillation and quantization-aware training strategies are described as complementary compression approaches within the framework, with their complete empirical evaluation reserved for future work. Experimental results on the MobileNetV2 pipeline show that the quantized model achieves 82.37 percent validation accuracy compared to the 82.20 percent full-precision baseline, reducing model size from 35.34 MB to 5.76 MB, a 6.14x compression ratio with no meaningful accuracy loss. Per-class evaluation confirms that quantization preserves diagnostic performance uniformly across all four tumor categories. These findings demonstrate that lightweight quantized models can deliver clinically viable brain tumor screening in resource-constrained healthcare settings.
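The compression figures quoted above can be checked with a few lines of arithmetic. The sketch below uses only the sizes and accuracies stated in the abstract; the function name `compression_summary` is illustrative and does not come from the paper.

```python
def compression_summary(full_mb, quant_mb, full_acc, quant_acc):
    """Return (compression ratio, fractional size reduction, accuracy change in points)."""
    ratio = full_mb / quant_mb            # how many times smaller the quantized file is
    reduction = 1.0 - quant_mb / full_mb  # fraction of the original size that was removed
    acc_delta = quant_acc - full_acc      # positive means the quantized model scored higher
    return ratio, reduction, acc_delta


# Values as reported in the abstract (MB and percent validation accuracy).
ratio, reduction, acc_delta = compression_summary(35.34, 5.76, 82.20, 82.37)
print(f"compression ratio: {ratio:.2f}x")
print(f"size reduction: {reduction:.1%}")
print(f"accuracy change: {acc_delta:+.2f} percentage points")
```

The ratio comes out at about 6.14x, matching the reported figure, and corresponds to a size reduction of roughly 84%.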

ARXIV Cancer: unknown Method: pretrained domain-adapted diffusion model

Generation of Heterogeneous PET Images from Uniform Organ Activity Maps Using a Pretrained Domain-Adapted Diffusion Model

Suya Li, Kaushik Dutta, Debojyoti Pal, Jingqin Luo, Kooresh I. Shoghi
Published 2026-05-18 21:05

This study presents a pretrained domain-adapted diffusion (PAD) model designed for the synthesis of heterogeneous PET images from uniform organ activity maps. The model employs a two-phase training strategy to enhance both coarse uptake distributions and local image details. Evaluation results indicate that the generated images maintain high quantitative accuracy and comparable tumor segmentation performance to actual PET images, demonstrating the model's potential for clinical applications.

Synthetic PET images are valuable for quantitative imaging workflow development, scalable virtual imaging trials, and deep learning model training, but conventional physics-based simulation approaches are computationally intensive, limited in anatomical variability, and often fail to capture heterogeneous PET uptake. This study developed a pretrained domain-adapted diffusion (PAD) model for anatomy-conditioned PET synthesis from uniform organ activity maps. PAD adopts a natural-image pretrained text-to-image decoder with an upstream conditioning encoder and a downstream PET-domain adapter. A two-phase training strategy was used, with the first phase learning coarse uptake distributions and the second refining local image details. Uniform organ activity maps were generated from CT-based segmentations by assigning each organ its mean uptake from the paired PET image. Evaluation included quantitative accuracy, noise assessment, radiomic analysis, tumor segmentation performance, and a human observer study. PAD-generated images achieved high quantitative accuracy, with concordance correlation coefficients above 0.92 between organ mean SUVs and assigned activity values. The synthesized images showed noise levels and texture characteristics similar to target PET images and produced comparable tumor segmentation performance. In a two-alternative forced-choice observer study, four readers achieved approximately 50% accuracy, indicating visual indistinguishability between synthesized and target images. PAD also generated realistic PET images from XCAT-derived activity maps, demonstrating compatibility with phantom-based anatomical priors. Overall, PAD provides a diffusion-based framework for generating clinically relevant heterogeneous PET images from uniform organ activity maps derived from clinical segmentations or digital phantoms, supporting data augmentation and downstream imaging studies.
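The construction of the uniform organ activity maps, where each organ is assigned its mean uptake from the paired PET image, can be sketched as follows. This is a minimal illustration on flattened toy data, not the authors' pipeline; the function name `uniform_activity_map` and the label values are assumptions.

```python
from collections import defaultdict


def uniform_activity_map(labels, pet):
    """Replace each voxel's PET value with the mean PET value of its organ label.

    labels: flat sequence of integer organ labels (one per voxel)
    pet:    flat sequence of PET uptake values, same length as labels
    Returns (activity_map, per_label_means).
    """
    if len(labels) != len(pet):
        raise ValueError("labels and pet must have the same number of voxels")
    sums = defaultdict(float)
    counts = defaultdict(int)
    for lab, val in zip(labels, pet):
        sums[lab] += val
        counts[lab] += 1
    means = {lab: sums[lab] / counts[lab] for lab in sums}
    return [means[lab] for lab in labels], means


# Toy flattened volume: label 0 = background, 1 and 2 = two hypothetical organs.
labels = [0, 1, 1, 2, 2, 2]
pet = [0.0, 2.0, 4.0, 1.0, 2.0, 3.0]
activity, organ_means = uniform_activity_map(labels, pet)
print(activity)  # [0.0, 3.0, 3.0, 2.0, 2.0, 2.0]
```

Every voxel inside an organ ends up with the same value, so the map keeps the anatomy but discards the within-organ heterogeneity that the diffusion model is then asked to synthesize.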

ARXIV Cancer: general cancer Method: classical machine learning classifiers

Beyond Morphology: Quantifying the Diagnostic Power of Color Features in Cancer Classification

Farnaz Kheiri, Shahryar Rahnamayan, Masoud Makrehchi
Published 2026-05-18 15:10

This study investigates the effectiveness of color features in cancer classification, specifically focusing on the discriminative power of global color features without considering morphological information. The authors extracted statistical color moments and color histograms, evaluating their performance using classical machine learning classifiers across various experimental settings. Results indicate that color features alone can achieve high classification accuracies, suggesting their potential as an effective pre-screening tool in cancer diagnostics.

In histopathology, human experts primarily rely on color as a means of enhancing contrast to interpret tissue morphology, whereas machine vision models process color as raw statistical information. This distinction raises a fundamental question: to what extent can pixel intensity alone, independent of structural and morphological cues, support cancer classification? To address this question, we systematically evaluated the standalone discriminative power of global color features while deliberately excluding all morphological information. Specifically, we extracted statistical color moments and discretized RGB and HSV color histograms, and assessed their performance across ten diverse experimental settings using classical machine learning classifiers. Our results demonstrate that color features alone can achieve strong performance in binary diagnostic tasks (e.g., benign versus malignant), with classification accuracies reaching up to 89%. This performance is likely attributable to global chromatic shifts associated with malignancy. Importantly, these simple color-based representations consistently outperformed random baselines by a substantial margin, indicating that raw color distributions encode a non-random and diagnostically relevant signal for cancer detection. Consequently, this study suggests that simple, computationally efficient color features can serve as an effective pre-screening tool. By identifying samples with strong chromatic indicators of malignancy, these lightweight models could function as a first-pass triage system, reducing the computational burden on complex deep learning architectures.
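To make "global color features" concrete, the sketch below computes per-channel color moments and a discretized histogram over a list of RGB pixels while ignoring pixel positions, so no morphological information enters the feature vector. The exact moments, bin count, and normalization used in the paper are not given in the abstract; the choices here (mean, standard deviation, standardized skewness, 8 bins) are assumptions. HSV variants could be obtained by first converting pixels with the standard-library `colorsys.rgb_to_hsv`.

```python
import math


def color_moments(channel):
    """Mean, standard deviation, and standardized skewness of one color channel."""
    n = len(channel)
    mean = sum(channel) / n
    var = sum((v - mean) ** 2 for v in channel) / n
    std = math.sqrt(var)
    if std == 0.0:
        return mean, 0.0, 0.0  # constant channel: skewness is defined as 0 here
    third = sum((v - mean) ** 3 for v in channel) / n
    return mean, std, third / std ** 3


def color_histogram(channel, bins=8, max_value=256):
    """Normalized histogram of channel values in [0, max_value)."""
    width = max_value / bins
    hist = [0] * bins
    for v in channel:
        hist[min(int(v / width), bins - 1)] += 1
    return [h / len(channel) for h in hist]


def global_color_features(pixels, bins=8):
    """Concatenate moments and histogram for each of the three channels."""
    features = []
    for c in range(3):
        channel = [p[c] for p in pixels]
        features.extend(color_moments(channel))
        features.extend(color_histogram(channel, bins))
    return features


# Four hypothetical RGB pixels; a real image would contribute every pixel.
pixels = [(255, 0, 0), (200, 10, 30), (180, 20, 60), (220, 5, 15)]
features = global_color_features(pixels)
print(len(features))  # 3 channels * (3 moments + 8 bins) = 33
```

A feature vector of this kind can be passed directly to a classical classifier, which is what keeps the approach computationally light.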

ARXIV Cancer: unknown Method: self-distilled masked image transformer

Benchmarking transferability of SSL pretraining to same and different modality segmentation tasks

Jue Jiang, Harini Veeraraghavan
Published 2026-05-18 14:41

This study evaluates the transferability of self-supervised learning (SSL) methods for segmentation tasks using 3D CT scans. The research integrates a Swin Transformer encoder into a SwinUNETR-style segmentation network and fine-tunes it on various public segmentation tasks. The results indicate that the self-distilled masked image transformer (SMIT) achieved the highest segmentation accuracy and demonstrated superior data efficiency, particularly in few-shot scenarios.

Methods: Nine SSL methods spanning four pretext-task families were pretrained from scratch using the same 10,412 3D CT scans (1.89 M 2D axial slices) covering varied disease sites. The pretrained Swin Transformer encoder from each method was integrated into a SwinUNETR-style segmentation network (Swin encoder with a 3D CNN decoder and skip connections) and fine-tuned on nine public segmentation tasks of varying complexity, including large abdominal organs, head-and-neck structures, and tumors from CT and MRI. Performance was assessed using Dice similarity coefficient (DSC). Fine-tuning convergence speed, transferability across modalities (CT-to-MRI), and feature-reuse patterns between few- and many-shot fine-tuning were further analyzed using centered kernel alignment. Results: Self-distilled masked image transformer (SMIT), which combines masked image modeling (MIM) with local and global self-distillation, achieved the highest overall segmentation accuracy across the nine tasks, the fastest fine-tuning convergence, and the smallest few-shot-to-many-shot performance gap, indicating the strongest data efficiency. SMIT also showed the most consistent feature-reuse patterns between few- and many-shot fine-tuning. MIM-based SimMIM and self-distillation methods (DINO, iBOT) outperformed contrastive learning and rotation prediction, which rely on image-level global representations. Differences between SSL methods were largest in the few-shot setting and narrowed as the size of the labeled fine-tuning dataset increased, indicating that the choice of SSL pretraining matters most under limited annotation budgets.
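For reference, the Dice similarity coefficient used as the evaluation metric is DSC = 2|A ∩ B| / (|A| + |B|) for a predicted mask A and a ground-truth mask B. A minimal sketch on flattened binary masks follows; returning 1.0 when both masks are empty is a convention assumed here, not something the abstract specifies.

```python
def dice_coefficient(pred, target):
    """Dice similarity coefficient between two flattened binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total


pred = [1, 1, 0, 0]
target = [1, 0, 0, 0]
dsc = dice_coefficient(pred, target)
print(round(dsc, 4))  # 2 * 1 / (2 + 1) = 0.6667
```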

ARXIV Cancer: colorectal cancer Method: multimodal learning

Geometry-Aware Uncertainty Coresets for Robust Visual In-Context Learning in Histopathology

Franciskus Xaverius Erick, Johanna Paula Müller, Bernhard Kainz
Published 2026-05-18 13:54

This paper presents GAUC, a training-free coreset selection method designed to enhance the performance of vision-language models (VLMs) in computational histopathology. The method optimizes three objectives to improve the reliability of in-context learning by addressing issues related to data selection and query phrasing. Experimental results demonstrate that GAUC significantly enhances accuracy, calibration, and robustness in diagnostics without requiring parameter updates.

Vision-language models (VLMs) can couple visual perception with open-ended clinical reasoning, making them attractive for computational histopathology. However, fine-tuning billions of parameters on scarce, expert-annotated pathology data is prohibitive, while in-context learning (ICL), which conditions the VLM on demonstrative image-text pairs without parameter updates, suffers from high sensitivity to which examples are selected and how the query is phrased, producing unreliable diagnostics. Existing selection strategies rely on query-dependent nearest-neighbour retrieval that ignores global data structure, require costly parameter updates, or disregard the joint vision-text embedding geometry of VLMs. We propose GAUC, a training-free coreset selection method operating directly in the pre-trained multimodal embedding space. GAUC jointly optimises three objectives: (1) a Maximum Mean Discrepancy term enforcing distributional fidelity between coreset and full dataset, (2) an Effective Mutual Information Difference regulariser bounding performance degradation under prompt paraphrases by exploiting the VLM's joint vision-text alignment, and (3) a predictive-variance penalty suppressing overconfident, unstable outputs. On CRC-100K and MHIST across multiple open-source VLM architectures, GAUC consistently improves accuracy, calibration, and prompt robustness over recent ICL selection methods and dataset-distillation baselines, all without a single gradient update.
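The first of GAUC's three objectives, distributional fidelity measured by Maximum Mean Discrepancy, can be illustrated in isolation. The sketch below computes a biased (V-statistic) estimate of squared MMD with an RBF kernel between a candidate coreset and the full set of embeddings. The kernel choice, the `gamma` value, and the toy 2D embeddings are assumptions, and the other two GAUC terms are not modeled here.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """RBF kernel exp(-gamma * ||x - y||^2) between two embedding vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mean_kernel(xs, ys, gamma):
    """Average kernel value over all pairs drawn from xs and ys."""
    total = sum(rbf_kernel(x, y, gamma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(coreset, full, gamma=1.0):
    """Biased estimate of squared MMD between coreset and full dataset."""
    return (mean_kernel(coreset, coreset, gamma)
            - 2.0 * mean_kernel(coreset, full, gamma)
            + mean_kernel(full, full, gamma))


# Toy embeddings forming two clusters, near (0, 0) and near (1, 1).
full = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0)]
covers_both = [(0.0, 0.0), (1.0, 1.0)]  # one point from each cluster
covers_one = [(0.0, 0.0), (0.1, 0.0)]   # both points from the same cluster

print(mmd_squared(covers_both, full) < mmd_squared(covers_one, full))  # True
```

A coreset that samples both clusters yields a much smaller discrepancy than one drawn from a single cluster, which is the behavior a fidelity term of this kind rewards.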

ARXIV Cancer: general cancer Method: agentic multimodal learning

PULSE: Agentic Investigation with Passive Sensing for Proactive Intervention in Cancer Survivorship

Zhiyuan Wang, Ariful Islam, Indrajeet Ghosh, Xinyu Chen, Katharine E. Daniel, Subigya Nepal, Philip Chow, Laura E. Barnes
Published 2026-05-17 22:39

The paper presents PULSE, a system designed to enhance emotional support for cancer survivors by utilizing passive smartphone sensing. It employs large language model (LLM) agents that autonomously analyze behavioral data against personalized baselines to improve the accuracy of emotional distress predictions. The evaluation indicates that agentic reasoning significantly enhances performance in emotion regulation and intervention prediction, suggesting its potential in proactive mental health support for cancer survivors.

Cancer survivors face elevated rates of depression, anxiety, and general emotional distress, yet the precise moments they most need support are often the moments when self-report is sparse, a phenomenon we term the diary paradox. Passive smartphone sensing offers a continuous, unobtrusive alternative, but prior sensing-based affect prediction has been limited by an accuracy ceiling, suggesting a bottleneck not only in available data, but in how behavioral signals are interpreted. We present PULSE, a system that shifts from fixed feature pipelines to agentic sensing investigation: LLM agents equipped with eight purpose-built tools autonomously query smartphone sensing data, compare current behavior against personalized baselines, and calibrate inferences through retrieval-augmented population-level comparisons. Rather than receiving pre-formatted feature summaries, agents decide which modalities to inspect, how far back to look, and how deeply to investigate, mirroring hypothesis-driven clinical reasoning. We evaluate PULSE through a 2x2 factorial design crossing reasoning architecture (structured vs. agentic) with data modality (sensing-only vs. with diary) on 50 participants from a longitudinal study of cancer survivors. Agentic reasoning is the primary driver of performance: the agentic multimodal agent achieves a balanced accuracy of 0.743 for emotion regulation desire with diary and sensing data, while agentic agents predict intervention availability at 0.713 with passive sensing data only. These results suggest that agentic investigation may be a cornerstone for unlocking the clinical value of passive sensing, advancing the feasibility of proactive just-in-time mental health support.
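Balanced accuracy, the metric reported above, is the mean of per-class recall, which makes it less sensitive to class imbalance than plain accuracy. A minimal sketch on invented labels follows; the data are illustrative and not taken from the study.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        correct = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


# Imbalanced toy labels: 2 positives, 8 negatives; the model always predicts 0.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [0] * 10

plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
balanced = balanced_accuracy(y_true, y_pred)
print(plain, balanced)  # 0.8 0.5
```

A classifier that always predicts the majority class scores 0.8 on plain accuracy but only 0.5 on balanced accuracy, so values such as 0.743 should be read against a 0.5 chance level in the binary case.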

ARXIV Cancer: cervical cancer Method: vision transformer

Systematic Evaluation of Vision Transformers for Automated Cervical Cancer Classification: Optimization, Statistical Validation, and Clinical Interpretability

Nisreen Albzour, Sarah S. Lam
Published 2026-05-17 03:16

This study evaluates Vision Transformer (ViT) architectures for the automated classification of cervical cancer from Pap smear images. The research optimizes ViT-Tiny to enhance screening accuracy and clinical interpretability, achieving a cross-validation accuracy of 94.9%-95.2%. The findings demonstrate that Vision Transformers can provide accurate decision support while maintaining transparency, addressing limitations of traditional convolutional neural networks.

Manual Pap smear analysis for cervical cancer screening is limited by inter-observer variability, time constraints, and restricted expert availability. Although convolutional neural networks (CNNs) have automated cervical cell classification, they remain limited in modeling long-range spatial dependencies and often lack clinical interpretability. In this study, Vision Transformer (ViT) architectures were systematically optimized to enhance automated cervical cancer screening, which also improved interpretability. The Herlev dataset (917 images: 242 normal, 675 abnormal) was used to optimize ViT-Tiny, a lightweight Vision Transformer architecture designed for reduced computational complexity, through a comprehensive evaluation of augmentation strategies, class weighting, and hyperparameters. The optimal configuration achieved 94.9%-95.2% cross-validation accuracy, with random horizontal flipping and class weighting (0.7 x 1.3) identified as most effective. Gradient-weighted Class Activation Mapping (Grad-CAM) analysis confirmed that model attention corresponded to clinically relevant morphological features, including nuclear regions, cell boundaries, and chromatin texture, which align with cytopathological criteria. These findings indicate that Vision Transformers can deliver accurate and interpretable decision support for cervical cancer screening, fulfilling both the clinical performance and transparency requirements essential for medical AI deployment.

ARXIV Cancer: glioma Method: vision-language models

UCSF-PDGM-VQA: Visual Question Answering dataset for brain tumor MRI interpretation

Shiv Ghosh, Junayd Lateef, Chih-Hua Liu, Yannan Yu, Andreas M. Rauschecker, Madhumita Sushil
Published 2026-05-16 20:10

This paper presents the UCSF-PDGM-VQA dataset, a visual question answering benchmark designed for the interpretation of brain tumor MRIs. It highlights the challenges faced by current vision-language models in processing complex multi-sequence MRI scans, revealing deficiencies in their reliability and safety for clinical use. The study establishes a performance baseline for several state-of-the-art models and emphasizes the need for more robust, domain-specific solutions in neuro-oncology.

Brain tumor diagnosis is largely dependent on Magnetic Resonance Imaging (MRI) evaluation, which requires radiologists to synthesize thousands of images across multiple 3D sequences and longitudinal studies. This process requires advanced neuro-radiology training, poses substantial cognitive load, and is highly time-consuming. Despite increasing demands in radiology, this expertise is difficult to scale, straining the current health systems. Vision-Language Models (VLMs) provide an opportunity to reduce this burden through a semi-automated, interactive interpretation of complex brain MRIs. However, they are currently underutilized in neuro-oncology due to a lack of specialized benchmarks for evaluating them. We introduce a clinically relevant visual question answering (VQA) benchmark -- the UCSF-PDGM-VQA dataset -- consisting of 2,387 QA pairs from 473 glioma-related MRI studies in the public UCSF-PDGM dataset. We further establish a performance baseline for six state-of-the-art vision-language models (VLMs) and one large language model on this dataset. We find that current models are incapable of effectively processing multi-sequence, 3-dimensional MRI scans, thus resulting in a suppression of visual features and over-reliance on language priors, causing modality collapse. These findings underscore a critical deficiency in current model reliability and safety within clinical settings, necessitating the development of robust, domain-specific VLMs.
