Research Papers

ARXIV Cancer: skin cancer Method: transformer

Composed Vision-Language Retrieval for Skin Cancer Case Search via Joint Alignment of Global and Local Representations

Yuheng Wang, Yuji Lin, Dongrun Zhu, Jiayue Cai, Sunil Kalia, Harvey Lui, Chunqi Chang, Z. Jane Wang, Tim K. Lee
Published 2026-03-10 02:42

This study focuses on composed vision-language retrieval for skin cancer, aiming to enhance diagnostic decision-making through improved medical image retrieval. The authors propose a transformer-based framework that learns hierarchical query representations and performs joint global-local alignment between image-text pairs and candidate images. Experiments on the Derm7pt dataset show that this method achieves consistent improvements over existing state-of-the-art approaches, facilitating efficient access to relevant medical records.

Read abstract

Medical image retrieval aims to identify clinically relevant lesion cases to support diagnostic decision making, education, and quality control. In practice, retrieval queries often combine a reference lesion image with textual descriptors such as dermoscopic features. We study composed vision-language retrieval for skin cancer, where each query consists of an image to text pair and the database contains biopsy-confirmed, multi-class disease cases. We propose a transformer based framework that learns hierarchical composed query representations and performs joint global-local alignment between queries and candidate images. Local alignment aggregates discriminative regions via multiple spatial attention masks, while global alignment provides holistic semantic supervision. The final similarity is computed through a convex, domain-informed weighting that emphasizes clinically salient local evidence while preserving global consistency. Experiments on the public Derm7pt dataset demonstrate consistent improvements over state-of-the-art methods. The proposed framework enables efficient access to relevant medical records and supports practical clinical deployment.

ARXIV Cancer: general cancer Method: multi-modal large language model

Meissa: Multi-modal Medical Agentic Intelligence

Yixiong Chen, Xinyi Bai, Yue Pan, Zongwei Zhou, Alan Yuille
Published 2026-03-09 23:22

The paper presents Meissa, a lightweight multi-modal large language model (MM-LLM) designed for offline medical applications. It incorporates agentic capabilities to enhance decision-making in clinical environments by learning to engage in multi-step interactions. The model is trained on 40,000 curated trajectories and demonstrates superior performance compared to existing frontier models across various medical benchmarks. Meissa operates with significantly reduced latency and fewer parameters, addressing the challenges of API-based deployments.

Read abstract

Multi-modal large language models (MM-LLMs) have shown strong performance in medical image understanding and clinical reasoning. Recent medical agent systems extend them with tool use and multi-agent collaboration, enabling complex decision-making. However, these systems rely almost entirely on frontier models (e.g., GPT), whose API-based deployment incurs high cost, high latency, and privacy risks that conflict with on-premise clinical requirements. We present Meissa, a lightweight 4B-parameter medical MM-LLM that brings agentic capability offline. Instead of imitating static answers, Meissa learns both when to engage external interaction (strategy selection) and how to execute multi-step interaction (strategy execution) by distilling structured trajectories from frontier models. Specifically, we propose: (1) Unified trajectory modeling: trajectories (reasoning and action traces) are represented within a single state-action-observation formalism, allowing one model to generalize across heterogeneous medical environments. (2) Three-tier stratified supervision: the model's own errors trigger progressive escalation from direct reasoning to tool-augmented and multi-agent interaction, explicitly learning difficulty-aware strategy selection. (3) Prospective-retrospective supervision: pairing exploratory forward traces with hindsight-rationalized execution traces enables stable learning of effective interaction policies. Trained on 40K curated trajectories, Meissa matches or exceeds proprietary frontier agents in 10 of 16 evaluation settings across 13 medical benchmarks spanning radiology, pathology, and clinical reasoning. Using over 25x fewer parameters than typical frontier models like Gemini-3, Meissa operates fully offline with 22x lower end-to-end latency compared to API-based deployment. Data, models, and environments are released at https://github.com/Schuture/Meissa.

ARXIV Cancer: general cancer Method: large language model

PathoScribe: Transforming Pathology Data into a Living Library with a Unified LLM-Driven Framework for Semantic Retrieval and Clinical Integration

Abdul Rehman Akbar, Samuel Wales-McGrath, Alejadro Levya, Lina Gokhale, Rajendra Singh, Wei Chen, Anil Parwani, Muhammad Khalid Khan Niazi
Published 2026-03-09 21:09

The paper presents PathoScribe, a unified retrieval-augmented large language model (LLM) framework aimed at transforming static pathology archives into a searchable and reasoning-enabled living library. The system facilitates natural language case exploration, automated cohort construction, and clinical question answering, achieving high performance in retrieval and reasoning tasks. Evaluated on a large dataset of surgical pathology reports, PathoScribe demonstrates significant improvements in efficiency and accuracy compared to traditional methods.

Read abstract

Pathology underpins modern diagnosis and cancer care, yet its most valuable asset, the accumulated experience encoded in millions of narrative reports, remains largely inaccessible. Although institutions are rapidly digitizing pathology workflows, storing data without effective mechanisms for retrieval and reasoning risks transforming archives into a passive data repository, where institutional knowledge exists but cannot meaningfully inform patient care. True progress requires not only digitization, but the ability for pathologists to interrogate prior similar cases in real time while evaluating a new diagnostic dilemma. We present PathoScribe, a unified retrieval-augmented large language model (LLM) framework designed to transform static pathology archives into a searchable, reasoning-enabled living library. PathoScribe enables natural language case exploration, automated cohort construction, clinical question answering, immunohistochemistry (IHC) panel recommendation, and prompt-controlled report transformation within a single architecture. Evaluated on 70,000 multi-institutional surgical pathology reports, PathoScribe achieved perfect Recall@10 for natural language case retrieval and demonstrated high-quality retrieval-grounded reasoning (mean reviewer score 4.56/5). Critically, the system operationalized automated cohort construction from free-text eligibility criteria, assembling research-ready cohorts in minutes (mean 9.2 minutes) with 91.3% agreement to human reviewers and no eligible cases incorrectly excluded, representing orders-of-magnitude reductions in time and cost compared to traditional manual chart review. This work establishes a scalable foundation for converting digital pathology archives from passive storage systems into active clinical intelligence platforms.

ARXIV Cancer: melanoma Method: transfer learning

A Lightweight Multi-Cancer Tumor Localization Framework for Deployable Digital Pathology

Brian Isett, Rebekah Dadey, Aofei Li, Ryan C. Augustin, Kate Smith, Aatur D. Singhi, Qiangqiang Gu, Riyue Bao
Published 2026-03-09 19:00

This study presents a multi-cancer tumor localization model (MuCTaL) designed to improve tumor detection across various cancer types using a balanced training approach. The model was trained on a large dataset of tiles from four different cancers and utilized transfer learning with DenseNet169. Results indicated high performance in detecting tumors, achieving a tile-level ROC-AUC of 0.97 on validation data and 0.71 on an independent cohort. The developed framework also allows for the generation of spatial tumor probability heatmaps for digital pathology applications.

Read abstract

Accurate localization of tumor regions from hematoxylin and eosin-stained whole-slide images is fundamental for translational research including spatial analysis, molecular profiling, and tissue architecture investigation. However, deep learning-based tumor detection trained within specific cancers may exhibit reduced robustness when applied across different tumor types. We investigated whether balanced training across cancers at modest scale can achieve high performance and generalize to unseen tumor types. A multi-cancer tumor localization model (MuCTaL) was trained on 79,984 non-overlapping tiles from four cancers (melanoma, hepatocellular carcinoma, colorectal cancer, and non-small cell lung cancer) using transfer learning with DenseNet169. The model achieved a tile-level ROC-AUC of 0.97 in validation data from the four training cancers, and 0.71 on an independent pancreatic ductal adenocarcinoma cohort. A scalable inference workflow was built to generate spatial tumor probability heatmaps compatible with existing digital pathology tools. Code and models are publicly available at https://github.com/AivaraX-AI/MuCTaL.

ARXIV Cancer: brain tumor Method: vision transformer

Agentic LLM Workflow for MR Spectroscopy Volume-of-Interest Placements in Brain Tumors

Sangyoon Lee, Francesca Branzoli, Małgorzata Marjańska, Patrick Bolan
Published 2026-03-09 18:13

This study presents an agentic large language model (LLM) workflow designed to enhance the placement of magnetic resonance spectroscopy (MRS) volume-of-interest (VOI) in brain tumors. The method involves generating diverse candidate VOIs using vision transformer-based models, allowing for the selection of optimal placements based on quantitative metrics. The proposed approach demonstrates improved tumor coverage and necrosis avoidance in clinical cases compared to traditional expert placements.

Read abstract

Magnetic resonance spectroscopy (MRS) provides clinically valuable metabolic characterization of brain tumors, but its utility depends on accurate placement of the spectroscopy volume-of-interest (VOI). However, VOI placement typically has a broad operating window: for a given tumor there are multiple possible VOIs that would lead to high-quality MRS measurements. Thus, a VOI place-ment can be tuned for clinician preference, case-specific anatomy, and clinical pri-orities, which leads to high inter-operator variability, especially for heterogeneous tumors. We propose an agentic large language model (LLM) workflow that de-composes VOI placement into generation of diverse candidate VOIs, from which the LLM selects an optimal one based on quantitative metrics. Candidate VOIs are generated by vision transformer-based placement models trained with differ-ent objective function preferences, which allows selection from acceptable alterna-tives rather than a single deterministic placement. On 110 clinical brain tumor cas-es, the agentic workflow achieves improved solid tumor coverage and necrosis avoidance depending on the user preferences compared to the general-purpose expert placements. Overall, the proposed workflow provides a strategy to adapt VOI placement to different clinical objectives without retraining task-specific models.

ARXIV Cancer: colorectal cancer Method: weakly supervised learning

Weakly Supervised Teacher-Student Framework with Progressive Pseudo-mask Refinement for Gland Segmentation

Hikmat Khan, Wei Chen, Muhammad Khalid Khan Niazi
Published 2026-03-09 16:54

This study presents a weakly supervised teacher-student framework aimed at improving gland segmentation in colorectal cancer histopathological grading. The method utilizes sparse pathologist annotations and an Exponential Moving Average stabilized teacher network to generate refined pseudo masks. Evaluation on multiple datasets demonstrated the framework's effectiveness, achieving a mean Intersection over Union of 80.10 and a mean Dice coefficient of 89.10, indicating robust generalization across cohorts.

Read abstract

Background and objectives: Colorectal cancer histopathological grading depends on accurate segmentation of glandular structures. Current deep learning approaches rely on large scale pixel level annotations that are labor intensive and difficult to obtain in routine clinical practice. Weakly supervised semantic segmentation offers a promising alternative. However, class activation map based methods often produce incomplete pseudo masks that emphasize highly discriminative regions and fail to supervise unannotated glandular structures. We propose a weakly supervised teacher student framework that leverages sparse pathologist annotations and an Exponential Moving Average stabilized teacher network to generate refined pseudo masks. Methods: The framework integrates confidence based filtering, adaptive fusion of teacher predictions with limited ground truth, and curriculum guided refinement to progressively segment unannotated glandular regions. The method was evaluated on an institutional colorectal cancer cohort from The Ohio State University Wexner Medical Center consisting of 60 hematoxylin and eosin stained whole slide images and on public datasets including the Gland Segmentation dataset, TCGA COAD, TCGA READ, and SPIDER. Results: On the Gland Segmentation dataset the framework achieved a mean Intersection over Union of 80.10 and a mean Dice coefficient of 89.10. Cross cohort evaluation demonstrated robust generalization on TCGA COAD and TCGA READ without additional annotations, while reduced performance on SPIDER reflected domain shift. Conclusions: The proposed framework provides an annotation efficient and generalizable approach for gland segmentation in colorectal histopathology.

ARXIV Cancer: glioma Method: rectified flow model

Rectified flow-based prediction of post-treatment brain MRI from pre-radiotherapy priors for patients with glioma

Selena Huisman, Nordin Belkacemi, Vera Keil, Joost Verhoeff, Szabolcs David
Published 2026-03-09 13:43

This study investigates the use of AI-driven conditional image generation to predict follow-up MRI for patients with glioma after radiotherapy. A 2D rectified flow model was developed using the SAILOR dataset, incorporating pre-treatment MRI and treatment data. The model demonstrated high fidelity in generating realistic MRI images, achieving a structural similarity index measure (SSIM) of 0.88 and a Dice coefficient of 0.91. This approach has potential applications in optimizing treatment planning and predicting patient outcomes.

Read abstract

Purpose/Objective: Brain tumors result in 20 years of lost life on average. Standard therapies induce complex structural changes in the brain that are monitored through MRI. Recent developments in artificial intelligence (AI) enable conditional multimodal image generation from clinical data. In this study, we investigate AI-driven generation of follow-up MRI in patients with in- tracranial tumors through conditional image generation. This approach enables realistic modeling of post-radiotherapy changes, allowing for treatment optimization. Material/Methods: The public SAILOR dataset of 25 patients was used to create a 2D rectified flow model conditioned on axial slices of pre-treatment MRI and RT dose maps. Cross-attention conditioning was used to incorporate temporal and chemotherapy data. The resulting images were validated with structural similarity index measure (SSIM), peak signal-to-noise ratio (PSNR), Dice scores and Jacobian determinants. Results: The resulting model generates realistic follow-up MRI for any time point, while integrating treatment information. Comparing real versus predicted images, SSIM is 0.88, and PSNR is 22.82. Tissue segmentations from real versus predicted MRI result in a mean Dice-Sørensen coefficient (DSC) of 0.91. The rectified flow (RF) model enables up to 250x faster inference than Denoising Diffusion Probabilistic Models (DDPM). Conclusion: The proposed model generates realistic follow-up MRI in real-time, preserving both semantic and visual fidelity as confirmed by image quality metrics and tissue segmentations. Conditional generation allows counterfactual simulations by varying treatment parameters, producing predicted morphological changes. This capability has potential to support adaptive treatment dose planning and personalized outcome prediction for patients with intracranial tumors.

ARXIV Cancer: head and neck cancer Method: multiple instance learning

Beyond Attention Heatmaps: How to Get Better Explanations for Multiple Instance Learning Models in Histopathology

Mina Jamshidi Idaji, Julius Hense, Tom Neuhäuser, Augustin Krause, Yanqing Luo, Oliver Eberle, Thomas Schnake, Laure Ciernik, Farnoush Rezaei Jafari, Reza Vahidimajd, Jonas Dippel, Christoph Walz, Frederick Klauschen, Andreas Mock, Klaus-Robert Müller
Published 2026-03-09 12:45

This study introduces a framework for evaluating the quality of heatmaps generated by multiple instance learning (MIL) models in histopathology. The authors conduct a benchmark experiment to assess various explanation methods and find that the quality of explanations is influenced by the model architecture and task type. They demonstrate the correlation of MIL heatmaps with spatial transcriptomics and highlight the discovery of strategies for predicting human papillomavirus (HPV) infection from head and neck cancer slides.

Read abstract

Multiple instance learning (MIL) has enabled substantial progress in computational histopathology, where a large amount of patches from gigapixel whole slide images are aggregated into slide-level predictions. Heatmaps are widely used to validate MIL models and to discover tissue biomarkers. Yet, the validity of these heatmaps has barely been investigated. In this work, we introduce a general framework for evaluating the quality of MIL heatmaps without requiring additional labels. We conduct a large-scale benchmark experiment to assess six explanation methods across histopathology task types (classification, regression, survival), MIL model architectures (Attention-, Transformer-, Mamba-based), and patch encoder backbones (UNI2, Virchow2). Our results show that explanation quality mostly depends on MIL model architecture and task type, with perturbation ("Single"), layer-wise relevance propagation (LRP), and integrated gradients (IG) consistently outperforming attention-based and gradient-based saliency heatmaps, which often fail to reflect model decision mechanisms. We further demonstrate the advanced capabilities of the best-performing explanation methods: (i) We provide a proof-of-concept that MIL heatmaps of a bulk gene expression prediction model can be correlated with spatial transcriptomics for biological validation, and (ii) showcase the discovery of distinct model strategies for predicting human papillomavirus (HPV) infection from head and neck cancer slides. Our work highlights the importance of validating MIL heatmaps and establishes that improved explainability can enable more reliable model validation and yield biological insights, making a case for a broader adoption of explainable AI in digital pathology. Our code is provided in a public GitHub repository: https://github.com/bifold-pathomics/xMIL/tree/xmil-journal

ARXIV Cancer: general cancer Method: transfer learning

EndoSERV: A Vision-based Endoluminal Robot Navigation System

Junyang Wu, Fangfang Xie, Minghui Zhang, Hanxiao Zhang, Jiayuan Sun, Yun Gu, Guang-Zhong Yang
Published 2026-03-09 12:44

This paper introduces EndoSERV, a vision-based localization method designed to enhance robot-assisted endoluminal procedures for early cancer intervention. The method addresses challenges in navigation due to tissue deformation and lack of distinctive landmarks by employing a segment-to-structure and real-to-virtual mapping approach. Extensive experiments validate the effectiveness of EndoSERV, demonstrating its capability to operate without real pose labels.

Read abstract

Robot-assisted endoluminal procedures are increasingly used for early cancer intervention. However, the intricate, narrow and tortuous pathways within the luminal anatomy pose substantial difficulties for robot navigation. Vision-based navigation offers a promising solution, but existing localization approaches are error-prone due to tissue deformation, in vivo artifacts and a lack of distinctive landmarks for consistent localization. This paper presents a novel EndoSERV localization method to address these challenges. It includes two main parts, \textit{i.e.}, \textbf{SE}gment-to-structure and \textbf{R}eal-to-\textbf{V}irtual mapping, and hence the name. For long-range and complex luminal structures, we divide them into smaller sub-segments and estimate the odometry independently. To cater for label insufficiency, an efficient transfer technique maps real image features to the virtual domain to use virtual pose ground truth. The training phases of EndoSERV include an offline pretraining to extract texture-agnostic features, and an online phase that adapts to real-world conditions. Extensive experiments based on both public and clinical datasets have been performed to demonstrate the effectiveness of the method even without any real pose labels.

ARXIV Cancer: unknown Method: latent diffusion model

Retrieval-Augmented Anatomical Guidance for Text-to-CT Generation

Daniele Molino, Camillo Maria Caruso, Paolo Soda, Valerio Guarrasi
Published 2026-03-09 12:27

This paper presents a retrieval-augmented approach for generating CT images from text descriptions, addressing the limitations of existing methods that either lack anatomical guidance or rely on ground-truth annotations. The proposed method utilizes a 3D vision-language encoder to retrieve clinically relevant cases and incorporates their anatomical annotations into a text-conditioned latent diffusion model. Experimental results demonstrate that this approach enhances image fidelity and clinical consistency while allowing for explicit spatial control.

Read abstract

Text-conditioned generative models for volumetric medical imaging provide semantic control but lack explicit anatomical guidance, often resulting in outputs that are spatially ambiguous or anatomically inconsistent. In contrast, structure-driven methods ensure strong anatomical consistency but typically assume access to ground-truth annotations, which are unavailable when the target image is to be synthesized. We propose a retrieval-augmented approach for Text-to-CT generation that integrates semantic and anatomical information under a realistic inference setting. Given a radiology report, our method retrieves a semantically related clinical case using a 3D vision-language encoder and leverages its associated anatomical annotation as a structural proxy. This proxy is injected into a text-conditioned latent diffusion model via a ControlNet branch, providing coarse anatomical guidance while maintaining semantic flexibility. Experiments on the CT-RATE dataset show that retrieval-augmented generation improves image fidelity and clinical consistency compared to text-only baselines, while additionally enabling explicit spatial controllability, a capability inherently absent in such approaches. Further analysis highlights the importance of retrieval quality, with semantically aligned proxies yielding consistent gains across all evaluation axes. This work introduces a principled and scalable mechanism to bridge semantic conditioning and anatomical plausibility in volumetric medical image synthesis. Code will be released.

Find the papers that actually matter