Research Papers

ARXIV Cancer: general cancer Method: interactive foundation model

VISTA-PATH: An interactive foundation model for pathology image segmentation and quantitative analysis in computational pathology

Peixian Liang, Songhao Li, Shunsuke Koga, Yutong Li, Zahra Alipour, Yucheng Tang, Daguang Xu, Zhi Huang
Published 2026-01-23 05:06

The paper introduces VISTA-PATH, an interactive foundation model for semantic segmentation of histopathology images that supports quantitative tissue analysis and clinical modeling. The model conditions on visual context, semantic tissue descriptions, and optional expert-provided spatial prompts to produce precise multi-class segmentations across diverse pathology images. VISTA-PATH outperforms existing segmentation foundation models on held-out and external benchmarks and supports dynamic human-in-the-loop refinement. Its segmentations improve tissue microenvironment analysis through a Tumor Interaction Score (TIS) that is significantly associated with patient survival.

Accurate semantic segmentation of histopathology images is crucial for quantitative tissue analysis and downstream clinical modeling. Recent segmentation foundation models have improved generalization through large-scale pretraining, yet remain poorly aligned with pathology because they treat segmentation as a static visual prediction task. Here we present VISTA-PATH, an interactive, class-aware pathology segmentation foundation model designed to resolve heterogeneous structures, incorporate expert feedback, and produce pixel-level segmentations that are directly meaningful for clinical interpretation. VISTA-PATH jointly conditions segmentation on visual context, semantic tissue descriptions, and optional expert-provided spatial prompts, enabling precise multi-class segmentation across heterogeneous pathology images. To support this paradigm, we curate VISTA-PATH Data, a large-scale pathology segmentation corpus comprising over 1.6 million image-mask-text triplets spanning 9 organs and 93 tissue classes. Across extensive held-out and external benchmarks, VISTA-PATH consistently outperforms existing segmentation foundation models. Importantly, VISTA-PATH supports dynamic human-in-the-loop refinement by propagating sparse, patch-level bounding-box annotation feedback into whole-slide segmentation. Finally, we show that the high-fidelity, class-aware segmentation produced by VISTA-PATH makes it a preferred model for computational pathology. It improves tissue microenvironment analysis through the proposed Tumor Interaction Score (TIS), which exhibits strong and significant associations with patient survival. Together, these results establish VISTA-PATH as a foundation model that elevates pathology image segmentation from a static prediction to an interactive and clinically grounded representation for digital pathology. Source code and demo can be found at https://github.com/zhihuanglab/VISTA-PATH.

ARXIV Cancer: general cancer Method: attention-guided attribution

Cite-While-You-Generate: Training-Free Evidence Attribution for Multimodal Clinical Summarization

Qianqi Yan, Huy Nguyen, Sumana Srivatsa, Hari Bandi, Xin Eric Wang, Krishnaram Kenthapadi
Published 2026-01-23 02:01

This paper presents a training-free framework for evidence attribution in multimodal clinical summarization, focusing on the transparency of generated content. The proposed methods utilize decoder attentions to cite supporting text spans or images, addressing limitations of previous approaches. Evaluations demonstrate that the framework outperforms existing baselines in attribution accuracy across clinician-patient dialogues and radiology reports.

Trustworthy clinical summarization requires not only fluent generation but also transparency about where each statement comes from. We propose a training-free framework for generation-time source attribution that leverages decoder attentions to directly cite supporting text spans or images, overcoming the limitations of post-hoc or retraining-based methods. We introduce two strategies for multimodal attribution: a raw image mode, which directly uses image patch attentions, and a caption-as-span mode, which substitutes images with generated captions to enable purely text-based alignment. Evaluations on two representative domains, clinician-patient dialogues (CliConSummation) and radiology reports (MIMIC-CXR), show that our approach consistently outperforms embedding-based and self-attribution baselines, improving both text-level and multimodal attribution accuracy (e.g., +15% F1 over embedding baselines). Caption-based attribution achieves competitive performance with raw-image attention while being more lightweight and practical. These findings highlight attention-guided attribution as a promising step toward interpretable and deployable clinical summarization systems.
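
As a rough illustration of the idea of citing sources from decoder attention, the sketch below assigns each generated token to the source span that receives the most attention mass. It is a minimal sketch and not the authors' implementation: the function name `attribute_to_spans`, the span ids, and the toy attention values are all hypothetical, and the attention matrix is assumed to be already averaged over heads and layers.

```python
from typing import Dict, List, Tuple


def attribute_to_spans(
    attention: List[List[float]],
    spans: Dict[str, Tuple[int, int]],
) -> List[str]:
    """Return, for each generated token, the id of the most-attended source span.

    attention[t][s] is the attention weight from generated token t to source
    position s (assumed head-averaged). spans maps a span id to a half-open
    [start, end) range of source positions.
    """
    citations = []
    for row in attention:
        # Sum the attention mass that falls inside each candidate span.
        mass = {sid: sum(row[start:end]) for sid, (start, end) in spans.items()}
        citations.append(max(mass, key=mass.get))
    return citations


# Toy example (hypothetical values): two generated tokens, five source positions.
attention = [
    [0.6, 0.2, 0.1, 0.1, 0.0],
    [0.1, 0.1, 0.1, 0.3, 0.4],
]
# In a caption-as-span setting, an image is represented by its caption tokens.
spans = {"dialogue_turn_1": (0, 3), "image_caption": (3, 5)}
citations = attribute_to_spans(attention, spans)
```

Summing attention favors longer spans; averaging per position is an equally plausible choice, and the abstract does not say which aggregation the paper uses.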

ARXIV Cancer: unknown Method: federated learning

FeTTL: Federated Template and Task Learning for Multi-Institutional Medical Imaging

Abhijeet Parida, Antonia Alomar, Zhifan Jiang, Pooneh Roshanitabrizi, Austin Tapp, Ziyue Xu, Syed Muhammad Anwar, Maria J. Ledesma-Carbayo, Holger R. Roth, Marius George Linguraru
Published 2026-01-22 20:14

This paper presents Federated Template and Task Learning (FeTTL), a framework aimed at improving model performance in federated learning settings for medical imaging. The method addresses challenges posed by domain shifts and data heterogeneity across different medical institutions. FeTTL was evaluated on tasks related to retinal fundus optic disc segmentation and histopathological metastasis classification, demonstrating significant performance improvements over existing federated learning approaches.

Federated learning enables collaborative model training across geographically distributed medical centers while preserving data privacy. However, domain shifts and heterogeneity in data often lead to a degradation in model performance. Medical imaging applications are particularly affected by variations in acquisition protocols, scanner types, and patient populations. To address these issues, we introduce Federated Template and Task Learning (FeTTL), a novel framework designed to harmonize multi-institutional medical imaging data in federated environments. FeTTL learns a global template together with a task model to align data distributions among clients. We evaluated FeTTL on two challenging and diverse multi-institutional medical imaging tasks: retinal fundus optic disc segmentation and histopathological metastasis classification. Experimental results show that FeTTL significantly outperforms the state-of-the-art federated learning baselines (p-values <0.002) for optic disc segmentation and classification of metastases from multi-institutional data. Our experiments further highlight the importance of jointly learning the template and the task. These findings suggest that FeTTL offers a principled and extensible solution for mitigating distribution shifts in federated learning, supporting robust model deployment in real-world, multi-institutional environments.
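
For readers unfamiliar with the federated setting, the sketch below shows plain federated averaging (FedAvg), the generic aggregation step that many federated baselines build on. It is not FeTTL itself: the abstract does not describe how the global template and task model are aggregated, so the function `federated_average`, the parameter name `"layer"`, and the toy values are assumptions made purely for illustration.

```python
from typing import Dict, List


def federated_average(
    client_weights: List[Dict[str, List[float]]],
    client_sizes: List[int],
) -> Dict[str, List[float]]:
    """Average client parameters, weighting each client by its sample count."""
    total = sum(client_sizes)
    averaged = {}
    for name in client_weights[0]:
        length = len(client_weights[0][name])
        averaged[name] = [
            sum(w[name][i] * n for w, n in zip(client_weights, client_sizes)) / total
            for i in range(length)
        ]
    return averaged


# Two hypothetical institutions; only parameters leave each site, never images.
clients = [{"layer": [1.0, 2.0]}, {"layer": [3.0, 6.0]}]
sizes = [100, 300]
global_weights = federated_average(clients, sizes)
```

Weighting by sample count means the larger institution dominates the average, which is one reason domain shift between sites can degrade the shared model.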

ARXIV Cancer: colon cancer Method: convolutional neural network

Phi-SegNet: Phase-Integrated Supervision for Medical Image Segmentation

Shams Nafisa Ali, Taufiq Hasan
Published 2026-01-22 16:00

The paper presents Phi-SegNet, a CNN-based architecture designed to enhance medical image segmentation by integrating phase-aware information. This approach addresses the limitations of existing models that primarily focus on spatial information by incorporating frequency-domain representations for improved object localization. Evaluated on five public datasets, the model achieved state-of-the-art performance, with average relative improvements of about 1.5% in intersection over union (IoU) and 1.0% in F1-score over the next best-performing model. The findings suggest that leveraging spectral priors can enhance segmentation frameworks across various imaging modalities.

Deep learning has substantially advanced medical image segmentation, yet achieving robust generalization across diverse imaging modalities and anatomical structures remains a major challenge. A key contributor to this limitation lies in how existing architectures, ranging from CNNs to Transformers and their hybrids, primarily encode spatial information while overlooking frequency-domain representations that capture rich structural and textural cues. Although a few recent studies have begun exploring spectral information at the feature level, supervision-level integration of frequency cues, which is crucial for fine-grained object localization, remains largely untapped. To this end, we propose Phi-SegNet, a CNN-based architecture that incorporates phase-aware information at both architectural and optimization levels. The network integrates Bi-Feature Mask Former (BFMF) modules that blend neighboring encoder features to reduce semantic gaps, and Reverse Fourier Attention (RFA) blocks that refine decoder outputs using phase-regularized features. A dedicated phase-aware loss aligns these features with structural priors, forming a closed feedback loop that emphasizes boundary precision. Evaluated on five public datasets spanning X-ray, US, histopathology, MRI, and colonoscopy, Phi-SegNet consistently achieved state-of-the-art performance, with an average relative improvement of 1.54 ± 1.26% in IoU and 0.98 ± 0.71% in F1-score over the next best-performing model. In cross-dataset generalization scenarios involving unseen datasets from the known domain, Phi-SegNet also exhibits robust and superior performance, highlighting its adaptability and modality-agnostic design. These findings demonstrate the potential of leveraging spectral priors in both feature representation and supervision, paving the way for generalized segmentation frameworks that excel in fine-grained object localization.
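
The reported gains are relative improvements in IoU and F1-score. The sketch below shows how these two metrics are conventionally computed for a binary mask and how a relative improvement is derived from two scores. The function names and the toy masks are assumptions for illustration; the paper's own evaluation code may handle averaging and edge cases differently.

```python
from typing import List, Tuple


def iou_and_f1(pred: List[List[int]], target: List[List[int]]) -> Tuple[float, float]:
    """Compute IoU and F1 (Dice) for two binary masks of equal shape."""
    tp = fp = fn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                tp += 1      # predicted foreground, truly foreground
            elif p:
                fp += 1      # predicted foreground, truly background
            elif t:
                fn += 1      # missed foreground pixel
    union = tp + fp + fn
    if union == 0:
        # Both masks are empty; treated here as a perfect match (a convention).
        return 1.0, 1.0
    return tp / union, 2 * tp / (2 * tp + fp + fn)


def relative_improvement(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


# Toy masks (hypothetical): 2 true positives, 1 false positive, 1 false negative.
pred = [[1, 1, 0], [0, 1, 0]]
target = [[1, 0, 0], [0, 1, 1]]
iou, f1 = iou_and_f1(pred, target)
```

Because a relative improvement is measured against the baseline score, a gain of 1.54% on an IoU near 0.8 corresponds to roughly 0.012 in absolute IoU.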

ARXIV Cancer: unknown Method: self-supervised learning

RadJEPA: Radiology Encoder for Chest X-Rays via Joint Embedding Predictive Architecture

Anas Anwarul Haq Khan, Mariam Husain, Kshitij Jadhav
Published 2026-01-22 12:11

This paper presents RadJEPA, a self-supervised framework designed for learning robust radiology encoders from chest X-ray images without relying on language supervision. The model utilizes a Joint Embedding Predictive Architecture to predict latent representations of masked image regions. Evaluation results demonstrate that RadJEPA outperforms existing state-of-the-art methods in various tasks, including disease classification and semantic segmentation.

Recent advances in medical vision-language models guide the learning of visual representations; however, this form of supervision is constrained by the availability of paired image-text data, raising the question of whether robust radiology encoders can be learned without relying on language supervision. In this work, we introduce RadJEPA, a self-supervised framework built on a Joint Embedding Predictive Architecture that learns without language supervision. Pre-trained solely on unlabeled chest X-ray images, the model learns to predict latent representations of masked image regions. This predictive objective differs fundamentally from both image-text pre-training and DINO-style self-distillation: rather than aligning global representations across views or modalities, RadJEPA explicitly models latent-space prediction. We evaluate the learned encoder on disease classification, semantic segmentation, and report generation tasks. Across benchmarks, RadJEPA achieves performance exceeding state-of-the-art approaches, including Rad-DINO.

ARXIV Cancer: general cancer Method: image-to-image translation

PMPBench: A Paired Multi-Modal Pan-Cancer Benchmark for Medical Image Synthesis

Yifan Chen, Fei Yin, Hao Chen, Jia Wu, Chao Li
Published 2026-01-22 11:58

This paper introduces PMPBench, a novel public dataset designed for paired multi-modal medical image synthesis across various cancer types. The dataset addresses limitations in existing resources by providing fully paired dynamic contrast-enhanced MR sequences and corresponding non-contrast and contrast-enhanced CT acquisitions. The goal is to facilitate AI-based image translation for synthesizing contrast-enhanced images from non-contrast scans, thereby improving clinical workflows in oncology. The authors also establish a benchmark for evaluating image-to-image translation methods using this dataset.

Contrast medium plays a pivotal role in radiological imaging, as it amplifies lesion conspicuity and improves detection for the diagnosis of tumor-related diseases. However, depending on the patient's health condition or the medical resources available, the use of contrast medium is not always feasible. Recent work has explored AI-based image translation to synthesize contrast-enhanced images directly from non-contrast scans, aiming to reduce side effects and streamline clinical workflows. Progress in this direction has been constrained by data limitations: (1) existing public datasets focus almost exclusively on brain-related paired MR modalities; (2) other collections include partially paired data but suffer from missing modalities/timestamps and imperfect spatial alignment; (3) explicit labeling of CT vs. CTC or DCE phases is often absent; (4) substantial resources remain private. To bridge this gap, we introduce the first public, fully paired, pan-cancer medical imaging dataset spanning 11 human organs. The MR data include complete dynamic contrast-enhanced (DCE) sequences covering all three phases (DCE1-DCE3), while the CT data provide paired non-contrast and contrast-enhanced acquisitions (CTC). The dataset is curated for anatomical correspondence, enabling rigorous evaluation of 1-to-1, N-to-1, and N-to-N translation settings (e.g., predicting DCE phases from non-contrast inputs). Built upon this resource, we establish a comprehensive benchmark. We report results from representative baselines of contemporary image-to-image translation. We release the dataset and benchmark to catalyze research on safe, effective contrast synthesis, with direct relevance to multi-organ oncology imaging workflows. Our code and dataset are publicly available at https://github.com/YifanChen02/PMPBench.

ARXIV Cancer: brain tumor Method: foundation models

Sub-Region-Aware Modality Fusion and Adaptive Prompting for Multi-Modal Brain Tumor Segmentation

Shadi Alijani, Fereshteh Aghaee Meibodi, Homayoun Najjaran
Published 2026-01-22 08:03

This paper presents a novel framework for adapting foundation models to multi-modal medical imaging, specifically for brain tumor segmentation. The framework incorporates sub-region-aware modality attention and adaptive prompt engineering to enhance segmentation accuracy. Validation on the BraTS 2020 dataset shows significant improvements over baseline methods, particularly in challenging tumor sub-regions.

The successful adaptation of foundation models to multi-modal medical imaging is a critical yet unresolved challenge. Existing models often struggle to effectively fuse information from multiple sources and adapt to the heterogeneous nature of pathological tissues. To address this, we introduce a novel framework for adapting foundation models to multi-modal medical imaging, featuring two key technical innovations: sub-region-aware modality attention and adaptive prompt engineering. The attention mechanism enables the model to learn the optimal combination of modalities for each tumor sub-region, while the adaptive prompting strategy leverages the inherent capabilities of foundation models to refine segmentation accuracy. We validate our framework on the BraTS 2020 brain tumor segmentation dataset, demonstrating that our approach significantly outperforms baseline methods, particularly in the challenging necrotic core sub-region. Our work provides a principled and effective approach to multi-modal fusion and prompting, paving the way for more accurate and robust foundation model-based solutions in medical imaging.

ARXIV Cancer: skin cancer Method: convolutional neural network

A Machine Vision Approach to Preliminary Skin Lesion Assessments

Ali Khreis, Ro'Yah Radaideh, Quinn McGill
Published 2026-01-21 23:48

This study focuses on the early detection of malignant skin lesions to improve patient outcomes in aggressive skin cancers. It evaluates a system that combines the ABCD rule of dermoscopy with machine learning classification, using a subset of the HAM10000 dataset. A custom three-layer Convolutional Neural Network (CNN) trained from scratch achieved 78.5% accuracy, a 19-point improvement over traditional methods. The findings suggest that direct pixel-level learning can capture diagnostic patterns more effectively than handcrafted features.

Early detection of malignant skin lesions is critical for improving patient outcomes in aggressive, metastatic skin cancers. This study evaluates a comprehensive system for preliminary skin lesion assessment that combines the clinically established ABCD rule of dermoscopy (analyzing Asymmetry, Borders, Color, and Dermoscopic Structures) with machine learning classification. Using a 1,000-image subset of the HAM10000 dataset, the system implements an automated, rule-based pipeline to compute a Total Dermoscopy Score (TDS) for each lesion. This handcrafted approach is compared against various machine learning solutions, including traditional classifiers (Logistic Regression, Random Forest, and SVM) and deep learning models. While the rule-based system provides high clinical interpretability, results indicate a performance bottleneck when reducing complex morphology to five numerical features. Experimental findings show that transfer learning with EfficientNet-B0 failed significantly due to domain shift between natural and medical images. In contrast, a custom three-layer Convolutional Neural Network (CNN) trained from scratch achieved 78.5% accuracy and 86.5% recall on median-filtered images, representing a 19-point accuracy improvement over traditional methods. The results demonstrate that direct pixel-level learning captures diagnostic patterns beyond handcrafted features and that purpose-built lightweight architectures can outperform large pretrained models for small, domain-specific medical datasets.
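
The Total Dermoscopy Score mentioned above is a weighted sum of the four ABCD criteria. The sketch below uses the weights and thresholds of the standard published ABCD rule of dermoscopy (1.3, 0.1, 0.5, 0.5; cut-offs at 4.75 and 5.45). The abstract does not state which weights or thresholds the authors' pipeline uses, so treat these values, the function names, and the example scores as assumptions.

```python
def total_dermoscopy_score(asymmetry: int, border: int, colors: int, structures: int) -> float:
    """Weighted ABCD sum using the standard rule's weights.

    Conventional ranges: asymmetry 0-2, border 0-8, colors 1-6, structures 1-5.
    """
    return 1.3 * asymmetry + 0.1 * border + 0.5 * colors + 0.5 * structures


def interpret_tds(tds: float) -> str:
    """Map a TDS value to the standard rule's three categories."""
    if tds < 4.75:
        return "benign"
    if tds <= 5.45:
        return "suspicious"
    return "highly suspicious"


# Hypothetical lesion: asymmetric in both axes, 5 abrupt border segments,
# 4 colors, 3 dermoscopic structures.
tds = total_dermoscopy_score(asymmetry=2, border=5, colors=4, structures=3)
label = interpret_tds(tds)
```

Reducing a lesion to a handful of integer scores is what makes the rule interpretable, and it is also the bottleneck the abstract points to when comparing against pixel-level learning.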

ARXIV Cancer: general cancer Method: multimodal learning

CURE: Curriculum-guided Multi-task Training for Reliable Anatomy Grounded Report Generation

Pablo Messina, Andrés Villa, Juan León Alcázar, Karen Sánchez, Carlos Hinojosa, Denis Parra, Álvaro Soto, Bernard Ghanem
Published 2026-01-21 19:19

The paper presents CURE, a curriculum-guided multi-task training framework aimed at improving the accuracy of visual grounding and the quality of radiology report generation. By fine-tuning a multimodal instructional model on various tasks, CURE enhances spatial and textual alignment without requiring additional data. The results indicate significant improvements in grounding accuracy and report quality, along with a reduction in hallucinations.

Medical vision-language models can automate the generation of radiology reports but struggle with accurate visual grounding and factual consistency. Existing models often misalign textual findings with visual evidence, leading to unreliable or weakly grounded predictions. We present CURE, an error-aware curriculum learning framework that improves grounding and report quality without any additional data. CURE fine-tunes a multimodal instructional model on phrase grounding, grounded report generation, and anatomy-grounded report generation using public datasets. The method dynamically adjusts sampling based on model performance, emphasizing harder samples to improve spatial and textual alignment. CURE improves grounding accuracy by +0.37 IoU, boosts report quality by +0.188 CXRFEScore, and reduces hallucinations by 18.6%. CURE is a data-efficient framework that enhances both grounding accuracy and report reliability. Code is available at https://github.com/PabloMessina/CURE and model weights at https://huggingface.co/pamessina/medgemma-4b-it-cure
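
One simple way to picture error-aware sampling is to draw training tasks with probability proportional to their current error. The sketch below does this at the task level. It is only an illustration under stated assumptions: the abstract does not specify CURE's actual schedule or whether it operates on tasks or individual samples, and the function `sampling_probabilities`, the `floor` parameter, and the scores are hypothetical.

```python
import random
from typing import Dict


def sampling_probabilities(task_scores: Dict[str, float], floor: float = 0.05) -> Dict[str, float]:
    """Turn per-task scores in [0, 1] into sampling probabilities.

    Lower-scoring (harder) tasks get more probability mass. The floor keeps
    already-solved tasks from disappearing from the curriculum entirely.
    """
    errors = {task: max(1.0 - score, floor) for task, score in task_scores.items()}
    total = sum(errors.values())
    return {task: err / total for task, err in errors.items()}


# Hypothetical validation scores for the three tasks named in the abstract.
scores = {
    "phrase_grounding": 0.8,
    "grounded_report": 0.6,
    "anatomy_grounded_report": 0.4,
}
probs = sampling_probabilities(scores)

# Draw the task for the next training batch according to those probabilities.
rng = random.Random(0)
next_task = rng.choices(list(probs), weights=list(probs.values()), k=1)[0]
```

Recomputing the probabilities after each evaluation round is what would make such a schedule dynamic: as a task improves, its share of training batches shrinks.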

ARXIV Cancer: unknown Method: Generative Adversarial Network

GeMM-GAN: A Multimodal Generative Model Conditioned on Histopathology Images and Clinical Descriptions for Gene Expression Profile Generation

Francesca Pia Panaccione, Carlo Sgaravatti, Pietro Pinoli
Published 2026-01-21 19:03

The paper presents GeMM-GAN, a novel Generative Adversarial Network designed to synthesize realistic gene expression profiles by integrating histopathology images and clinical metadata. The model employs a Transformer Encoder and a Cross Attention mechanism to generate biologically coherent profiles. Evaluation on the TCGA dataset shows that GeMM-GAN outperforms existing generative models, achieving over 11% improvement in accuracy for disease type prediction.

Biomedical research increasingly relies on integrating diverse data modalities, including gene expression profiles, medical images, and clinical metadata. While medical images and clinical metadata are routinely collected in clinical practice, gene expression data presents unique challenges for widespread research use, mainly due to stringent privacy regulations and costly laboratory experiments. To address these limitations, we present GeMM-GAN, a novel Generative Adversarial Network conditioned on histopathology tissue slides and clinical metadata, designed to synthesize realistic gene expression profiles. GeMM-GAN combines a Transformer Encoder for image patches with a final Cross Attention mechanism between patches and text tokens, producing a conditioning vector to guide a generative model in generating biologically coherent gene expression profiles. We evaluate our approach on the TCGA dataset and demonstrate that our framework outperforms standard generative models and generates more realistic and functionally meaningful gene expression profiles, improving accuracy on downstream disease type prediction by more than 11% compared to current state-of-the-art generative models. Code will be available at: https://github.com/francescapia/GeMM-GAN
