Research Papers

ARXIV Cancer: general cancer Method: LLM-Guided Diagnostic Evidence Alignment

LLM-Guided Diagnostic Evidence Alignment for Medical Vision-Language Pretraining under Limited Pairing

Huimin Yan, Liang Bai, Xian Yang, Long Chen
Published 2026-02-07 13:29

This paper presents the LLM-Guided Diagnostic Evidence Alignment (LGDEA) method, which aims to improve medical vision-language pretraining by focusing on evidence-level alignment rather than traditional global or local alignment. The approach utilizes large language models to extract key diagnostic evidence from radiology reports, facilitating effective cross-modal alignment with unpaired medical images and reports. Experimental results indicate that LGDEA significantly enhances performance in tasks such as phrase grounding, image-text retrieval, and zero-shot classification.

Most existing CLIP-style medical vision-language pretraining methods rely on global or local alignment with substantial paired data. However, global alignment is easily dominated by non-diagnostic information, while local alignment fails to integrate key diagnostic evidence. As a result, learning reliable diagnostic representations becomes difficult, which limits their applicability in medical scenarios with limited paired data. To address this issue, we propose an LLM-Guided Diagnostic Evidence Alignment method (LGDEA), which shifts the pretraining objective toward evidence-level alignment that is more consistent with the medical diagnostic process. Specifically, we leverage LLMs to extract key diagnostic evidence from radiology reports and construct a shared diagnostic evidence space, enabling evidence-aware cross-modal alignment and allowing LGDEA to effectively exploit abundant unpaired medical images and reports, thereby substantially alleviating the reliance on paired data. Extensive experimental results demonstrate that our method achieves consistent and significant improvements on phrase grounding, image-text retrieval, and zero-shot classification, and even rivals pretraining methods that rely on substantial paired data.

ARXIV Cancer: unknown Method: multimodal learning

MedMO: Grounding and Understanding Multimodal Large Language Model for Medical Images

Ankan Deria, Komal Kumar, Adinath Madhavrao Dukre, Eran Segal, Salman Khan, Imran Razzak
Published 2026-02-06 18:59

This paper presents MedMO, a medical multimodal foundation model designed to broaden the use of large language models in medicine. The model uses a multi-stage training recipe that combines cross-modal pretraining, instruction tuning, and reinforcement learning with verifiable rewards to improve alignment, grounding, and reasoning across medical tasks. Compared with Fleming-VL-8B, MedMO-8B-Next improves by 6.6% on average on VQA benchmarks, by 14.4% on text-based question answering, and by 6.7% on MIMIC-CXR report generation.

Multimodal large language models have advanced rapidly, but their adoption in medicine is constrained by limited domain coverage, imperfect modality alignment, and insufficient grounded reasoning. We introduce MedMO, a medical multimodal foundation model built on a general MLLM architecture and trained exclusively on large-scale domain-specific data. MedMO uses a multi-stage training recipe that includes cross-modal pretraining to align heterogeneous visual encoders with a medical language backbone, instruction tuning with multi-task supervision spanning captioning, VQA, report generation, retrieval, and bounding-box disease localization, and reinforcement learning with verifiable rewards that combine factuality checks with a box-level GIoU signal to improve spatial grounding and step-by-step reasoning in challenging clinical settings. Across modalities and tasks, MedMO surpasses strong open-source medical baselines. MedMO-8B-Next achieves consistent gains on VQA benchmarks, improving by 6.6% on average over Fleming-VL-8B, including gains of 6.0% on MMMU-Med, 9.8% on PMC-VQA, and 21.3% on MedXpertQA. On text-based QA, it improves by 14.4% over Fleming-VL-8B, driven by gains of 8.4% on MMLU-Med and 30.1% on MedQA. For medical report generation, it improves by 6.7% on MIMIC-CXR. MedMO-8B-Next also demonstrates strong grounding performance, reaching 56.1 IoU on Bacteria, which is a 47.8 IoU gain over Fleming-VL-8B. At smaller scale, MedMO-4B-Next remains competitive and exceeds Fleming-VL-8B across VQA, QA, and report generation. Evaluations spanning radiology, ophthalmology, and pathology microscopy further confirm broad cross-modality generalization. Project is available at https://genmilab.github.io/MedMO-Page
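
The abstract mentions a box-level GIoU signal as part of the reinforcement-learning reward. As a minimal sketch of that quantity only, the function below computes the standard generalized IoU of two axis-aligned boxes. The `(x1, y1, x2, y2)` box format and the function name `giou` are assumptions for illustration; how MedMO combines this signal with its factuality checks is not described here and is not reproduced.

```python
def giou(box_a, box_b):
    """Generalized IoU of two axis-aligned boxes given as (x1, y1, x2, y2).

    The box format is an assumption for this sketch, not taken from MedMO.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)

    # Intersection rectangle; width or height is clamped to zero if the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = area_a + area_b - inter

    # Smallest axis-aligned box that encloses both inputs.
    enclose = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))

    if union <= 0 or enclose <= 0:
        return 0.0  # degenerate boxes: a convention chosen for this sketch

    iou = inter / union
    # GIoU subtracts the share of the enclosing box not covered by the union.
    return iou - (enclose - union) / enclose
```

Unlike plain IoU, which is 0 for every pair of non-overlapping boxes, GIoU keeps decreasing toward -1 as disjoint boxes move apart, so a prediction that misses the target still receives a graded signal.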

ARXIV Cancer: breast cancer Method: interpretable classification algorithm

SR4-Fit: An Interpretable and Informative Classification Algorithm Applied to Prediction of U.S. House of Representatives Elections

Shyam Sundar Murali Krishnan, Dean Frederick Hougen
Published 2026-02-05 22:18

The paper presents Sparse Relaxed Regularized Regression Rule-Fit (SR4-Fit), an interpretable classification algorithm designed to improve upon traditional rule-based algorithms and black-box models in predictive performance. The authors demonstrate its application in predicting U.S. House election outcomes using demographic data, achieving high accuracy and interpretability. Additionally, SR4-Fit is validated on various datasets, including breast cancer, showing consistent performance across different domains.

The growth of machine learning demands interpretable models for critical applications, yet most high-performing models are "black-box" systems that obscure input-output relationships, while traditional rule-based algorithms like RuleFit suffer from a lack of predictive power and instability despite their simplicity. This motivated our development of Sparse Relaxed Regularized Regression Rule-Fit (SR4-Fit), a novel interpretable classification algorithm that addresses these limitations while maintaining superior classification performance. Using demographic characteristics of U.S. congressional districts from the Census Bureau's American Community Survey, we demonstrate that SR4-Fit can predict House election party outcomes with unprecedented accuracy and interpretability. Our results show that while the majority party remains the strongest predictor, SR4-Fit has revealed intrinsic combinations of demographic factors that affect prediction outcomes and that could not be interpreted in black-box algorithms such as random forests. The SR4-Fit algorithm surpasses both black-box models and existing interpretable rule-based algorithms such as RuleFit with respect to accuracy, simplicity, and robustness, generating stable and interpretable rule sets while maintaining superior predictive performance, thus addressing the traditional trade-off between model interpretability and predictive capability in electoral forecasting. To further validate SR4-Fit's performance, we also apply it to six additional publicly available classification datasets, namely the breast cancer, Ecoli, page blocks, Pima Indians, vehicle, and yeast datasets, and find similar results.

ARXIV Cancer: endometrial cancer Method: residual variational autoencoder

Unsupervised Anomaly Detection of Diseases in the Female Pelvis for Real-Time MR Imaging

Anika Knupfer, Johanna P. Müller, Jordina A. Verdera, Martin Fenske, Claudius S. Mathy, Smiti Tripathy, Sebastian Arndt, Matthias May, Michael Uder, Matthias W. Beckmann, Stefanie Burghaus, Jana Hutter
Published 2026-02-05 20:33

This study presents a benchmark framework for unsupervised anomaly detection in pelvic MRI, aimed at addressing the challenges of high anatomical variability and the need for real-time compatibility. The method employs a residual variational autoencoder trained on healthy scans to model normal pelvic anatomy, allowing for the detection of pathological regions through reconstruction error heatmaps. On the public Uterine Myoma MRI Dataset, the framework reaches an average AUC of 0.736; a separate inter-observer clinical evaluation extends the analysis to endometrial cancer, endometriosis, and adenomyosis.

Pelvic diseases in women of reproductive age represent a major global health burden, with diagnosis frequently delayed due to high anatomical variability, complicating MRI interpretation. Existing AI approaches are largely disease-specific and lack real-time compatibility, limiting generalizability and clinical integration. To address these challenges, we establish a benchmark framework for disease- and parameter-agnostic, real-time-compatible unsupervised anomaly detection in pelvic MRI. The method uses a residual variational autoencoder trained exclusively on healthy sagittal T2-weighted scans acquired across diverse imaging protocols to model normal pelvic anatomy. During inference, reconstruction error heatmaps indicate deviations from learned healthy structure, enabling detection of pathological regions without labeled abnormal data. The model is trained on 294 healthy scans and augmented with diffusion-generated synthetic data to improve robustness. Quantitative evaluation on the publicly available Uterine Myoma MRI Dataset yields an average area-under-the-curve (AUC) value of 0.736, with 0.828 sensitivity and 0.692 specificity. Additional inter-observer clinical evaluation extends analysis to endometrial cancer, endometriosis, and adenomyosis, revealing the influence of anatomical heterogeneity and inter-observer variability on performance interpretation. With a reconstruction rate of approximately 92.6 frames per second, the proposed framework establishes a baseline for unsupervised anomaly detection in the female pelvis and supports future integration into real-time MRI. Code is available upon request (https://github.com/AniKnu/UADPelvis); prospective data sets are available for academic collaboration.
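
To make the inference step concrete, here is a minimal sketch of a reconstruction error heatmap. The residual variational autoencoder itself is not shown: `reconstruction` is a stand-in for its output. The per-pixel absolute difference, the fixed threshold, the mean-error slice score, and all function names are assumptions for illustration, since the abstract does not specify the error measure.

```python
def error_heatmap(image, reconstruction):
    """Per-pixel absolute reconstruction error for two equally sized 2D lists."""
    return [
        [abs(pixel - recon) for pixel, recon in zip(image_row, recon_row)]
        for image_row, recon_row in zip(image, reconstruction)
    ]


def anomaly_mask(heatmap, threshold):
    """Mark pixels whose error exceeds a (hypothetical) threshold as anomalous."""
    return [[1 if error > threshold else 0 for error in row] for row in heatmap]


def slice_score(heatmap):
    """Mean error over the slice, usable as a single anomaly score."""
    values = [error for row in heatmap for error in row]
    return sum(values) / len(values)


# Toy 2x2 slice: the model reproduces three pixels but not the bright one.
image = [[0.0, 0.25], [1.0, 0.25]]
reconstruction = [[0.0, 0.25], [0.25, 0.25]]

heatmap = error_heatmap(image, reconstruction)
mask = anomaly_mask(heatmap, threshold=0.5)
score = slice_score(heatmap)
```

Because the model is trained only on healthy anatomy, regions it cannot reproduce show up as high-error pixels; sweeping the threshold on a score like this is what produces the sensitivity and specificity trade-off summarized by an AUC.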

ARXIV Cancer: ameloblastoma Method: multimodal deep learning

A Unified Multimodal Framework for Dataset Construction and Model-Based Diagnosis of Ameloblastoma

Ajo Babu George, Anna Mariam John, Athul Anoop, Balu Bhasuran
Published 2026-02-05 10:15

This study presents a newly curated multimodal dataset focused on ameloblastoma, integrating radiological, histopathological, and intraoral clinical images with structured data derived from case reports. A multimodal deep learning model was developed to classify ameloblastoma variants and assess recurrence risk; variant classification accuracy rose from 46.2% to 65.9%, and the abnormal tissue detection F1-score rose from 43.0% to 90.3%. The model incorporates clinical inputs to enhance personalized inference and supports surgical planning.

Artificial intelligence (AI)-enabled diagnostics in maxillofacial pathology require structured, high-quality multimodal datasets. However, existing resources provide limited ameloblastoma coverage and lack the format consistency needed for direct model training. We present a newly curated multimodal dataset specifically focused on ameloblastoma, integrating annotated radiological, histopathological, and intraoral clinical images with structured data derived from case reports. Natural language processing techniques were employed to extract clinically relevant features from textual reports, while image data underwent domain specific preprocessing and augmentation. Using this dataset, a multimodal deep learning model was developed to classify ameloblastoma variants, assess behavioral patterns such as recurrence risk, and support surgical planning. The model is designed to accept clinical inputs such as presenting complaint, age, and gender during deployment to enhance personalized inference. Quantitative evaluation demonstrated substantial improvements; variant classification accuracy increased from 46.2 percent to 65.9 percent, and abnormal tissue detection F1-score improved from 43.0 percent to 90.3 percent. Benchmarked against resources like MultiCaRe, this work advances patient-specific decision support by providing both a robust dataset and an adaptable multimodal AI framework.

ARXIV Cancer: cervical cancer Method: attention-based multiple instance learning

CLEAR-HPV: Interpretable Concept Discovery for HPV-Associated Morphology in Whole-Slide Histology

Weiyi Qin, Yingci Liu-Swetz, Shiwei Tan, Hao Wang
Published 2026-02-04 23:18

The study presents CLEAR-HPV, a framework designed to enhance the interpretability of attention-based multiple instance learning (MIL) for HPV-associated histopathology. By restructuring the MIL latent space, CLEAR-HPV enables the automatic discovery of morphologic concepts without requiring concept labels during training. The framework reduces high-dimensional feature representations (e.g., 1536 dimensions) to 10 interpretable concepts while preserving predictive information across TCGA-HNSCC, TCGA-CESC, and CPTAC-HNSCC.

Human papillomavirus (HPV) status is a critical determinant of prognosis and treatment response in head and neck and cervical cancers. Although attention-based multiple instance learning (MIL) achieves strong slide-level prediction for HPV-related whole-slide histopathology, it provides limited morphologic interpretability. To address this limitation, we introduce Concept-Level Explainable Attention-guided Representation for HPV (CLEAR-HPV), a framework that restructures the MIL latent space using attention to enable concept discovery without requiring concept labels during training. Operating in an attention-weighted latent space, CLEAR-HPV automatically discovers keratinizing, basaloid, and stromal morphologic concepts, generates spatial concept maps, and represents each slide using a compact concept-fraction vector. CLEAR-HPV's concept-fraction vectors preserve the predictive information of the original MIL embeddings while reducing the high-dimensional feature space (e.g., 1536 dimensions) to only 10 interpretable concepts. CLEAR-HPV generalizes consistently across TCGA-HNSCC, TCGA-CESC, and CPTAC-HNSCC, providing compact, concept-level interpretability through a general, backbone-agnostic framework for attention-based MIL models of whole-slide histopathology.
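
As a rough illustration of what a concept-fraction vector could look like, the sketch below turns per-patch concept assignments into one fixed-length vector per slide. It assumes each patch has already been given a single hard concept index and that fractions are unweighted counts; the abstract does not say whether CLEAR-HPV uses hard or attention-weighted assignments, and `concept_fractions` is a hypothetical name.

```python
from collections import Counter


def concept_fractions(patch_concepts, num_concepts):
    """Fraction of a slide's patches assigned to each concept index.

    patch_concepts: one integer concept index per patch (hard assignment,
                    an assumption for this sketch).
    num_concepts:   length of the output vector (10 in the abstract).
    """
    total = len(patch_concepts)
    if total == 0:
        return [0.0] * num_concepts
    counts = Counter(patch_concepts)
    return [counts.get(k, 0) / total for k in range(num_concepts)]


# Toy slide with four patches and four possible concepts.
slide_vector = concept_fractions([0, 0, 1, 3], num_concepts=4)
# slide_vector == [0.5, 0.25, 0.0, 0.25]
```

Whatever the exact assignment rule, the appeal of such a vector is that each entry can be read directly as the share of tissue showing one morphologic concept, in contrast to an opaque 1536-dimensional embedding.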

ARXIV Cancer: colorectal cancer Method: deep learning

XtraLight-MedMamba for Classification of Neoplastic Tubular Adenomas

Aqsa Sultana, Rayan Afsar, Ahmed Rahu, Surendra P. Singh, Brian Shula, Brandon Combs, Derrick Forchetti, Vijayan K. Asari
Published 2026-02-04 18:07

This study presents XtraLight-MedMamba, a deep learning framework designed for the classification of neoplastic tubular adenomas from whole-slide images. The model integrates a ConvNeXt-based feature extractor with vision mamba blocks and a Spatial and Channel Attention Bridge to enhance feature extraction. Evaluated on a dataset of low-grade tubular adenomas, the model achieved an accuracy of 97.18% and an F1-score of 0.9767, demonstrating superior performance compared to existing architectures.

Accurate risk stratification of precancerous polyps during routine colonoscopy screening is a key strategy to reduce the incidence of colorectal cancer (CRC). However, assessment of low-grade dysplasia remains limited by subjective histopathologic interpretation. Advances in computational pathology and deep learning offer new opportunities to identify subtle, fine morphologic patterns associated with malignant progression that may be imperceptible to the human eye. In this work, we propose XtraLight-MedMamba, an ultra-lightweight state-space-based deep learning framework to classify neoplastic tubular adenomas from whole-slide images (WSIs). The architecture is a blend of a ConvNeXt-based shallow feature extractor with parallel vision mamba blocks to efficiently model local texture cues within global contextual structure. An integration of the Spatial and Channel Attention Bridge (SCAB) module enhances multiscale feature extraction, while the Fixed Non-Negative Orthogonal Classifier (FNOClassifier) enables substantial parameter reduction and improved generalization. The model was evaluated on a curated dataset acquired from patients with low-grade tubular adenomas, stratified into case and control cohorts based on subsequent CRC development. XtraLight-MedMamba achieved an accuracy of 97.18% and an F1-score of 0.9767 using approximately 32,000 parameters, outperforming transformer-based and conventional Mamba architectures, which have significantly higher model complexity and computational burden, making it suitable for resource-constrained areas.

ARXIV Cancer: unknown Method: self-supervised learning

OmniRad: A Radiological Foundation Model for Multi-Task Medical Image Analysis

Luca Zedda, Andrea Loddo, Cecilia Di Ruberto
Published 2026-02-04 13:38

This paper presents OmniRad, a self-supervised radiological foundation model designed for multi-task medical image analysis. Pretrained on 1.2 million medical images, it emphasizes representation reuse and cross-task transferability. The model is evaluated across various benchmarks for classification and segmentation, demonstrating improved performance over existing foundation models.

Radiological analysis increasingly benefits from pretrained visual representations that can support heterogeneous downstream tasks across imaging modalities. In this work, we introduce OmniRad, a self-supervised radiological foundation model pretrained on 1.2 million medical images, designed with radiology-inspired principles emphasizing representation reuse and cross-task transferability. We evaluate the pretrained encoder under multiple downstream adaptation regimes, including lightweight task-specific adapters with a frozen backbone as well as full end-to-end fine-tuning for classification, allowing us to assess both representation quality and task-specific performance. OmniRad is evaluated on a broad suite of public benchmarks spanning classification and segmentation across multiple modalities. On the MedMNISTv2 collection, OmniRad improves classification F1 by up to 2.05% over competing foundation models. For dense prediction, OmniRad attains mean Dice score improvements across six MedSegBench datasets when using frozen representations. Qualitative analyses and latent-space visualizations suggest improved feature clustering and modality-related separation.

ARXIV Cancer: general cancer Method: federated learning

Med-MMFL: A Multimodal Federated Learning Benchmark in Healthcare

Aavash Chhetri, Bibek Niroula, Pratik Shrestha, Yash Raj Shrestha, Lesley A Anderson, Prashnna K Gyawali, Loris Bazzani, Binod Bhattarai
Published 2026-02-04 10:50

This paper introduces Med-MMFL, a comprehensive benchmark for multimodal federated learning (FL) in the medical domain. It addresses the scarcity of existing medical FL benchmarks by evaluating six state-of-the-art FL algorithms across diverse modalities and tasks. The benchmark includes various medical data types and aims to facilitate reproducibility and fair comparison of future MMFL methods.

Federated learning (FL) enables collaborative model training across decentralized medical institutions while preserving data privacy. However, medical FL benchmarks remain scarce, with existing efforts focusing mainly on unimodal or bimodal modalities and a limited range of medical tasks. This gap underscores the need for standardized evaluation to advance systematic understanding in medical MultiModal FL (MMFL). To this end, we introduce Med-MMFL, the first comprehensive MMFL benchmark for the medical domain, encompassing diverse modalities, tasks, and federation scenarios. Our benchmark evaluates six representative state-of-the-art FL algorithms, covering different aggregation strategies, loss formulations, and regularization techniques. It spans datasets with 2 to 4 modalities, comprising a total of 10 unique medical modalities, including text, pathology images, ECG, X-ray, radiology reports, and multiple MRI sequences. Experiments are conducted across naturally federated, synthetic IID, and synthetic non-IID settings to simulate real-world heterogeneity. We assess segmentation, classification, modality alignment (retrieval), and VQA tasks. To support reproducibility and fair comparison of future multimodal federated learning (MMFL) methods under realistic medical settings, we release the complete benchmark implementation, including data processing and partitioning pipelines, at https://github.com/bhattarailab/Med-MMFL-Benchmark .
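
The difference between synthetic IID and synthetic non-IID settings can be sketched with two toy partitioning functions. This is not the benchmark's partitioning pipeline, which lives in the linked repository and is not reproduced here. Sorting samples by label and cutting contiguous shards is one common way to synthesize label skew and is used purely as an assumption; both function names are hypothetical.

```python
import random


def iid_partition(indices, num_clients, seed=0):
    """Shuffle sample indices and deal them round-robin, so every client
    receives a similar mix of samples."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[client::num_clients] for client in range(num_clients)]


def label_skew_partition(labels, num_clients):
    """Sort sample indices by label and cut contiguous shards, so each client
    sees only a narrow range of classes (a simple non-IID setting)."""
    order = sorted(range(len(labels)), key=lambda i: labels[i])
    size = len(order) // num_clients
    parts = [order[c * size:(c + 1) * size] for c in range(num_clients)]
    # Any leftover samples go to the last client.
    parts[-1].extend(order[num_clients * size:])
    return parts


labels = [0, 1, 0, 1, 0, 1, 0, 1]
iid_parts = iid_partition(range(len(labels)), num_clients=2)
skewed_parts = label_skew_partition(labels, num_clients=2)
# skewed_parts == [[0, 2, 4, 6], [1, 3, 5, 7]]: client 0 holds only label 0.
```

In the skewed split each client's local data covers a single class, which is the kind of heterogeneity that stresses aggregation strategies and regularization techniques in federated training.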

ARXIV Cancer: colorectal cancer Method: convolutional neural network

Enabling Real-Time Colonoscopic Polyp Segmentation on Commodity CPUs via Ultra-Lightweight Architecture

Weihao Gao, Zhuo Deng, Zheng Gong, Lan Ma
Published 2026-02-04 10:04

This paper presents the UltraSeg family of models designed for real-time polyp segmentation in colonoscopy, addressing the limitations of existing high-precision models that require GPUs. The UltraSeg models operate with extreme compression, achieving high performance on commodity CPUs while maintaining accuracy comparable to larger models. Evaluated on multiple datasets, these models provide a clinically viable solution for resource-constrained environments, facilitating early detection of colorectal cancer.

Early detection of colorectal cancer hinges on real-time, accurate polyp identification and resection. Yet current high-precision segmentation models rely on GPUs, making them impractical to deploy in primary hospitals, mobile endoscopy units, or capsule robots. To bridge this gap, we present the UltraSeg family, operating in an extreme-compression regime (<0.3 M parameters). UltraSeg-108K (0.108 M parameters) is optimized for single-center data, while UltraSeg-130K (0.13 M parameters) generalizes to multi-center, multi-modal images. By jointly optimizing encoder-decoder widths, incorporating constrained dilated convolutions to enlarge receptive fields, and integrating a cross-layer lightweight fusion module, the models achieve 90 FPS on a single CPU core without sacrificing accuracy. Evaluated on seven public datasets, UltraSeg retains >94% of the Dice score of a 31 M-parameter U-Net while utilizing only 0.4% of its parameters, establishing a strong, clinically viable baseline for the extreme-compression domain and offering an immediately deployable solution for resource-constrained settings. This work provides not only a CPU-native solution for colonoscopy but also a reproducible blueprint for broader minimally invasive surgical vision applications. Source code is publicly available to ensure reproducibility and facilitate future benchmarking.
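
The Dice score used as the accuracy measure above can be sketched for binary masks as follows. The 2D-list mask format, the function name, and the convention of returning 1.0 when both masks are empty are assumptions for illustration, not details taken from the UltraSeg code.

```python
def dice_score(pred, target):
    """Dice coefficient for two binary masks given as equally sized 2D lists of 0/1."""
    intersection = 0
    pred_sum = 0
    target_sum = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            intersection += p * t
            pred_sum += p
            target_sum += t
    if pred_sum + target_sum == 0:
        return 1.0  # both masks empty: treated as perfect agreement (a convention)
    return 2.0 * intersection / (pred_sum + target_sum)


# Toy example: the prediction covers the true polyp pixel plus one extra pixel.
pred = [[1, 1], [0, 0]]
target = [[1, 0], [0, 0]]
score = dice_score(pred, target)  # 2 * 1 / (2 + 1), roughly 0.667
```

On the parameter budget, 0.13 M parameters out of 31 M is about 0.42% and 0.108 M out of 31 M is about 0.35%, which matches the roughly 0.4% figure quoted in the abstract.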
