Research Papers

ARXIV Cancer: general cancer Method: self-supervised learning

LEMON: a foundation model for nuclear morphology in Computational Pathology

Loïc Chadoutaud, Alice Blondel, Hana Feki, Jacqueline Fontugne, Emmanuel Barillot, Thomas Walter
Published 2026-03-26 18:09

The paper introduces LEMON, a self-supervised foundation model designed for scalable single-cell image representation learning in computational pathology. It addresses the gap in representation learning at the single-cell level, which is crucial for understanding cell types and phenotypes. LEMON is trained on millions of cell images from various tissues and cancer types, demonstrating strong performance across multiple benchmark datasets.


Computational pathology relies on effective representation learning to support cancer research and precision medicine. Although self-supervised learning has driven major progress at the patch and whole-slide image levels, representation learning at the single-cell level remains comparatively underexplored, despite its importance for characterizing cell types and cellular phenotypes. We introduce LEMON (Learning Embeddings from Morphology Of Nuclei), a self-supervised foundation model for scalable single-cell image representation learning. Trained on millions of cell images from diverse tissues and cancer types, LEMON learns robust and versatile morphological representations that support large-scale single-cell analyses in pathology. We evaluate LEMON on five benchmark datasets across a range of prediction tasks and show that it provides strong performance, highlighting its potential as a new paradigm for cell-level computational pathology. Model weights are available at https://huggingface.co/aliceblondel/LEMON.

ARXIV Cancer: colon cancer Method: multimodal large language models

Colon-Bench: An Agentic Workflow for Scalable Dense Lesion Annotation in Full-Procedure Colonoscopy Videos

Abdullah Hamdi, Changchun Yang, Xin Gao
Published 2026-03-26 16:58

The paper presents Colon-Bench, a benchmark built with a multi-stage agentic workflow for annotating full-procedure colonoscopy videos, addressing the lack of densely annotated datasets for colon cancer screening. It comprises 528 videos spanning 14 lesion categories and is used to evaluate state-of-the-art Multimodal Large Language Models (MLLMs) on lesion classification, Open-Vocabulary Video Object Segmentation (OV-VOS), and video Visual Question Answering (VQA). The MLLMs show surprisingly high localization performance compared to SAM-3, and a new "colon-skill" prompting strategy improves zero-shot performance by up to 9.7% across most MLLMs.


Early screening via colonoscopy is critical for colon cancer prevention, yet developing robust AI systems for this domain is hindered by the lack of densely annotated, long-sequence video datasets. Existing datasets predominantly focus on single-class polyp detection and lack the rich spatial, temporal, and linguistic annotations required to evaluate modern Multimodal Large Language Models (MLLMs). To address this critical gap, we introduce Colon-Bench, generated via a novel multi-stage agentic workflow. Our pipeline seamlessly integrates temporal proposals, bounding-box tracking, AI-driven visual confirmation, and human-in-the-loop review to scalably annotate full-procedure videos. The resulting verified benchmark is unprecedented in scope, encompassing 528 videos, 14 distinct lesion categories (including polyps, ulcers, and bleeding), over 300,000 bounding boxes, 213,000 segmentation masks, and 133,000 words of clinical descriptions. We utilize Colon-Bench to rigorously evaluate state-of-the-art MLLMs across lesion classification, Open-Vocabulary Video Object Segmentation (OV-VOS), and video Visual Question Answering (VQA). The MLLM results demonstrate surprisingly high localization performance in medical domains compared to SAM-3. Finally, we analyze common VQA errors from MLLMs to introduce a novel "colon-skill" prompting strategy, improving zero-shot MLLM performance by up to 9.7% across most MLLMs. The dataset and the code are available at https://abdullahamdi.com/colon-bench.

ARXIV Cancer: colorectal cancer Method: dictionary-based hierarchical pathology mining

Dictionary-based Pathology Mining with Hard-instance-assisted Classifier Debiasing for Genetic Biomarker Prediction from WSIs

Ling Zhang, Boxiang Yun, Ting Jin, Qingli Li, Xinxing Li, Yan Wang
Published 2026-03-26 14:23

This study presents D2Bio, a framework for predicting genetic biomarkers from whole slide images (WSIs), specifically microsatellite instability in colorectal cancer. The framework addresses two challenges: constructing pathology-aware representations, and overfitting to the many WSI regions that are unrelated to the biomarker. On the TCGA-CRC-MSI cohort, the method improves AUROC by over 4% compared with the second-best method. The framework also shows potential for clinical interpretability and utility in survival analysis.


Prediction of genetic biomarkers, e.g., microsatellite instability in colorectal cancer, is crucial for clinical decision making. However, two primary challenges hamper accurate prediction: (1) It is difficult to construct a pathology-aware representation involving the complex interconnections among pathological components. (2) WSIs contain a large proportion of areas unrelated to genetic biomarkers, which makes the model easily overfit to simple but irrelevant instances. We hereby propose a Dictionary-based hierarchical pathology mining with hard-instance-assisted classifier Debiasing framework to address these challenges, dubbed D2Bio. Our first module, dictionary-based hierarchical pathology mining, mines diverse and very fine-grained pathological contextual interactions without any limit on the distance between patches. The second module, hard-instance-assisted classifier debiasing, learns a debiased classifier by focusing on hard but task-related features, without any additional annotations. Experimental results on five cohorts show the superiority of our method, with over 4% improvement in AUROC compared with the second best on the TCGA-CRC-MSI cohort. Our analysis further shows the clinical interpretability of D2Bio in genetic biomarker diagnosis and potential clinical utility in survival analysis. Code will be available at https://github.com/DeepMed-Lab-ECNU/D2Bio.

ARXIV Cancer: prostate cancer Method: Bayesian Gamma-power-mixture survival regression model

A Bayesian Gamma-power-mixture survival regression model: predicting the recurrence of prostate cancer post-prostatectomy

Tommy Walker Mackay, Mingtong Xu, Shahrokh F. Shariat, Roger Sewell
Published 2026-03-26 13:52

This study presents a Bayesian Gamma-power-mixture survival regression model to predict biochemical recurrence of prostate cancer following radical prostatectomy. Applied to a dataset of 423 patients, the model extracted 0.232 nats of apparent Shannon information (ASI) from age and pre-operative blood tests alone, more than double the value previously achieved on the same dataset with a log-skew-Student-mixture model. Replacing the blood-based biomarkers with pre-operative Gleason grades or MRI findings greatly reduced the ASI.


In a dataset of 423 patients who had undergone radical prostatectomy for localised prostate cancer, we estimated the apparent Shannon information (ASI) about time to biochemical recurrence in various subsets of the available pre-op variables using a Bayesian Gamma-power-mixture survival regression model. In all the subsets examined, the ASI was positive with posterior probability greater than 0.975. Using only age and results of pre-operative blood tests (PSA and biomarkers) we achieved 0.232 (0.180 to 0.290) nats ASI (0.335 (0.260 to 0.419) bits) (posterior mean and equitailed 95% posterior confidence intervals). This is more than double the mean posterior ASI previously achieved on the same dataset by a subset of the current authors using a log-skew-Student-mixture model, and is greater than that previous value with posterior probability greater than 0.99. Additionally using pre- or post-operative Gleason grades, operative findings, clinical stage, and presence or absence of extraprostatic extension or seminal vesicle invasion did not increase the ASI extracted. However, removing the blood-based biomarkers and replacing them with either pre-operative Gleason grades or findings available from MRI scanning greatly reduced the available ASI to respectively 0.077 (0.038 to 0.120) and 0.088 (0.045 to 0.132) nats (both less than the values using blood-based biomarkers with posterior probability greater than 0.995). A greedy approach to selection of the best biomarkers gave TGFbeta1, VCAM1, IL6sR, and uPA in descending order of importance from those examined.
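The abstract reports ASI in both nats and bits. As a quick check of the unit conversion (1 nat = 1/ln 2 bits), the short sketch below converts the reported posterior mean; the function name `nats_to_bits` is ours and not taken from the paper.

```python
import math


def nats_to_bits(nats: float) -> float:
    """Convert an information quantity from nats to bits (divide by ln 2)."""
    return nats / math.log(2)


# Posterior mean ASI quoted in the abstract, in nats.
asi_nats = 0.232
asi_bits = nats_to_bits(asi_nats)
print(f"{asi_nats} nats = {asi_bits:.3f} bits")
```

The converted mean rounds to 0.335 bits, matching the figure quoted in the abstract.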

ARXIV Cancer: unknown Method: self-supervised learning

Self-Supervised Learning for Knee Osteoarthritis: Diagnostic Limitations and Prognostic Value of Hospital Data

Haresh Rengaraj Rajamohan, Yuxuan Chen, Kyunghyun Cho, Cem M. Deniz
Published 2026-03-26 00:33

This study evaluates the effectiveness of self-supervised learning (SSL) in modeling knee osteoarthritis (OA) for diagnostic and prognostic purposes. The research compares image-only SSL and multimodal image-text SSL using hospital data. While SSL showed mixed results for diagnostic accuracy, it significantly improved prognostic modeling outcomes. The findings indicate that the alignment between pretraining data and the diagnostic task is crucial for performance.


This study assesses whether self-supervised learning (SSL) improves knee osteoarthritis (OA) modeling for diagnosis and prognosis relative to ImageNet-pretrained initialization. We compared (i) image-only SSL pretrained on knee radiographs from the OAI, MOST, and NYU cohorts, and (ii) multimodal image-text SSL pretrained on hospital knee radiographs paired with radiologist impressions. For diagnostic Kellgren-Lawrence (KL) grade prediction, SSL yielded mixed results. While image-only SSL improved accuracy during linear probing (frozen encoder), it did not outperform ImageNet pretraining during full fine-tuning. Similarly, multimodal SSL failed to improve grading performance. A likely explanation is mismatch between the hospital pretraining corpus and the downstream diagnostic task: the hospital image-text dataset was restricted to knees from patients with clinically identified OA in routine care, rather than a cohort spanning the full spectrum from normal to severe disease needed for balanced KL grading. In addition, radiology impressions do not explicitly encode KL grade, limiting supervision for learning KL-specific decision boundaries. In contrast, this same multimodal initialization significantly improved prognostic modeling. It outperformed ImageNet baselines in predicting 4-year structural incidence and progression, including on external validation (MOST AUROC: 0.701 vs. 0.599 at 10% labeled data). Overall, these results suggest that our hospital image-text data may be less effective for diagnostic grading when the pretraining cohort is limited to OA knees, but can provide a strong signal for prognostic modeling when the downstream task is better aligned with the pretraining data distribution.

ARXIV Cancer: pituitary tumor Method: self-supervised representation learning

SurgPhase: Time efficient pituitary tumor surgery phase recognition via an interactive web platform

Yan Meng, Jack Cook, X. Y. Han, Kaan Duman, Shauna Otto, Dhiraj Pangal, Jonathan Chainey, Ruth Lau, Margaux Masson-Forsythe, Daniel A. Donoho, Danielle Levy, Gabriel Zada, Sébastien Froelich, Juan Fernandez-Miranda, Mike Chang
Published 2026-03-26 00:22

This study presents a framework for recognizing surgical phases in pituitary tumor surgery videos, utilizing self-supervised representation learning and robust temporal modeling. The method achieves 90% accuracy on a test set, surpassing existing approaches and demonstrating strong generalization. A collaborative online platform is introduced to facilitate video uploads, automated phase analysis, and data sharing among surgeons, enhancing model improvement and data collection.


Accurate surgical phase recognition is essential for analyzing procedural workflows, supporting intraoperative decision-making, and enabling data-driven improvements in surgical education and performance evaluation. In this work, we present a comprehensive framework for phase recognition in pituitary tumor surgery (PTS) videos, combining self-supervised representation learning, robust temporal modeling, and scalable data annotation strategies. Our method achieves 90% accuracy on a held-out test set, outperforming current state-of-the-art approaches and demonstrating strong generalization across variable surgical cases. A central contribution of this work is the integration of a collaborative online platform designed for surgeons to upload surgical videos, receive automated phase analysis, and contribute to a growing dataset. This platform not only facilitates large-scale data collection but also fosters knowledge sharing and continuous model improvement. To address the challenge of limited labeled data, we pretrain a ResNet-50 model using the self-supervised framework on 251 unlabeled PTS videos, enabling the extraction of high-quality feature representations. Fine-tuning is performed on a labeled dataset of 81 procedures using a modified training regime that incorporates focal loss, gradual layer unfreezing, and dynamic sampling to address class imbalance and procedural variability.
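For readers unfamiliar with focal loss, the sketch below shows the standard per-sample form, FL = -alpha * (1 - p_t)^gamma * log(p_t), next to plain cross-entropy. The abstract does not state the authors' hyperparameters or implementation, so `gamma=2.0`, `alpha=1.0`, and the example probabilities are illustrative assumptions only.

```python
import math


def cross_entropy(probs, target):
    """Cross-entropy for one sample: -log of the true-class probability."""
    return -math.log(probs[target])


def focal_loss(probs, target, gamma=2.0, alpha=1.0):
    """Focal loss for one sample.

    probs  : predicted class probabilities (assumed to sum to 1)
    target : index of the true class
    gamma  : focusing parameter (assumed value); gamma = 0 recovers
             alpha-weighted cross-entropy
    """
    p_t = probs[target]
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


# Hypothetical predictions for a sample whose true class is index 0.
easy = [0.9, 0.05, 0.05]  # confidently correct
hard = [0.2, 0.5, 0.3]    # misclassified

for name, probs in (("easy", easy), ("hard", hard)):
    ce = cross_entropy(probs, 0)
    fl = focal_loss(probs, 0)
    print(f"{name}: cross-entropy={ce:.4f} focal={fl:.4f}")
```

The factor (1 - p_t)^gamma shrinks the loss of well-classified samples far more than that of misclassified ones, which is why focal loss is often used when a few classes (here, surgical phases) dominate the training data.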

ARXIV Cancer: brain tumors Method: multimodal large language models

NeuroVLM-Bench: Evaluation of Vision-Enabled Large Language Models for Clinical Reasoning in Neurological Disorders

Katarina Trojachanec Dineva, Stefan Andonov, Ilinka Ivanoska, Ivan Kitanovski, Sasho Gramatikov, Tamara Kostova, Monika Simjanoska Misheva, Kostadin Mishev
Published 2026-03-25 22:26

This study evaluates the performance of vision-enabled large language models in the context of neuroimaging for various neurological disorders. A comprehensive benchmarking approach is employed using MRI and CT datasets to assess diagnostic reasoning and classification tasks. The results indicate that while tumor classification is reliable, challenges remain in diagnosing multiple sclerosis and rare abnormalities. The study highlights the trade-offs between performance and efficiency in these models.


Recent advances in multimodal large language models enable new possibilities for image-based decision support. However, their reliability and operational trade-offs in neuroimaging remain insufficiently understood. We present a comprehensive benchmarking study of vision-enabled large language models for 2D neuroimaging using curated MRI and CT datasets covering multiple sclerosis, stroke, brain tumors, other abnormalities, and normal controls. Models are required to generate multiple outputs simultaneously, including diagnosis, diagnosis subtype, imaging modality, specialized sequence, and anatomical plane. Performance is evaluated across four directions: discriminative classification with abstention, calibration, structured-output validity, and computational efficiency. A multi-phase framework ensures fair comparison while controlling for selection bias. Across twenty frontier multimodal models, the results show that technical imaging attributes such as modality and plane are nearly solved, whereas diagnostic reasoning, especially subtype prediction, remains challenging. Tumor classification emerges as the most reliable task, stroke is moderately solvable, while multiple sclerosis and rare abnormalities remain difficult. Few-shot prompting improves performance for several models but increases token usage, latency, and cost. Gemini-2.5-Pro and GPT-5-Chat achieve the strongest overall diagnostic performance, while Gemini-2.5-Flash offers the best efficiency-performance trade-off. Among open-weight architectures, MedGemma-1.5-4B demonstrates the most promising results, as under few-shot prompting, it approaches the zero-shot performance of several proprietary models, while maintaining perfect structured output. These findings provide practical insights into performance, reliability, and efficiency trade-offs, supporting standardized evaluation of multimodal LLMs in neuroimaging.
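The abstract lists calibration as one evaluation direction but does not name the metric. One common choice is expected calibration error (ECE); the sketch below is a generic illustration of that metric, not the benchmark's own code, and the bin count and example values are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Expected calibration error with equal-width confidence bins.

    confidences : predicted confidence per sample, each in [0, 1]
    correct     : 1 if the prediction was right, else 0
    """
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Bins are (lo, hi]; a confidence of exactly 0 goes into the first bin.
        idx = [
            i for i, c in enumerate(confidences)
            if lo < c <= hi or (b == 0 and c == 0.0)
        ]
        if not idx:
            continue
        accuracy = sum(correct[i] for i in idx) / len(idx)
        mean_conf = sum(confidences[i] for i in idx) / len(idx)
        # Weight each bin's |accuracy - confidence| gap by its share of samples.
        ece += len(idx) / n * abs(accuracy - mean_conf)
    return ece


# Hypothetical model that reports 90% confidence but is right 3 times out of 4.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
print(f"ECE = {ece:.2f}")
```

A perfectly calibrated model has ECE 0; larger values mean stated confidence and observed accuracy diverge.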

ARXIV Cancer: brain tumor Method: deep learning

PhyDCM: A Reproducible Open-Source Framework for AI-Assisted Brain Tumor Classification from Multi-Sequence MRI

Hayder Saad Abdulbaqi, Mohammed Hadi Rahim, Mohammed Hassan Hadi, Haider Ali Aboud, Ali Hussein Allawi
Published 2026-03-25 20:58

PhyDCM is an open-source framework designed for AI-assisted classification of brain tumors using multi-sequence MRI data. It integrates a hybrid classification architecture based on MedViT and emphasizes reproducibility and modularity in its design. The framework achieves over 93% classification accuracy in experimental evaluations, demonstrating its effectiveness in automated brain tumor detection.


MRI-based medical imaging has become indispensable in modern clinical diagnosis, particularly for brain tumor detection. However, the rapid growth in data volume poses challenges for conventional diagnostic approaches. Although deep learning has shown strong performance in automated classification, many existing solutions are confined to closed technical architectures, limiting reproducibility and further academic development. PhyDCM is introduced as an open-source software framework that integrates a hybrid classification architecture based on MedViT with standardized DICOM processing and an interactive desktop visualization interface. The system is designed as a modular digital library that separates computational logic from the graphical interface, allowing independent modification and extension of components. Standardized preprocessing, including intensity rescaling and limited data augmentation, ensures consistency across varying MRI acquisition settings. Experimental evaluation on MRI datasets from BRISC2025 and curated Kaggle collections (FigShare, SARTAJ, and Br35H) demonstrates stable diagnostic performance, achieving over 93% classification accuracy across categories. The framework supports structured, exportable outputs and multi-planar reconstruction of volumetric data. By emphasizing transparency, modularity, and accessibility, PhyDCM provides a practical foundation for reproducible AI-driven medical image analysis, with flexibility for future integration of additional imaging modalities.
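The abstract mentions intensity rescaling as part of the standardized preprocessing but does not specify the scheme. A minimal sketch, assuming simple min-max rescaling to a fixed range (the function name and defaults are hypothetical and not taken from PhyDCM):

```python
def rescale_intensity(pixels, out_min=0.0, out_max=1.0):
    """Min-max rescale a flat sequence of pixel intensities to [out_min, out_max]."""
    lo, hi = min(pixels), max(pixels)
    if hi == lo:
        # Constant image: avoid division by zero and map everything to out_min.
        return [out_min for _ in pixels]
    span = out_max - out_min
    return [out_min + (p - lo) / (hi - lo) * span for p in pixels]


# Two hypothetical scans with different raw intensity ranges end up on one scale.
scan_a = [0, 500, 1000]
scan_b = [100, 2100, 4100]
print(rescale_intensity(scan_a))
print(rescale_intensity(scan_b))
```

Mapping every scan to the same range is one simple way to reduce the effect of acquisition-dependent intensity scales before classification.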

ARXIV Cancer: unknown Method: visual language model

MedOpenClaw and MedFlowBench: Auditing Medical Agents in Full-Study Workflows

Weixiang Shen, Chengzhi Shen, Yanzhu Hu, Che Liu, Junde Wu, Jiayuan Zhu, Xiao Han, Zongyue Li, Jingpei Wu, Min Xu, Daguang Xu, Yueming Jin, Benedikt Wiestler, Daniel Rueckert, Jiazhen Pan
Published 2026-03-25 17:33

This paper introduces MedFlowBench, a benchmark designed to evaluate visual language models (VLMs) in the context of complete medical imaging studies rather than isolated images. It highlights the importance of auditing evidence produced by these agents during complex workflows, revealing that performance significantly declines when agents are required to support their answers with structured evidence. The study emphasizes the challenges agents face in managing viewer states and verifying outputs across multiple steps.


Medical imaging benchmarks often evaluate VLMs on pre-selected 2D images, slices, crops, or patches, making evaluation closer to visual recognition. Real clinical workflows impose a different burden: readers must search through complete studies, operate imaging software, navigate across slices and magnifications, and document visual evidence that can be audited. We argue that this evidence-producing workflow is a critical missing evaluation axis for medical imaging agents. To study it, we introduce MedFlowBench, a full-study benchmark for VLM agents, together with MedOpenClaw, a controlled and replayable runtime in which agents operate medical imaging viewers such as 3D Slicer and QuPath. In each episode, an agent inspects a complete radiology study or whole-slide pathology image, returns a task answer, and submits structured evidence, including key slices, coordinates, regions of interest, or lesion-state fields. This evidence is automatically checked against withheld masks, annotations, and labels. Across evaluated models, final answer-only scoring gives an overly optimistic picture: when answers must also be supported by correct evidence, performance drops substantially on complex workflows. We further find that adding image-analysis tools does not by itself solve the problem. Tools help when they make a complex procedure simple and reliable, but agents still struggle when they must choose inputs, manage viewer state, and verify intermediate outputs over multiple steps. MedFlowBench exposes whether medical imaging agents can produce auditable evidence from complete studies, rather than plausible answers from selected images.
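To make the gap between answer-only and evidence-checked scoring concrete, here is a toy sketch. It assumes the evidence is a single bounding box compared with a withheld reference box by intersection over union (IoU); the field names, the 0.5 threshold, and the episodes are hypothetical and do not reflect MedFlowBench's actual schema or checks.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def score_episodes(episodes, iou_threshold=0.5):
    """Return (answer-only accuracy, evidence-checked accuracy)."""
    answer_hits = 0
    audited_hits = 0
    for ep in episodes:
        answer_ok = ep["answer"] == ep["label"]
        evidence_ok = box_iou(ep["roi"], ep["reference_roi"]) >= iou_threshold
        answer_hits += answer_ok
        # Credit is given only when the answer AND its evidence are both correct.
        audited_hits += answer_ok and evidence_ok
    n = len(episodes)
    return answer_hits / n, audited_hits / n


episodes = [
    # Correct answer, ROI overlaps the reference well.
    {"answer": "lesion", "label": "lesion",
     "roi": (10, 10, 50, 50), "reference_roi": (12, 12, 52, 52)},
    # Correct answer, but the ROI points at the wrong place.
    {"answer": "lesion", "label": "lesion",
     "roi": (200, 200, 240, 240), "reference_roi": (12, 12, 52, 52)},
    # Wrong answer.
    {"answer": "normal", "label": "lesion",
     "roi": (10, 10, 50, 50), "reference_roi": (10, 10, 50, 50)},
]

answer_only, audited = score_episodes(episodes)
print(f"answer-only: {answer_only:.2f}, evidence-checked: {audited:.2f}")
```

In this toy set, two of three answers are correct but only one is backed by a matching region, so the evidence-checked score is half the answer-only score.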

ARXIV Cancer: cervical cancer Method: ensemble model

Detection and Classification of (Pre)Cancerous Cells in Pap Smears: An Ensemble Strategy for the RIVA Cervical Cytology Challenge

Lautaro Kogan, María Victoria Ríos
Published 2026-03-24 22:03

This study focuses on the automated detection and classification of cervical cells in Pap smear images to enhance cervical cancer screening. The authors employed an ensemble strategy using YOLOv11m as the base architecture, incorporating loss reweighting, data resampling, and transfer learning to address challenges such as class imbalance and nuclear overlap. The ensemble improved mAP50-95 by 29% over the best individual model on the final test set.


Automated detection and classification of cervical cells in conventional Pap smear images can strengthen cervical cancer screening at scale by reducing manual workload, improving triage, and increasing consistency across readers. However, it is challenged by severe class imbalance and frequent nuclear overlap. We present our approach to the RIVA Cervical Cytology Challenge (ISBI 2026), which requires multi-class detection of eight Bethesda cell categories under these conditions. Using YOLOv11m as the base architecture, we systematically evaluate three strategies to improve detection performance: loss reweighting, data resampling and transfer learning. We build an ensemble by combining models trained under each strategy, promoting complementary detection behavior and combining them through Weighted Boxes Fusion (WBF). The ensemble achieves a mAP50-95 of 0.201 on the preliminary test set and 0.147 on the final test set, representing a 29% improvement over the best individual model on the final test set and demonstrating the effectiveness of combining complementary imbalance mitigation strategies.
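Weighted Boxes Fusion merges overlapping detections from several models into one box by confidence-weighted averaging of coordinates, instead of discarding all but the top box as non-maximum suppression does. The sketch below is a simplified single-class version for illustration; it is not the authors' pipeline, and the IoU threshold and example boxes are assumed values.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def fuse(members):
    """Confidence-weighted average box and mean score of one cluster."""
    total = sum(score for _, score in members)
    box = tuple(sum(b[k] * s for b, s in members) / total for k in range(4))
    return box, total / len(members)


def weighted_boxes_fusion(detections, n_models, iou_thr=0.55):
    """Simplified single-class WBF.

    detections : (box, score) pairs pooled from all models
    n_models   : number of models that contributed detections
    """
    clusters, fused = [], []
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        # Find the fused box that overlaps this detection the most.
        best, best_iou = None, iou_thr
        for i, (fused_box, _) in enumerate(fused):
            overlap = iou(box, fused_box)
            if overlap > best_iou:
                best, best_iou = i, overlap
        if best is None:
            clusters.append([(box, score)])
            fused.append((box, score))
        else:
            clusters[best].append((box, score))
            fused[best] = fuse(clusters[best])
    # Down-weight boxes that only a few of the models agreed on.
    return [
        (box, score * min(len(members), n_models) / n_models)
        for members, (box, score) in zip(clusters, fused)
    ]


# Hypothetical detections from two models for one class.
detections = [
    ((10, 10, 50, 50), 0.8),      # model A
    ((12, 12, 52, 52), 0.6),      # model B, same object
    ((100, 100, 120, 120), 0.5),  # model A only
]
results = weighted_boxes_fusion(detections, n_models=2)
for box, score in results:
    print([round(v, 2) for v in box], round(score, 3))
```

The two overlapping boxes are merged into one box that sits closer to the higher-confidence detection, while the box seen by only one of the two models keeps its coordinates but has its confidence halved.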
