Research Papers

ARXIV Cancer: brain tumor Method: ensemble model

Ensemble Learning with Sparse Hypercolumns

Julia Dietlmeier, Vayangi Ganepola, Oluwabukola G. Adegboro, Mayug Maniparambil, Claudia Mazo, Noel E. O'Connor
Published 2026-03-06 08:42

This study explores the use of ensemble learning with sparse hypercolumns derived from convolutional neural networks for image segmentation in brain tumor datasets. The authors address the computational challenges associated with dense hypercolumns by implementing stratified subsampling. Their experiments demonstrate that ensemble methods can achieve competitive performance, particularly highlighting the effectiveness of Logistic Regression in low-shot scenarios. The results indicate a significant improvement over traditional methods, suggesting the potential of this approach in cancer diagnostics.

Read abstract

Directly inspired by findings in biological vision, high-dimensional hypercolumns are feature vectors built by concatenating multi-scale activations of convolutional neural networks for a single image pixel location. Together with powerful classifiers, they can be used for image segmentation i.e. pixel classification. However, in practice, there are only very few works dedicated to the use of hypercolumns. One reason is the computational complexity of processing concatenated dense hypercolumns that grows linearly with the size $N$ of the training set. In this work, we address this challenge by applying stratified subsampling to the VGG16 based hypercolumns. Furthermore, we investigate the performance of ensemble learning on sparse hypercolumns. Our experiments on a brain tumor dataset show that stacking and voting ensembles deliver competitive performance, but in the extreme low-shot case of $N \leq 20$, a simple Logistic Regression classifier is the most effective method. For 10% stratified subsampling rate, our best average Dice score is 0.66 for $N=20$. This is a statistically significant improvement of 24.53% over the standard multi-scale UNet baseline ($p$-value = $[3.07e-11]$, Wilcoxon signed-rank test), which is less effective due to overfitting.

ARXIV Cancer: general cancer Method: multimodal learning

TumorChain: Interleaved Multimodal Chain-of-Thought Reasoning for Traceable Clinical Tumor Analysis

Sijing Li, Zhongwei Qiu, Jiang Liu, Wenqiao Zhang, Tianwei Lin, Yihan Xie, Jianxiang An, Boxiang Yun, Chenglin Yang, Jun Xiao, Guangyu Guo, Jiawen Yao, Wei Liu, Yuan Gao, Ke Yan, Weiwei Cao, Zhilin Zheng, Tony C. W. Mok, Kai Cao, Yu Shi, Jiuyu Zhang, Jian Zhou, Beng Chin Ooi, Yingda Xia, Ling Zhang
Published 2026-03-06 03:42

This paper presents TumorChain, a multimodal interleaved reasoning framework designed for clinical tumor analysis. The framework integrates 3D imaging encoders and clinical text understanding to enhance the accuracy and traceability of tumor assessments. Through a large-scale dataset and iterative reasoning processes, TumorChain demonstrates significant improvements in lesion detection and pathology classification, emphasizing its potential for reliable clinical applications.

Read abstract

Accurate tumor analysis is central to clinical radiology and precision oncology, where early detection, reliable lesion characterization, and pathology-level risk assessment guide diagnosis and treatment planning. Chain-of-Thought (CoT) reasoning is particularly important in this setting because it enables step-by-step interpretation from imaging findings to clinical impressions and pathology conclusions, improving traceability and reducing diagnostic errors. Here, we target the clinical tumor analysis task and build a large-scale benchmark that operationalizes a multimodal reasoning pipeline, spanning findings, impressions, and pathology predictions. We curate TumorCoT, a large-scale dataset of 1.5M CoT-labeled VQA instructions paired with 3D CT scans, with step-aligned rationales and cross-modal alignments along the trajectory from findings to impression to pathology, enabling evaluation of both answer accuracy and reasoning consistency. We further propose TumorChain, a multimodal interleaved reasoning framework that tightly couples 3D imaging encoders, clinical text understanding, and organ-level vision-language alignment. Through cross-modal alignment and iterative interleaved causal reasoning, TumorChain grounds visual evidence, aggregates conclusions, and issues pathology predictions after multiple rounds of self-refinement, improving traceability and reducing hallucination risk. Experiments show consistent improvements over strong baselines in lesion detection, impression generation, and pathology classification, and demonstrate strong generalization on the DeepTumorVQA benchmark. These results highlight the potential of multimodal reasoning for reliable and interpretable tumor analysis in clinical practice. Detailed information about our project can be found on our project homepage at https://github.com/ZJU4HealthCare/TumorChain.

ARXIV Cancer: unknown Method: recurrent state-space model

BLINK: Behavioral Latent Modeling of NK Cell Cytotoxicity

Iman Nematollahi, Jose Francisco Villena-Ossa, Alina Moter, Kiana Farhadyar, Gabriel Kalweit, Abhinav Valada, Toni Cathomen, Evelyn Ullrich, Maria Kalweit
Published 2026-03-05 12:29

The paper presents BLINK, a trajectory-based recurrent state-space model designed to analyze NK cell cytotoxicity in tumor interactions. By learning latent dynamics from NK-tumor interaction sequences, BLINK predicts apoptosis increments that lead to cytotoxic outcomes. The model demonstrates improved detection of cytotoxic outcomes and offers an interpretable representation of NK cell behavior over time.

Read abstract

Machine learning models of cellular interaction dynamics hold promise for understanding cell behavior. Natural killer (NK) cell cytotoxicity is a prominent example of such interaction dynamics and is commonly studied using time-resolved multi-channel fluorescence microscopy. Although tumor cell death events can be annotated at single frames, NK cytotoxic outcome emerges over time from cellular interactions and cannot be reliably inferred from frame-wise classification alone. We introduce BLINK, a trajectory-based recurrent state-space model that serves as a cell world model for NK-tumor interactions. BLINK learns latent interaction dynamics from partially observed NK-tumor interaction sequences and predicts apoptosis increments that accumulate into cytotoxic outcomes. Experiments on long-term time-lapse NK-tumor recordings show improved cytotoxic outcome detection and enable forecasting of future outcomes, together with an interpretable latent representation that organizes NK trajectories into coherent behavioral modes and temporally structured interaction phases. BLINK provides a unified framework for quantitative evaluation and structured modeling of NK cytotoxic behavior at the single-cell level.

ARXIV Cancer: prostate cancer Method: prototype-based weakly-supervised framework

Adaptive Prototype-based Interpretable Grading of Prostate Cancer

Riddhasree Bhattacharyya, Pallabi Dutta, Sushmita Mitra
Published 2026-03-05 08:42

This study presents a novel prototype-based weakly-supervised framework for the interpretable grading of prostate cancer using histopathology images. The proposed method aims to enhance the interpretability of deep learning models, which traditionally lack transparency in their decision-making processes. By employing a prototype-aware loss function and an attention-based dynamic pruning mechanism, the framework demonstrates improved performance and reliability in assisting pathologists. Validation on benchmark datasets indicates its potential as a trustworthy tool in clinical diagnostics.

Read abstract

Prostate cancer being one of the frequently diagnosed malignancy in men, the rising demand for biopsies places a severe workload on pathologists. The grading procedure is tedious and subjective, motivating the development of automated systems. Although deep learning has made inroads in terms of performance, its limited interpretability poses challenges for widespread adoption in high-stake applications like medicine. Existing interpretability techniques for prostate cancer classifiers provide a coarse explanation but do not reveal why the highlighted regions matter. In this scenario, we propose a novel prototype-based weakly-supervised framework for an interpretable grading of prostate cancer from histopathology images. These networks can prove to be more trustworthy since their explicit reasoning procedure mirrors the workflow of a pathologist in comparing suspicious regions with clinically validated examples. The network is initially pre-trained at patch-level to learn robust prototypical features associated with each grade. In order to adapt it to a weakly-supervised setup for prostate cancer grading, the network is fine-tuned with a new prototype-aware loss function. Finally, a new attention-based dynamic pruning mechanism is introduced to handle inter-sample heterogeneity, while selectively emphasizing relevant prototypes for optimal performance. Extensive validation on the benchmark PANDA and SICAP datasets confirms that the framework can serve as a reliable assistive tool for pathologists in their routine diagnostic workflows.

ARXIV Cancer: brain tumor Method: federated learning

Federated Modality-specific Encoders and Partially Personalized Fusion Decoder for Multimodal Brain Tumor Segmentation

Hong Liu, Dong Wei, Qian Dai, Xian Wu, Yefeng Zheng, Liansheng Wang
Published 2026-03-05 07:25

This paper presents a novel federated learning framework called FedMEPD, designed to address intermodal heterogeneity in multimodal brain tumor segmentation. The framework utilizes modality-specific encoders and partially personalized fusion decoders to optimize model training across participants with varying imaging modalities. Experimental results demonstrate that FedMEPD outperforms existing methods on the BraTS 2018 and 2020 benchmarks, highlighting its effectiveness in personalized and multimodal settings.

Read abstract

Most existing federated learning (FL) methods for medical image analysis only considered intramodal heterogeneity, limiting their applicability to multimodal imaging applications. In practice, some FL participants may possess only a subset of the complete imaging modalities, posing intermodal heterogeneity as a challenge to effectively training a global model on all participants' data. Meanwhile, each participant expects a personalized model tailored to its local data characteristics in FL. This work proposes a new FL framework with federated modality-specific encoders and partially personalized multimodal fusion decoders (FedMEPD) to address the two concurrent issues. Specifically, FedMEPD employs an exclusive encoder for each modality to account for the intermodal heterogeneity. While these encoders are fully federated, the decoders are partially personalized to meet individual needs -- using the discrepancy between global and local parameter updates to dynamically determine which decoder filters are personalized. Implementation-wise, a server with full-modal data employs a fusion decoder to fuse representations from all modality-specific encoders, thus bridging the modalities to optimize the encoders via backpropagation. Moreover, multiple anchors are extracted from the fused multimodal representations and distributed to the clients in addition to the model parameters. Conversely, the clients with incomplete modalities calibrate their missing-modal representations toward the global full-modal anchors via scaled dot-product cross-attention, making up for the information loss due to absent modalities. FedMEPD is validated on the BraTS 2018 and 2020 multimodal brain tumor segmentation benchmarks. Results show that it outperforms various up-to-date methods for multimodal and personalized FL, and its novel designs are effective.

ARXIV Cancer: general cancer Method: deep learning

Structure Observation Driven Image-Text Contrastive Learning for Computed Tomography Report Generation

Hong Liu, Dong Wei, Qiong Peng, Yawen Huang, Xian Wu, Yefeng Zheng, Liansheng Wang
Published 2026-03-05 07:07

This paper presents a novel two-stage framework for Computed Tomography Report Generation (CTRG) that automates the clinical radiology reporting process. The first stage involves structure-wise image-text contrasting using learnable visual queries to observe CT image structures, while the second stage focuses on selecting critical image patches for report generation. The proposed method shows significant improvements in clinical efficiency and establishes new state-of-the-art performance in CTRG.

Read abstract

Computed Tomography Report Generation (CTRG) aims to automate the clinical radiology reporting process, thereby reducing the workload of report writing and facilitating patient care. While deep learning approaches have achieved remarkable advances in X-ray report generation, their effectiveness may be limited in CTRG due to larger data volumes of CT images and more intricate details required to describe them. This work introduces a novel two-stage (structure- and report-learning) framework tailored for CTRG featuring effective structure-wise image-text contrasting. In the first stage, a set of learnable structure-specific visual queries observe corresponding structures in a CT image. The resulting observation tokens are contrasted with structure-specific textual features extracted from the accompanying radiology report with a structure-wise image-text contrastive loss. In addition, text-text similarity-based soft pseudo targets are proposed to mitigate the impact of false negatives, i.e., semantically identical image structures and texts from non-paired images and reports. Thus, the model learns structure-level semantic correspondences between CT images and reports. Further, a dynamic, diversity-enhanced negative queue is proposed to guide the network in learning to discriminate various abnormalities. In the second stage, the visual structure queries are frozen and used to select the critical image patch embeddings depicting each anatomical structure, minimizing distractions from irrelevant areas while reducing memory consumption. Also, a text decoder is added and trained for report generation.Our extensive experiments on two public datasets demonstrate that our framework establishes new state-of-the-art performance for CTRG in clinical efficiency, and its components are effective.

ARXIV Cancer: brain tumor Method: deep learning

Meta-D: Metadata-Aware Architectures for Brain Tumor Analysis and Missing-Modality Segmentation

SangHyuk Kim, Daniel Haehn, Sumientra Rampersad
Published 2026-03-05 04:54

The paper introduces Meta-D, an architecture designed to enhance brain tumor analysis by incorporating categorical scanner metadata into deep learning pipelines. The method improves feature extraction for 2D tumor detection and 3D missing-modality segmentation, achieving significant increases in F1-score and Dice scores. By leveraging metadata, the architecture stabilizes feature representations and optimizes model performance even in scenarios with missing data.

Read abstract

We present Meta-D, an architecture that explicitly leverages categorical scanner metadata such as MRI sequence and plane orientation to guide feature extraction for brain tumor analysis. We aim to improve the performance of medical image deep learning pipelines by integrating explicit metadata to stabilize feature representations. We first evaluate this in 2D tumor detection, where injecting sequence (e.g., T1, T2) and plane (e.g., axial) metadata dynamically modulates convolutional features, yielding an absolute increase of up to 2.62% in F1-score over image-only baselines. Because metadata grounds feature extraction when data are available, we hypothesize it can serve as a robust anchor when data are missing. We apply this to 3D missing-modality tumor segmentation. Our Transformer Maximizer utilizes metadata-based cross-attention to isolate and route available modalities, ensuring the network focuses on valid slices. This targeted attention improves brain tumor segmentation Dice scores by up to 5.12% under extreme modality scarcity while reducing model parameters by 24.1%.

ARXIV Cancer: kidney tumor Method: adaptive spatial weighting

LAW & ORDER: Adaptive Spatial Weighting for Medical Diffusion and Segmentation

Anugunj Naman, Ayushman Singh, Gaibo Zhang, Yaguang Zhang
Published 2026-03-05 04:20

This paper presents a novel approach to medical image analysis by introducing adaptive spatial weighting to improve segmentation and synthesis of training images. The proposed methods, Learnable Adaptive Weighter (LAW) and Optimal Region Detection with Efficient Resolution (ORDER), enhance the allocation of computational resources and segmentation efficiency. Experiments on polyp and kidney tumor datasets show significant improvements in generative quality and segmentation accuracy compared to baseline methods.

Read abstract

Medical image analysis relies on accurate segmentation, and benefits from controllable synthesis (of new training images). Yet both tasks of the cyclical pipeline face spatial imbalance: lesions occupy small regions against vast backgrounds. In particular, diffusion models have been shown to drift from prescribed lesion layouts, while efficient segmenters struggle on spatially uncertain regions. Adaptive spatial weighting addresses this by learning where to allocate computational resources. This paper introduces a pair of network adapters: 1) Learnable Adaptive Weighter (LAW) which predicts per-pixel loss modulation from features and masks for diffusion training, stabilized via a mix of normalization, clamping, and regularization to prevent degenerate solutions; and 2) Optimal Region Detection with Efficient Resolution (ORDER) which applies selective bidirectional skip attention at late decoder stages for efficient segmentation. Experiments on polyp and kidney tumor datasets demonstrate that LAW achieves 20% FID generative improvement over a uniform baseline (52.28 vs. 65.60), with synthetic data then improving downstream segmentation by 4.9% Dice coefficient (83.2% vs. 78.3%). ORDER reaches 6.0% Dice improvement on MK-UNet (81.3% vs. 75.3%) with 0.56 GFLOPs and just 42K parameters, remaining 730x smaller than the standard nnUNet.

ARXIV Cancer: unknown Method: vision-language model

Thinking with Gaze: Sequential Eye-Tracking as Visual Reasoning Supervision for Medical VLMs

Yiwei Li, Zihao Wu, Yanjun Lv, Hanqi Jiang, Weihang You, Zhengliang Liu, Dajiang Zhu, Xiang Li, Quanzheng Li, Tianming Liu, Lin Zhao
Published 2026-03-05 02:12

This study explores the use of sequential eye-tracking data as a supervision method for vision-language models (VLMs) in medical imaging tasks. By incorporating gaze tokens that predict gaze-selected image patch indices, the model is encouraged to mimic human-like visual reasoning. The experiments demonstrate significant improvements in performance on both in-domain and out-of-domain tasks, indicating the effectiveness of this approach in enhancing medical reasoning capabilities.

Read abstract

Vision--language models (VLMs) process images as visual tokens, yet their intermediate reasoning is often carried out in text, which can be suboptimal for visually grounded radiology tasks. Radiologists instead diagnose via sequential visual search; eye-tracking captures this process as time-ordered gaze trajectories that reveal how evidence is acquired over time. We use eye-gaze as supervision to guide VLM reasoning by introducing a small set of dedicated gaze tokens. These tokens are trained to predict gaze-selected image patch indices in temporal order, encouraging the model to follow human-like evidence acquisition and integration. Experiments on MIMIC-EYE and multiple external zero-shot benchmarks show consistent gains over baselines, achieving state-of-the-art in-domain performance and improved out-of-domain robustness. These results highlight temporally ordered gaze as an effective supervision signal for learning visually grounded medical reasoning.

ARXIV Cancer: general cancer Method: dual-LoRA diffusion

Structure-Guided Histopathology Synthesis via Dual-LoRA Diffusion

Xuan Xu, Prateek Prasanna
Published 2026-03-04 19:50

This paper presents Dual-LoRA Controllable Diffusion, a novel framework for histopathology image synthesis that integrates local structure completion and global structure synthesis. The method utilizes multi-class nuclei centroids as spatial priors to enhance tissue synthesis under varying degrees of image absence. Experimental results show significant improvements in structural fidelity and realism compared to existing generative models, indicating its effectiveness for pan-cancer histopathology modeling.

Read abstract

Histopathology image synthesis plays an important role in tissue restoration, data augmentation, and modeling of tumor microenvironments. However, existing generative methods typically address restoration and generation as separate tasks, although both share the same objective of structure-consistent tissue synthesis under varying degrees of missingness, and often rely on weak or inconsistent structural priors that limit realistic cellular organization. We propose Dual-LoRA Controllable Diffusion, a unified centroid-guided diffusion framework that jointly supports Local Structure Completion and Global Structure Synthesis within a single model. Multi-class nuclei centroids serve as lightweight and annotation-efficient spatial priors, providing biologically meaningful guidance under both partial and complete image absence. Two task-specific LoRA adapters specialize the shared backbone for local and global objectives without retraining separate diffusion models. Extensive experiments demonstrate consistent improvements over state-of-the-art GAN and diffusion baselines across restoration and synthesis tasks. For local completion, LPIPS computed within the masked region improves from 0.1797 (HARP) to 0.1524, and for global synthesis, FID improves from 225.15 (CoSys) to 76.04, indicating improved structural fidelity and realism. Our approach achieves more faithful structural recovery in masked regions and substantially improved realism and morphology consistency in full synthesis, supporting scalable pan-cancer histopathology modeling.

Find the papers that actually matter