Research Papers

ARXIV Cancer: general cancer Method: Mixture-of-Experts

Sparse Spectral LoRA: Routed Experts for Medical VLMs

Omid Nejati Manzari, Hojat Asgariandehkordi, Taha Koleilat, Yiming Xiao, Hassan Rivaz
Published 2026-04-01 18:22

The paper presents MedQwen, a parameter-efficient medical vision-language model (VLM) designed to improve robustness in medical imaging tasks. It uses a spectrally routed Mixture-of-Experts (MoE) to mitigate cross-dataset interference and catastrophic forgetting during continual training. Across 23 medical datasets, the model approaches full fine-tuning on zero-shot classification with 339× fewer trainable parameters and holds sequential forgetting to about 5%, where strong baselines degrade by more than 20-50%.

Large vision-language models (VLMs) excel on general benchmarks but often lack robustness in medical imaging, where heterogeneous supervision induces cross-dataset interference and sensitivity to data regime (i.e., how the supervisory signals are mixed). In realistic clinical workflows, data and tasks arrive sequentially, so naive continual training further leads to catastrophic forgetting. To address these challenges, we propose MedQwen, a parameter-efficient medical VLM that couples a spectrally routed Mixture-of-Experts (MoE) with a theoretically grounded scaling rule that aligns low-rank updates with a full-rank, fully fine-tuned MoE, without changing the base architecture. Concretely, we initialize each expert from non-overlapping singular value decomposition (SVD) segments of the pretrained weight and introduce a residual compensation and scaling scheme to enable stable expert specialization and consistent routing under distribution shift. Across 23 medical datasets covering visual question answering, report generation, radiology classification, and hallucination mitigation, MedQwen achieves strong, reliable performance: it approaches full fine-tuning on zero-shot classification with 339× fewer trainable parameters, and reduces sequential forgetting to ~5% where strong baselines degrade by >20-50%.
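
The expert initialization described in the abstract can be illustrated with a minimal NumPy sketch. It assumes each expert takes a contiguous, non-overlapping block of singular directions and that the residual is simply whatever remains of the pretrained weight; the function name `init_svd_experts`, the square-root split of the singular values, and the toy matrix size are illustrative assumptions, and the paper's actual residual compensation and scaling rule are not given in the abstract.

```python
import numpy as np

def init_svd_experts(weight, num_experts, rank):
    """Split the leading singular directions of `weight` into disjoint low-rank experts.

    Hypothetical helper: one plausible reading of "non-overlapping SVD segments".
    """
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    if num_experts * rank > s.shape[0]:
        raise ValueError("num_experts * rank exceeds the number of singular values")
    experts = []
    for e in range(num_experts):
        seg = slice(e * rank, (e + 1) * rank)   # this expert's block of singular directions
        root = np.sqrt(s[seg])                  # split each singular value across both factors
        b = u[:, seg] * root                    # shape (out_dim, rank)
        a = root[:, None] * vt[seg, :]          # shape (rank, in_dim)
        experts.append((b, a))
    # Residual: the part of the pretrained weight not covered by any expert segment.
    residual = weight - sum(b @ a for b, a in experts)
    return experts, residual

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 6))                 # toy stand-in for a pretrained weight
experts, residual = init_svd_experts(w, num_experts=2, rank=2)
reconstructed = residual + sum(b @ a for b, a in experts)
```

Under these assumptions the residual plus all expert products reproduces the pretrained weight exactly at initialization, and the experts start in mutually orthogonal subspaces.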

ARXIV Cancer: head-and-neck cancer Method: Iteratively Prompting and Pseudo-labeling

Foundation Model-guided Iteratively Prompting and Pseudo-Labeling for Partially Labeled Medical Image Segmentation

Qiaochu Zhao, Wei Wei, David Horowitz, Richard Bakst, Yading Yuan
Published 2026-04-01 15:45

This study presents IPnP, an Iteratively Prompting and Pseudo-labeling framework for medical image segmentation with partially labeled data. A trainable segmentation network and a frozen foundation model collaborate to iteratively generate and refine pseudo-labels for unlabeled organs. On the public AMOS dataset with simulated partial labels, IPnP consistently improves on prior methods and approaches the fully labeled reference; it is also evaluated on a private dataset of 210 head-and-neck cancer patients.

Automated medical image segmentation has achieved remarkable progress with fully labeled data. However, site-specific clinical priorities and the high cost of manual annotation often yield scans with only a subset of organs labeled, leading to the partially labeled problem that degrades performance. To address this issue, we propose IPnP, an Iteratively Prompting and Pseudo-labeling framework, for partially labeled medical image segmentation. IPnP iteratively generates and refines pseudo-labels for unlabeled organs through collaboration between a trainable segmentation network (specialist) and a frozen foundation model (generalist), progressively recovering full-organ supervision. On the public dataset AMOS with the simulated partial-label setting, IPnP consistently improves segmentation performance over prior methods and approaches the performance of the fully labeled reference. We further evaluate on a private, partially labeled dataset of 210 head-and-neck cancer patients and demonstrate our effectiveness in real-world clinical settings.

ARXIV Cancer: prostate cancer Method: expectation-maximization algorithm

Maximizing T2-Only Prostate Cancer Localization from Expected Diffusion Weighted Imaging

Weixi Yi, Yipei Wang, Wen Yan, Hanyuan Zhang, Natasha Thorley, Alexander Ng, Shonit Punwani, Fernando Bianco, Mark Emberton, Veeru Kasivisvanathan, Dean C. Barratt, Shaheer U. Saeed, Yipeng Hu
Published 2026-04-01 14:50

This study explores prostate cancer localization from T2-weighted MRI alone, treating diffusion-weighted imaging (DWI) as a latent modality that is available only during training. The proposed method uses an expectation-maximization algorithm in which a generative model approximates the missing DWI while a cancer localizer is optimized alongside it. The T2-only method performs competitively with or better than multi-sequence baselines, for example improving the patient-level F1 score by 14.4% and zone-level QWK by 5.3% over the T2w+DWI baseline.

Multiparametric MRI is increasingly recommended as a first-line noninvasive approach to detect and localize prostate cancer, requiring at minimum diffusion-weighted (DWI) and T2-weighted (T2w) MR sequences. Early machine learning attempts using only T2w images have shown promising diagnostic performance in segmenting radiologist-annotated lesions. Such uni-modal T2-only approaches deliver substantial clinical benefits by reducing costs and expertise required to acquire other sequences. This work investigates an arguably more challenging application using only T2w at inference, but to localize individual cancers based on independent histopathology labels. We formulate DWI images as a latent modality (readily available during training) to classify cancer presence at local Barzell zones, given only T2w images as input. In the resulting expectation-maximization algorithm, a latent modality generator (implemented using a flow matching-based generative model) approximates the latent DWI image posterior distribution in the E-steps, while in M-steps a cancer localizer is simultaneously optimized with the generative model to maximize the expected likelihood of cancer presence. The proposed approach provides a novel theoretical framework for learning from a privileged DWI modality, yielding superior cancer localization performance compared to approaches that lack training DWI images or existing frameworks for privileged learning and incomplete modalities. The proposed T2-only methods perform competitively or better than baseline methods using multiple input sequences (e.g., improving the patient-level F1 score by 14.4% and zone-level QWK by 5.3% over the T2w+DWI baseline). We present quantitative evaluations using internal and external datasets from 4,133 prostate cancer patients with histopathology-verified labels.

ARXIV Cancer: brain tumor Method: convolutional neural network

OkanNet: A Lightweight Deep Learning Architecture for Classification of Brain Tumor from MRI Images

Okan Uçar, Murat Kurt
Published 2026-04-01 13:29

This study presents OkanNet, a lightweight deep learning architecture designed for the classification of brain tumors from MRI images. Two approaches were compared: a custom Convolutional Neural Network (CNN) and a Transfer Learning method using ResNet-50. The ResNet-50 model achieved higher accuracy at 96.49%, while OkanNet provided a faster alternative suitable for mobile systems, with an accuracy of 88.10%. The research highlights the balance between model complexity and computational efficiency in medical imaging.

Medical imaging techniques, especially Magnetic Resonance Imaging (MRI), are accepted as the gold standard in the diagnosis and treatment planning of neurological diseases. However, the manual analysis of MRI images is a time-consuming process for radiologists and is prone to human error due to fatigue. In this study, two different Deep Learning approaches were developed and analyzed comparatively for the automatic detection and classification of brain tumors (Glioma, Meningioma, Pituitary, and No Tumor). In the first approach, a custom Convolutional Neural Network (CNN) architecture named "OkanNet", which has a low computational cost and fast training time, was designed from scratch. In the second approach, the Transfer Learning method was applied using the 50-layer ResNet-50 [1] architecture, pre-trained on the ImageNet dataset. In experiments conducted on an extended dataset compiled by Masoud Nickparvar containing a total of 7,023 MRI images, the Transfer Learning-based ResNet-50 model exhibited superior classification performance, achieving 96.49% Accuracy and 0.963 Precision. In contrast, the custom OkanNet architecture reached an accuracy rate of 88.10%; however, it proved to be a strong alternative for mobile and embedded systems with limited computational power by yielding results approximately 3.2 times faster (311 seconds) than ResNet-50 in terms of training time. This study demonstrates the trade-off between model depth and computational efficiency in medical image analysis through experimental data.

ARXIV Cancer: general cancer Method: transformer-based model

BioCOMPASS: Integrating Biomarkers into Transformer-Based Immunotherapy Response Prediction

Sayed Hashim, Frank Soboczenski, Paul Cairns
Published 2026-04-01 11:06

The paper presents BioCOMPASS, an extension of the transformer-based COMPASS model that integrates biomarkers and treatment information into immunotherapy response prediction. Rather than feeding biomarker data as input, the authors build loss components that align it with the model's intermediate representations. Components such as treatment gating and pathway consistency loss improved generalisability under leave-one-cohort-out, leave-one-cancer-type-out, and leave-one-treatment-out evaluation.

Datasets used in immunotherapy response prediction are typically small in size, as well as diverse in cancer type, drug administered, and sequencer used. Models often drop in performance when tested on patient cohorts that are not included in the training process. Recent work has shown that transformer-based models along with self-supervised learning show better generalisation performance than threshold-based biomarkers, but their performance is still suboptimal. We present BioCOMPASS, an extension of a transformer-based model called COMPASS, that integrates biomarkers and treatment information to further improve its generalisability. Instead of feeding biomarker data as input, we built loss components to align them with the model's intermediate representations. We found that components such as treatment gating and pathway consistency loss improved generalisability when evaluated with Leave-one-cohort-out, Leave-one-cancer-type-out and Leave-one-treatment-out strategies. Results show that building components that exploit biomarker and treatment information can improve the generalisability of immunotherapy response prediction. Careful curation of additional components that leverage complementary clinical information and domain knowledge represents a promising direction for future research.

ARXIV Cancer: glioma Method: multimodal deep learning

Quantifying Cross-Modal Interactions in Multimodal Glioma Survival Prediction via InterSHAP: Evidence for Additive Signal Integration

Iain Swift, JingHua Ye, Ruairi O'Reilly
Published 2026-03-31 16:39

This study investigates the role of cross-modal interactions in multimodal deep learning for glioma survival prediction. By adapting the InterSHAP metric to Cox proportional hazards models, the authors analyze the performance of various fusion architectures that combine whole-slide images and RNA-seq features. The results indicate that higher predictive performance is associated with lower cross-modal interaction, suggesting that gains in performance are due to complementary signal aggregation rather than synergistic interactions.

Multimodal deep learning for cancer prognosis is commonly assumed to benefit from synergistic cross-modal interactions, yet this assumption has not been directly tested in survival prediction settings. This work adapts InterSHAP, a Shapley interaction index-based metric, from classification to Cox proportional hazards models and applies it to quantify cross-modal interactions in glioma survival prediction. Using TCGA-GBM and TCGA-LGG data (n=575), we evaluate four fusion architectures combining whole-slide image (WSI) and RNA-seq features. Our central finding is an inverse relationship between predictive performance and measured interaction: architectures achieving superior discrimination (C-index 0.64→0.82) exhibit equivalent or lower cross-modal interaction (4.8%→3.0%). Variance decomposition reveals stable additive contributions across all architectures (WSI≈40%, RNA≈55%, Interaction≈4%), indicating that performance gains arise from complementary signal aggregation rather than learned synergy. These findings provide a practical model auditing tool for comparing fusion strategies, reframe the role of architectural complexity in multimodal fusion, and have implications for privacy-preserving federated deployment.
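
To make the distinction between additive and synergistic fusion concrete, the sketch below computes the two-player Shapley interaction term for a model with two modality inputs. This is not the InterSHAP metric itself, whose exact normalization is not given in the abstract: the masking-by-baseline scheme, the toy risk functions, and all names are assumptions chosen only to show that an additive model has zero interaction.

```python
import numpy as np

def pairwise_interaction(model, wsi, rna, wsi_base, rna_base):
    """Two-player Shapley interaction: v(both) - v(wsi only) - v(rna only) + v(neither).

    A modality is "removed" by replacing it with a baseline input (an assumption).
    """
    v_both = model(wsi, rna)
    v_wsi = model(wsi, rna_base)
    v_rna = model(wsi_base, rna)
    v_none = model(wsi_base, rna_base)
    return v_both - v_wsi - v_rna + v_none

def additive_risk(wsi, rna):
    # Toy risk score: each modality contributes independently.
    return 0.4 * wsi.sum() + 0.55 * rna.sum()

def synergistic_risk(wsi, rna):
    # Same additive part plus a cross-modal product term.
    return additive_risk(wsi, rna) + 0.1 * wsi.sum() * rna.sum()

wsi = np.array([1.0, 2.0])
rna = np.array([0.5, 1.5])
wsi_base = np.zeros_like(wsi)
rna_base = np.zeros_like(rna)

additive_term = pairwise_interaction(additive_risk, wsi, rna, wsi_base, rna_base)
synergy_term = pairwise_interaction(synergistic_risk, wsi, rna, wsi_base, rna_base)
```

In this toy setting the additive model yields an interaction term of zero, while the product term in the second model is recovered in full, which is the kind of contrast an interaction metric is meant to expose.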

ARXIV Cancer: glioma Method: trimodal deep learning

Trimodal Deep Learning for Glioma Survival Prediction: A Feasibility Study Integrating Histopathology, Gene Expression, and MRI

Iain Swift, JingHua Ye
Published 2026-03-31 16:32

This study investigates the integration of histopathology, gene expression, and MRI data using a trimodal deep learning approach to predict survival in glioma patients. The research evaluates various unimodal, bimodal, and trimodal configurations, finding that trimodal early fusion yields a Composite Score of 0.854, although the improvement over the bimodal baseline is not statistically significant. The results suggest that incorporating additional imaging modalities may enhance prognostic accuracy, even in small cohorts.

Multimodal deep learning has improved prognostic accuracy for brain tumours by integrating histopathology and genomic data, yet the contribution of volumetric MRI within unified survival frameworks remains unexplored. This pilot study extends a bimodal framework by incorporating Fluid Attenuated Inversion Recovery (FLAIR) MRI from BraTS2021 as a third modality. Using the TCGA-GBMLGG cohort (664 patients), we evaluate three unimodal models, nine bimodal configurations, and three trimodal configurations across early, late, and joint fusion strategies. In this small cohort setting, trimodal early fusion achieves an exploratory Composite Score (CS = 0.854), with a controlled ΔCS of +0.011 over the bimodal baseline on identical patients, though this difference is not statistically significant (p = 0.250, permutation test). MRI achieves reasonable unimodal discrimination (CS = 0.755) but does not substantially improve bimodal pairs, while providing measurable uplift in the three-way combination. All MRI-containing experiments are constrained to 19 test patients, yielding wide bootstrap confidence intervals (e.g. [0.400,1.000]) that preclude definitive conclusions. These findings provide preliminary evidence that a third imaging modality may add prognostic value even with limited sample sizes, and that additional modalities require sufficient multimodal context to contribute effectively.
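
The wide intervals reported for 19 test patients can be illustrated with a percentile bootstrap on a small synthetic cohort. The sketch below uses a plain concordance index rather than the paper's Composite Score, whose definition is not given in the abstract; the synthetic survival times, event rate, and risk scores are made up for illustration, and all function names are hypothetical.

```python
import numpy as np

def concordance_index(times, events, risks):
    """Fraction of comparable pairs in which the earlier event has the higher risk."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i had an observed event before time j.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable if comparable else float("nan")

def bootstrap_ci(times, events, risks, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap interval obtained by resampling patients with replacement."""
    rng = np.random.default_rng(seed)
    n = len(times)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        c = concordance_index(times[idx], events[idx], risks[idx])
        if not np.isnan(c):          # skip resamples with no comparable pairs
            stats.append(c)
    low, high = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return low, high

# Synthetic 19-patient test set (illustrative values only).
rng = np.random.default_rng(1)
times = rng.exponential(24.0, size=19)
events = rng.random(19) < 0.7
risks = -times + rng.normal(0.0, 12.0, size=19)

low, high = bootstrap_ci(times, events, risks)
```

With so few patients, each resample contains only a handful of comparable pairs, so the interval is typically far wider than it would be on a cohort of several hundred.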

ARXIV Cancer: unknown Method: nonnegative matrix factorization

Diffusion-Based Feature Denoising with NNMF for Robust handwritten digit multi-class classification

Hiba Adil Al-kharsan, Róbert Rajkó
Published 2026-03-31 15:57

This study introduces a robust framework for multi-class classification of handwritten digits, utilizing diffusion-driven feature denoising alongside a hybrid feature representation. The method employs Nonnegative Matrix Factorization (NNMF) for feature extraction and combines it with deep features from a convolutional neural network (CNN). The approach aims to enhance robustness against noise and adversarial attacks, demonstrating effective performance in both baseline and adversarial settings.

This work presents a robust multi-class classification framework for handwritten digits that combines diffusion-driven feature denoising with a hybrid feature representation. Inspired by our previous work on brain tumor classification, the proposed approach operates in feature space to improve robustness to noise and adversarial attacks. First, the input images are converted into compact, interpretable representations using Nonnegative Matrix Factorization (NNMF). In parallel, deep features are extracted using a convolutional neural network (CNN). These complementary features are combined into a unified hybrid representation. To improve robustness, a stepwise diffusion operation is applied in the feature space by gradually adding Gaussian noise. A feature denoiser network is trained to reverse this operation and reconstruct clean representations from perturbed inputs. The denoised features are then used for multi-class classification. The proposed method is evaluated in both baseline and adversarial settings using AutoAttack. The experimental results show that the diffusion-based hybrid model is both effective and robust, outperforming the CNN baseline models while maintaining strong classification performance. These results demonstrate the effectiveness of feature-level diffusion defense for reliable multi-class handwritten digit classification.
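
The step of gradually adding Gaussian noise in feature space can be sketched with the standard closed-form forward process from denoising diffusion models. The abstract does not state which noise schedule is used, so the linear schedule, the 100 steps, the 16-dimensional features, and the name `forward_diffuse` are assumptions; the denoiser network that learns to reverse the process is not shown.

```python
import numpy as np

def forward_diffuse(features, t, betas, rng):
    """Noise clean features to step t: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps."""
    alpha_bar = np.cumprod(1.0 - betas)[t]        # cumulative signal retention up to step t
    noise = rng.standard_normal(features.shape)   # Gaussian noise, the denoiser's training target
    noisy = np.sqrt(alpha_bar) * features + np.sqrt(1.0 - alpha_bar) * noise
    return noisy, noise

betas = np.linspace(1e-4, 0.02, 100)              # assumed linear noise schedule
rng = np.random.default_rng(0)
features = rng.random((4, 16))                    # stand-in for hybrid NNMF + CNN features

noisy_early, noise_early = forward_diffuse(features, 0, betas, rng)
noisy_late, noise_late = forward_diffuse(features, 99, betas, rng)
```

Early steps leave the features almost unchanged while late steps are dominated by noise, so a denoiser trained across all steps sees a range of corruption levels.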

ARXIV Cancer: lung cancer Method: multimodal learning

Less Is More? Selective Visual Attention to High-Importance Regions for Multimodal Radiology Summarization

Mst. Fahmida Sultana Naznin, Adnan Ibney Faruq, Mushfiqur Rahman, Niloy Kumar Mondal, Md. Mehedi Hasan Shawon, Md Rakibul Hasan
Published 2026-03-31 15:47

This study investigates selective visual attention in automated radiology report summarization, challenging the assumption that more visual input is always better. The authors introduce ViTAS, a multi-stage pipeline that focuses on pathology-relevant visual patches rather than full images. On the MIMIC-CXR benchmark, ViTAS reports state-of-the-art results of 29.25% BLEU-4 and 69.83% ROUGE-L, along with the highest expert-rated human evaluation scores.

Automated radiology report summarization aims to distill verbose findings into concise clinical impressions, but existing multimodal models often struggle with visual noise and fail to meaningfully improve over strong text-only baselines in the FINDINGS → IMPRESSION transformation. We challenge two prevailing assumptions: (1) that more visual input is always better, and (2) that multimodal models add limited value when findings already contain rich image-derived detail. Through controlled ablations on the MIMIC-CXR benchmark, we show that selectively focusing on pathology-relevant visual patches rather than full images yields substantially better performance. We introduce ViTAS, Visual-Text Attention Summarizer, a multi-stage pipeline that combines ensemble-guided MedSAM2 lung segmentation, bidirectional cross-attention for multi-view fusion, Shapley-guided adaptive patch clustering, and hierarchical visual tokenization feeding a ViT. ViTAS achieves SOTA results with 29.25% BLEU-4 and 69.83% ROUGE-L, improved factual alignment in qualitative analysis, and the highest expert-rated human evaluation scores. Our findings demonstrate that less but more relevant visual input is not only sufficient but superior for multimodal radiology summarization.

ARXIV Cancer: breast cancer Method: multimodal machine learning

Multimodal Machine Learning for Early Prediction of Metastasis in a Swedish Multi-Cancer Cohort

Franco Rugolon, Korbinian Randl, Braslav Jovanovic, Ioanna Miliou, Panagiotis Papapetrou
Published 2026-03-31 14:27

This study presents a multimodal machine learning framework that predicts metastasis risk one month prior to diagnosis from six months of electronic health record data. It uses data from four cancer cohorts (breast, colon, lung, and prostate) and compares traditional and deep learning classifiers across single modalities and multimodal fusion strategies. Intermediate fusion achieved the highest F1 scores for breast, colon, and prostate cancer, while a text-only model scored slightly higher for lung cancer (0.829 versus 0.819); deep learning models consistently outperformed traditional approaches.

Multimodal Machine Learning offers a holistic view of a patient's status, integrating structured and unstructured data from electronic health records (EHR). We propose a framework to predict metastasis risk one month prior to diagnosis, using six months of clinical history from EHR data. Data from four cancer cohorts collected at Karolinska University Hospital (Stockholm, Sweden) were analyzed: breast (n = 743), colon (n = 387), lung (n = 870), and prostate (n = 1890). The dataset included demographics, comorbidities, laboratory results, medications, and clinical text. We compared traditional and deep learning classifiers across single modalities and multimodal combinations, using various fusion strategies and a Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) 2a design, with an 80-20 development-validation split to ensure a rigorous, repeatable evaluation. Performance was evaluated using AUROC, AUPRC, F1 score, sensitivity, and specificity. We then employed a multimodal adaptation of SHAP to analyze the classifiers' reasoning. Intermediate fusion achieved the highest F1 scores on breast (0.845), colon (0.786), and prostate cancer (0.845), demonstrating strong predictive performance. For lung cancer, the intermediate fusion achieved an F1 score of 0.819, while the text-only model achieved the highest, with an F1 score of 0.829. Deep learning classifiers consistently outperformed traditional models. Colon cancer, the smallest cohort, had the lowest performance, highlighting the importance of sufficient training data. SHAP analysis showed that the relative importance of modalities varied across cancer types. Fusion strategies offer distinct strengths and weaknesses. Intermediate fusion consistently delivered the best results, but strategy choices should align with data characteristics and organizational needs.
