Research Papers

ARXIV Cancer: general cancer Method: large language model ensemble

Whole-body CT attenuation and volume charts from routine clinical scans via evidence-grounded LLM report filtering

Christian Wachinger, Bernhard Renger, Christopher Späth, Jan Kirschke, Marcus Makowski
Published 2026-05-07 09:40

This study develops an evidence-grounded large language model (LLM) ensemble to filter pathological findings from radiology reports, allowing for the creation of pathology-reduced cohorts from over 350,000 CT examinations. The approach utilizes distribution-aware generalized additive models to establish comprehensive whole-body reference charts for anatomical structures, facilitating standardized quantitative phenotyping and multi-site imaging studies.

Interpreting quantitative CT biomarkers, such as organ volume and tissue attenuation, requires large-scale healthy reference distributions. However, creating these is challenging because clinical datasets are often heavily enriched with pathology. Here, we develop an evidence-grounded, cross-verified large language model (LLM) ensemble to filter pathological findings from radiology reports, enabling the construction of pathology-reduced cohorts from over 350,000 CT examinations. Five LLMs first flag structure-level abnormality candidates grounded in verbatim report evidence and then resolve disagreements via cross-verification. Using distribution-aware generalized additive models for location, scale, and shape, we establish comprehensive whole-body reference charts for 106 anatomical structures (volumes and attenuation) across adulthood, accounting for age, sex, contrast enhancement, and acquisition parameters. Longitudinal analyses reveal structure- and contrast-dependent changes distinct from cross-sectional trends. These resources facilitate covariate-adjusted centile scoring from routine CT, supporting standardized quantitative phenotyping, multi-site imaging studies, and scalable opportunistic screening research.
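
To make the idea of centile scoring concrete, the sketch below converts a single measurement into a centile under a normal reference distribution. This is a simplification and not the study's method: generalized additive models for location, scale, and shape can use non-normal families, and the `mu` and `sigma` values here are hypothetical placeholders, not fitted chart parameters.

```python
from statistics import NormalDist


def centile_score(value: float, mu: float, sigma: float) -> float:
    """Return the centile (0-100) of `value` under Normal(mu, sigma)."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return 100.0 * NormalDist(mu, sigma).cdf(value)


# Hypothetical reference parameters for one structure, already adjusted
# for covariates such as age, sex, and contrast phase (assumed values).
measured_volume_ml = 1650.0
score = centile_score(measured_volume_ml, mu=1500.0, sigma=300.0)
print(f"centile: {score:.1f}")
```

In a real chart the location and scale would be functions of the covariates, so the same measurement can map to different centiles for different patients.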

ARXIV Cancer: rectal cancer Method: hierarchical shifted-window transformer

Tumor-aware augmentation with task-guided attention analysis improves rectal cancer segmentation from magnetic resonance images

Aneesh Rangnekar, Joao Miranda, Natally Horvat, Stephanie Chahwan, Samir Alrayess, Aditya Apte, Aditi Iyer, Eve LoCastro, Revathi Ravella, Marc J Gollub, Iva Petkovska, Jesse Joshua Smith, Paul Romesser, Julio Garcia-Aguilar, Harini Veeraraghavan, Joseph Deasy
Published 2026-05-06 23:46

This study investigates the challenges of transferring pretrained transformer models from CT to MRI for rectal cancer segmentation. It identifies two main failure modes related to token usage and feature adaptation, which lead to accuracy degradation despite fine-tuning. The authors propose a tumor-aware augmentation strategy and an anisotropic cropping method to enhance model performance, achieving improved detection rates in rectal MRI datasets.

Pretraining on large-scale datasets has been shown to improve transformer generalizability, even for out-of-domain (OOD) modalities and tasks. However, two common assumptions often fail under OOD transfer: that downstream datasets can be adapted to the fixed input geometry of pretrained models and that pretrained representations transfer effectively across imaging modalities. We show that these assumptions break down through two interacting failure modes in CT-to-MRI transfer: inefficient token usage caused by zero-padding to match pretrained input dimensions and ineffective feature adaptation. These failures led to accuracy degradation despite extensive fine-tuning. We investigated these failure modes using two CT-pretrained hierarchical shifted-window transformer backbones, SMIT and Swin UNETR, pretrained with different objectives and datasets. Mechanistic analysis introduced an attention dilution index (ADI), an entropy-based metric quantifying attention diverted toward uninformative padding tokens, and centered kernel alignment (CKA) to measure feature reuse in MRI tasks. ADI increased with zero-padding, while high feature reuse did not necessarily correspond to improved accuracy. To mitigate these issues, we introduced two interventions: a tumor-aware augmentation strategy to improve tumor appearance heterogeneity coverage and an anisotropic cropping strategy to restore token efficiency. Fine-tuning on identical rectal MRI datasets improved detection rates to 224/247 (90.7%) for SMIT and 219/247 (88.7%) for Swin UNETR, demonstrating improved robustness under CT-to-MRI transfer. This study is among the first to examine when pretrained transformers fail to transfer effectively across imaging modalities and how simple mitigation strategies, motivated by mechanistic analysis of datasets, can reduce transfer limitations while improving robustness and MRI detection.
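
For readers unfamiliar with centered kernel alignment, the following is a minimal sketch of the linear variant, written with the standard library only. It assumes features are given as lists of rows (one row per sample); the abstract does not say which CKA variant or which layers the authors compare, and the example matrices are made up.

```python
import math


def _center(rows):
    """Subtract the per-feature mean from every row."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]


def _gram(rows):
    """Linear-kernel Gram matrix (sample-by-sample inner products)."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in rows] for r1 in rows]


def _frob_inner(a, b):
    """Frobenius inner product of two equally sized matrices."""
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def linear_cka(x, y):
    """Linear CKA between two feature sets measured on the same samples."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of samples")
    k = _gram(_center(x))
    l = _gram(_center(y))
    return _frob_inner(k, l) / math.sqrt(_frob_inner(k, k) * _frob_inner(l, l))


# Made-up features: y is x rotated by 90 degrees and scaled by 3.
x = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
y = [[-3.0 * b, 3.0 * a] for a, b in x]
print(linear_cka(x, y))
```

Linear CKA is invariant to rotations and uniform scaling of the features, which is why the rotated, rescaled copy scores as essentially identical. A high score therefore indicates similar representational geometry, not necessarily better task accuracy, which is consistent with the abstract's observation about feature reuse.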

ARXIV Cancer: general cancer Method: structured state space sequence model

Geometry-Aware State Space Model: A New Paradigm for Whole-Slide Image Representation

Enhui Chai, Sicheng Chen, Tianyi Zhang, Chad Wong, Kecheng Huang, Zeyu Liu, Fei Xia
Published 2026-05-06 17:33

This paper presents BatMIL, a whole-slide image classification framework that utilizes a hybrid hyperbolic-Euclidean representation to improve the analysis of histopathological images. The method addresses limitations of existing Multiple Instance Learning approaches by incorporating a structured state space sequence model and a mixture-of-experts module to enhance representational capacity. Extensive experiments across multiple datasets show that BatMIL outperforms current state-of-the-art methods in slide-level classification tasks.

Accurate analysis of histopathological images is critical for disease diagnosis and treatment planning. Whole-slide images (WSIs), which digitize tissue specimens at gigapixel resolution, are fundamental to this process but require aggregating thousands of patches for slide-level predictions. Multiple Instance Learning (MIL) tackles this challenge with a two-stage paradigm, decoupling tile-level embedding and slide-level prediction. However, most existing methods implicitly embed patch representations in homogeneous Euclidean spaces, overlooking the hierarchical organization and regional heterogeneity of pathological tissues. This limits current models' ability to capture global tissue architecture and fine-grained cellular morphology. To address this limitation, we introduce a hybrid hyperbolic-Euclidean representation that embeds WSI features in dual geometric spaces, enabling complementary modeling of hierarchical tissue structures and local morphological details. Building on this formulation, we develop BatMIL, a WSI classification framework that leverages both geometric spaces. To model long-range dependencies among thousands of patches, we employ a structured state space sequence model (S4) backbone that encodes patch sequences with linear computational complexity. Furthermore, to account for regional heterogeneity, we introduce a chunk-level mixture-of-experts (MoE) module that groups patches into regions and dynamically routes them to specialized subnetworks, improving representational capacity while reducing redundant computation. Extensive experiments on seven WSI datasets spanning six cancer types demonstrate that BatMIL consistently outperforms state-of-the-art MIL approaches in slide-level classification tasks. These results indicate that geometry-aware representation learning offers a promising direction for next-generation computational pathology.
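
As a rough illustration of why hyperbolic space suits hierarchical structure, the sketch below compares Euclidean distance with geodesic distance in the Poincaré ball. The choice of the Poincaré ball is an assumption made here for illustration; the abstract does not state which hyperbolic model BatMIL uses, and the points are arbitrary.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit ball."""
    sq_u = sum(a * a for a in u)
    sq_v = sum(b * b for b in v)
    if sq_u >= 1.0 or sq_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))


# Two pairs with the same Euclidean separation (0.1): one pair near the
# origin, one pair near the boundary of the ball.
near_origin = ([0.0, 0.0], [0.1, 0.0])
near_boundary = ([0.85, 0.0], [0.95, 0.0])

for u, v in (near_origin, near_boundary):
    print(math.dist(u, v), poincare_distance(u, v))
```

The hyperbolic distance grows rapidly toward the boundary, so there is room for many well-separated "leaf" points there while "root" points sit near the origin.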

ARXIV Cancer: breast cancer Method: deep learning

External Validation of Deep Learning Models for BI-RADS Breast Density Prediction from Ultrasound Images

Yuxuan Chen, Arianna Bunnell, Yanqi Xu, Haoyan Yang, Thomas K. Wolfgruber, John A. Shepherd, Yiqiu Shen
Published 2026-05-06 16:19

This study externally validated three deep learning models for predicting mammographic breast density from breast ultrasound exams using an independent cohort of 2,000 ultrasound exams. The models demonstrated strong performance, particularly in extremely dense breasts, with DenseNet121 achieving the highest overall performance. The findings suggest that deep learning models can generalize well to diverse external data for breast density assessment, although challenges remain for heterogeneously dense breasts.

We externally validated three deep learning models (DenseNet121, ViT-B/32, and ResNet50) for predicting mammographic breast density from breast ultrasound exams on an independent cohort. The external validation set comprised 2,000 ultrasound exams, including 500 cancer cases defined by an initial negative exam (BI-RADS 1 or 2) followed by a cancer diagnosis within 6 months to 10 years, and 1,500 negative controls matched by manufacturer and study year. Performance was measured using patient-level AUROC across four density categories: A (fatty), B (scattered), C (heterogeneous), and D (extremely dense). As a downstream assessment, we also evaluated 10-year risk prediction by incorporating age and AI-derived density into the Tyrer-Cuzick model and comparing performance against a reference model using age and mammography-reported density. All three models performed best in extremely dense breasts (AUROC 0.868-0.899), with strong performance in fatty (0.814-0.838) and scattered density (0.764-0.799), and lower performance in heterogeneously dense breasts (0.699-0.729). DenseNet121 achieved the highest overall performance (micro-averaged AUROC 0.885), and performance across categories was comparable between internal and external testing. For risk modeling, age combined with AI-derived density yielded a lower AUROC than age combined with mammography-reported density (0.541 vs. 0.570; p = 0.23), with no statistically significant difference. These findings indicate that deep learning models generalize well to external data with different racial composition for breast density assessment. While performance is strongest in extremely dense breasts, heterogeneously dense remains more challenging, highlighting the need for targeted optimization.

ARXIV Cancer: general cancer Method: large language model

Curated AI beats frontier LLMs at pharma asset discovery

Łukasz Kidziński, Kevin Thomas
Published 2026-05-06 13:36

The study evaluates Gosset, an AI platform for pharmaceutical asset discovery, against four frontier large language models (LLMs) with web access on ten niche oncology and immunology targets. Gosset returns 3.2 times more verified drugs per query than the best frontier system, at perfect precision and 100% recall against the cross-system union of verified drugs. This suggests that a curated index can enhance the capabilities of existing AI systems in drug discovery.

General-purpose LLMs with web search are increasingly used to scout the competitive landscape of pharmaceutical pipelines. We benchmark Gosset -- an AI platform with a chat interface backed by curated target-, modality-, and indication-level drug-asset annotations -- against four frontier systems with web access (Claude Opus 4.7, GPT 5.5, Gemini 3.1 Pro, Perplexity sonar-pro) on ten niche oncology/immunology targets where most of the pipeline lives in the long tail of preclinical and Asian-developed assets. All five systems receive the same natural-language query and the same JSON output schema. Across 10 targets Gosset returns 3.2x more verified drugs per query than the best frontier system, at perfect precision and 100% recall against the cross-system union of verified drugs. The same curated index is exposed as a Gosset MCP server that any frontier model can call as a tool, suggesting that each of these systems can close most of the recall gap by swapping generic web search for a curated index behind the same chat interface.
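
The phrase "recall against the cross-system union of verified drugs" can be illustrated with a small sketch. All system names, drug identifiers, and the verification set below are hypothetical; the benchmark's actual verification procedure is not described in the abstract.

```python
def precision_recall(returned, reference):
    """Precision and recall of a returned set against a reference set."""
    returned, reference = set(returned), set(reference)
    true_positives = returned & reference
    precision = len(true_positives) / len(returned) if returned else 0.0
    recall = len(true_positives) / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical outputs of two systems for one target query.
outputs = {
    "system_a": {"drug-1", "drug-2", "drug-3", "drug-4"},
    "system_b": {"drug-1", "drug-5"},
}
# Hypothetical verification result: drug-5 could not be confirmed.
verified = {"drug-1", "drug-2", "drug-3", "drug-4"}

# Reference set: every verified drug that any system returned.
union_of_verified = set().union(*outputs.values()) & verified

for name, drugs in outputs.items():
    print(name, precision_recall(drugs, union_of_verified))
```

Because the reference set is built only from what the compared systems returned, recall measured this way is relative to the pool of systems, not to the full set of assets that exist.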

ARXIV Cancer: breast cancer Method: pathology foundation model

A Breast Vision Pathology Foundation Model for Real-world Clinical Utility

Yingxue Xu, Zhengyu Zhang, Xiuming Zhang, Mengwei Xu, Fengtao Zhou, Yihui Wang, Jiabo Ma, Yi Xin, Danyi Li, Chengyu Lu, Zhijian Cen, Ying Tan, Qingbing Yao, Qi Wang, Zizhao Gao, Yong Zhang, Jingjing Chen, Feifei Liu, Qian Xu, Yi Dai, Hongxuan Tan, Cheng Jin, Huajun Zhou, Zhengrui Guo, Ling Liang, Hongyi Wang, Yingcong Chen, Xi Wang, Zhenhui Li, Ronald Cheong Kin Chan, Ning Mao, Muyan Cai, Zhe Wang, Li Liang, Hao Chen
Published 2026-05-06 07:44

The study presents BRAVE, a breast-adaptive pathology foundation model designed to enhance clinical utility in breast cancer diagnostics. Evaluated using over 101,000 whole-slide images, BRAVE demonstrated its effectiveness across various clinical tasks, including pre-operative, intra-operative, and post-operative assessments. The model significantly improved diagnostic accuracy and workflow efficiency, supporting the exclusion of low-risk cases and enhancing the identification of high-confidence cases.

Pathology foundation models have shown strong retrospective performance, but whether such systems can support clinically relevant use remains unclear. This challenge is particularly important in breast cancer, where pathological assessment serves as the gold standard for diagnosis and guides treatment planning, surgical decision-making and risk stratification across pre-, intra- and post-operative stages. Here we present BRAVE, a breast-adaptive pathology foundation model developed and evaluated using a total resource of 101,638 breast whole-slide images from 32 sources across Asia, Europe and North America. We assessed BRAVE across 34 tasks in 82 cohorts spanning pre-operative biopsy, intra-operative frozen section and post-operative resection, using an evidence chain comprising retrospective benchmarking, clinically challenging scenarios, workflow-oriented clinical impact simulations, prospective observational validation with the thresholds locked in the retrospective cohorts and crossover pathologist-AI interaction studies. Across these settings, BRAVE supported practical roles in the clinical workflow, including safe exclusion of low-risk cases from routine review, AI-assisted second-review rescue of initially missed positives and prioritization of cases for further assessment. In prospective validation across three centres, BRAVE excluded 76.9% of negative biopsy cases (NPV 0.953) and 70.1% of negative frozen-section cases (NPV 0.973), and triaged 78.8% of post-operative subtyping cases as high-confidence clear-cut cases (NPV 1.000). In reader studies, AI assistance improved balanced accuracy from 88.5% to 95.1% (OR 3.14, P<0.001), with better efficiency, confidence and inter-rater agreement. BRAVE-derived scores also independently predicted disease-free survival (adjusted HR 4.79, P<0.001) and overall survival (adjusted HR 8.14, P<0.001).

ARXIV Cancer: brain tumor Method: 3D U-Net

DALight-3D: A Lightweight 3D U-Net for Brain Tumor Segmentation from Multi-Modal MRI

Nand Kumar Mishra, Dhruv Mishra, Dr Manu Pratap Singh
Published 2026-05-06 05:54

This paper introduces DALight-3D, a lightweight variant of the 3D U-Net designed for automatic brain tumor segmentation from multi-modal MRI. The method combines depthwise separable 3D convolutions with identifier-conditioned normalization, cross-slice attention, and adaptive skip fusion to reduce computational cost while maintaining performance. Evaluated against several baseline models, DALight-3D achieves a mean Dice score of 0.727 with 2.22M parameters, indicating a competitive accuracy-efficiency trade-off.

Automatic brain tumor segmentation from multi-modal MRI remains challenging because volumetric models often incur substantial computational cost. This paper presents DALight-3D, a compact 3D U-Net variant that combines depthwise separable 3D convolutions, identifier-conditioned normalization, cross-slice attention, and adaptive skip fusion. The method is evaluated on the Medical Segmentation Decathlon Task01 BrainTumour benchmark under matched optimization settings against standard 3D U-Net, Attention U-Net, Residual 3D U-Net, and V-Net baselines. In the reported 50-epoch comparison, DALight-3D achieves a mean Dice of 0.727 with 2.22M parameters, compared with 0.710 Dice and 3.20M parameters for Residual 3D U-Net. Component-wise ablations show consistent performance degradation when SepConv, identifier-conditioned normalization, CSA, or SSFB is removed. These results indicate that DALight-3D offers a favorable accuracy-efficiency trade-off within the present benchmark setting.
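
The Dice metric reported above can be sketched for binary masks as follows. This is a generic definition on flattened 0/1 masks, not the benchmark's evaluation code; returning 1.0 when both masks are empty is one common convention and is an assumption here.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total


# Toy example: five voxels, two of which are labeled tumor in both masks.
predicted = [1, 1, 0, 0, 1]
reference = [1, 0, 0, 1, 1]
print(round(dice_score(predicted, reference), 3))  # 0.667
```

A mean Dice across tumor sub-regions, as quoted in segmentation papers, averages this quantity over the regions and cases being evaluated.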

ARXIV Cancer: general cancer Method: unknown

A Zero-Inflated Beta Mixture Model for Marginal Mediation Analysis with Compositional Microbiome Mediators

Seungjun Ahn, Quran Wu, Alicia Yang, Zhigang Li
Published 2026-05-06 00:35

This paper introduces a zero-inflated beta mixture (ZIBM) method designed for mediation analysis involving compositional microbiome mediators. The method addresses challenges such as sparsity and compositional constraints in microbiome data, providing estimates of marginal microbiome-mediated causal effects. Simulation studies show that ZIBM offers improved accuracy and reliability compared to existing methods, and its practical utility is demonstrated through a real microbiome study.

The role of the microbiome in disease pathogenesis is an emerging field with strong evidence suggesting that dysbiosis is associated with precancerous and cancerous states. Microbiome data present substantial challenges for causal mediation analysis due to sparsity, compositional constraints, and latent heterogeneity. To address these issues, we propose a zero-inflated beta mixture (ZIBM) method for mediation analysis with compositional microbiome mediators. The proposed method accommodates excess zeros through a zero-inflation component and captures heterogeneity in non-zero relative abundances using a beta mixture distribution. Within the potential-outcomes framework, the ZIBM provides estimates of marginal microbiome-mediated causal effects, and model parameters are estimated using an expectation-maximization algorithm. Simulation studies demonstrate that the ZIBM yields more accurate estimation and reliable inference under conditions commonly observed in microbiome data, compared with existing approaches. An application to a real microbiome study further illustrates its practical utility. These results indicate that the proposed method provides a more flexible and robust statistical framework for mediation analysis involving compositional microbiome data.
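
A generic zero-inflated beta mixture density for a single relative abundance can be sketched as below. This only illustrates the distributional building block named in the abstract; it is not the authors' mediation estimator or their expectation-maximization procedure, and the parameter names (`pi_zero`, `weights`, `shapes`) and values are assumptions.

```python
import math


def beta_logpdf(m, a, b):
    """Log density of Beta(a, b) at 0 < m < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1.0) * math.log(m) + (b - 1.0) * math.log1p(-m)


def zibm_loglik(m, pi_zero, weights, shapes):
    """Log-likelihood of one relative abundance m in [0, 1).

    pi_zero: probability of a zero abundance (zero-inflation component)
    weights: mixture weights of the beta components (should sum to 1)
    shapes:  list of (a, b) shape parameters, one pair per component
    """
    if m == 0:
        return math.log(pi_zero)
    mixture = sum(
        w * math.exp(beta_logpdf(m, a, b)) for w, (a, b) in zip(weights, shapes)
    )
    return math.log1p(-pi_zero) + math.log(mixture)


# Assumed parameters: 40% zeros, two beta components for non-zero values.
params = dict(pi_zero=0.4, weights=[0.7, 0.3], shapes=[(1.0, 20.0), (5.0, 5.0)])
abundances = [0.0, 0.02, 0.0, 0.45]
print(sum(zibm_loglik(m, **params) for m in abundances))
```

In an estimation setting, a log-likelihood of this form would be maximized over the parameters, for example with an EM algorithm that treats component membership as latent.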

ARXIV Cancer: unknown Method: large language model

Safety and accuracy follow different scaling laws in clinical large language models

Sebastian Wind, Tri-Thien Nguyen, Jeta Sopa, Mahshad Lotfinia, Sebastian Bickelhaup, Michael Uder, Harald Köstler, Gerhard Wellein, Sven Nebelung, Daniel Truhn, Andreas Maier, Soroosh Tayebi Arasteh
Published 2026-05-05 17:57

The paper introduces SaFE-Scale, a framework for assessing the safety of clinical large language models (LLMs) as they scale. It evaluates 34 LLMs using the RadSaFE-200 benchmark, revealing that clean evidence significantly enhances accuracy and reduces high-risk errors. The findings indicate that safety in clinical LLMs is influenced by various factors beyond mere scaling, including evidence quality and retrieval strategies.

Clinical LLMs are often scaled by increasing model size, context length, retrieval complexity, or inference-time compute, with the implicit expectation that higher accuracy implies safer behavior. This assumption is incomplete in medicine, where a few confident, high-risk, or evidence-contradicting errors can matter more than average benchmark performance. We introduce SaFE-Scale, a framework for measuring how clinical LLM safety changes across model scale, evidence quality, retrieval strategy, context exposure, and inference-time compute. To instantiate this framework, we introduce RadSaFE-200, a Radiology Safety-Focused Evaluation benchmark of 200 multiple-choice questions with clinician-defined clean evidence, conflict evidence, and option-level labels for high-risk error, unsafe answer, and evidence contradiction. We evaluated 34 locally deployed LLMs across six deployment conditions: closed-book prompting (zero-shot), clean evidence, conflict evidence, standard RAG, agentic RAG, and max-context prompting. Clean evidence produced the strongest improvement, increasing mean accuracy from 73.5% to 94.1%, while reducing high-risk error from 12.0% to 2.6%, contradiction from 12.7% to 2.3%, and dangerous overconfidence from 8.0% to 1.6%. Standard RAG and agentic RAG did not reproduce this safety profile: agentic RAG improved accuracy over standard RAG and reduced contradiction, but high-risk error and dangerous overconfidence remained elevated. Max-context prompting increased latency without closing the safety gap, and additional inference-time compute produced only limited gains. Worst-case analysis showed that clinically consequential errors concentrated in a small subset of questions. Clinical LLM safety is therefore not a passive consequence of scaling, but a deployment property shaped by evidence quality, retrieval design, context construction, and collective failure behavior.

ARXIV Cancer: brain tumor Method: SegResNet

Enhanced 3D Brain Tumor Segmentation Using Assorted Precision Training

Adwaitt Pandya, Ozioma C. Oguine, Harita Bhargava, Shrikant Zade
Published 2026-05-05 17:30

This research focuses on the early identification of brain tumors, which is critical for patient survival. The study employs the SegResNet architecture for three-dimensional segmentation, utilizing an automatic multi-precision training method. The model achieved a Dice score of 0.84 for the tumor core and 0.90 for the whole tumor, indicating effective segmentation performance.

A brain tumor is a medical disorder that affects individuals of all demographics. Medically, it is described as the spread of non-essential cells close to or throughout the brain. Symptoms include headaches, seizures, and sensory changes. This research explores two main categories of brain tumors: benign and malignant. Benign tumors spread steadily, whereas malignant tumors exhibit growth that makes them dangerous. Early identification of brain tumors is a crucial factor for the survival of patients. This research provides a state-of-the-art approach to the early identification of tumors within the brain. We implemented the SegResNet architecture, a widely adopted architecture for three-dimensional segmentation, and trained it using the automatic multi-precision method. We used the Dice loss function for training and the Dice metric for evaluating the model. We obtained a Dice score of 0.84; by region, the scores were 0.84 for the tumor core, 0.90 for the whole tumor, and 0.79 for the enhanced tumor.
