IHC-driven computational pathology benchmark

An Open Benchmark for IHC-driven Computational Pathology

A comprehensive benchmarking framework for evaluating foundation models across diverse immunohistochemistry tasks in precision oncology.

Scale
26,058 WSIs
Patients
13,657 Cases
Patches
464K+
Multi-Center
13 Institutions
Pan-Cancer
27 Organs
Multi-Task
71 Tasks
Foundation Models
14
Clinical Stains
43
Proteins
14.7k+
IHC Images
9.6M+
Patients
13.6k+
Evaluation Scope · Clinical Tasks

01 IHC Staining Assessment

Evaluating spatial patterns, intensity, subcellular location, and quantity.

IHC · 3 tasks

02 Biomarker Expression

Predicting marker-specific protein signals across diverse heterogeneous tissue contexts.

IHC · 33 in-domain / 12 OOD

03 Diagnosis & Grading

Robust recognition of histologic patterns across tumor subtypes and progression states.

H&E, IHC · 6 tasks

04 Microenvironment

Fine-grained tissue composition classification and spatial context from IHC inputs.

IHC · 10 tasks

05 Progression & Prognosis

Clinical risk stratification, survival analysis and time-to-event outcomes.

H&E, IHC · 9 tasks

06 Therapeutic Response

Predicting treatment efficacy across multiple settings, including chemotherapy, targeted therapy, and immunotherapy.

H&E, IHC · 10 tasks
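
As a quick consistency check on the counts listed above, the short tally below sums the per-category task numbers. It rests on one assumption that the page does not state explicitly: that the headline figure of 71 tasks counts in-domain tasks only, with the 12 OOD biomarker tasks reported separately. The dictionary keys are illustrative labels, not identifiers from the benchmark.

```python
# Per-category task counts as listed above.
# Assumption: Biomarker Expression contributes its 33 in-domain tasks here,
# and the 12 OOD tasks are tallied separately.
in_domain_tasks = {
    "IHC Staining Assessment": 3,
    "Biomarker Expression": 33,
    "Diagnosis & Grading": 6,
    "Microenvironment": 10,
    "Progression & Prognosis": 9,
    "Therapeutic Response": 10,
}
ood_tasks = 12

total_in_domain = sum(in_domain_tasks.values())
total_with_ood = total_in_domain + ood_tasks

print(total_in_domain)  # 71
print(total_with_ood)   # 83
```

Under that reading, the six categories account for all 71 headline tasks, and the evaluation covers 83 task settings once the OOD variants are included.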
Benchmark Snapshot · Overall Model Rankings
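| Rank | Model | Overall Mean Rank (lower is better) | Wins (#1) | Top-3 (count) |
|------|-------|-------------------------------------|-----------|---------------|
| 1 | Virchow2 | 4.16 | 20 | 48 |
| 2 | UNI | 5.98 | 3 | 23 |
| 3 | GigaPath | 6.00 | 4 | 18 |
| 4 | GPFM | 6.19 | 15 | 30 |
| 5 | CONCH | 6.36 | 6 | 25 |
| 6 | MADELEINE | 6.48 | 5 | 19 |
| 7 | TITAN | 6.71 | 10 | 14 |
| 8 | CONCH v1.5 | 6.80 | 9 | 21 |
| 9 | H-optimus-0 | 7.39 | 2 | 17 |
| 10 | Phikon | 7.71 | 3 | 13 |
| 11 | Virchow | 8.28 | 2 | 3 |
| 12 | CTransPath | 8.36 | 2 | 10 |
| 13 | CHIEF | 9.48 | 0 | 6 |
| 14 | Prov-GigaPath | 11.66 | 2 | 3 |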
Figure: bubble chart of mean rank per model. Bubble size encodes Wins (#1 finishes); bubble color encodes Top-3 count.
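
The ranking above reports three summary statistics per model: overall mean rank, number of wins, and number of top-3 finishes. The sketch below shows one plausible way to derive such statistics from a table of per-task scores. It is not the benchmark's own aggregation code: the function name `rank_models`, the input layout, the higher-is-better score convention, and the average-rank handling of ties are all assumptions made for illustration, and the toy scores are invented.

```python
from collections import defaultdict


def rank_models(scores_by_task):
    """Aggregate per-task scores into mean rank, wins, and top-3 counts.

    scores_by_task: {task_name: {model_name: score}}, higher score is better
    (an assumed convention; the benchmark's metrics may differ per task).
    """
    ranks = defaultdict(list)
    for scores in scores_by_task.values():
        # Order models from best to worst on this task.
        ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        i = 0
        while i < len(ordered):
            # Find the run of models tied with the one at position i.
            j = i
            while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
                j += 1
            # Tied models share the average of the 1-based positions they span.
            avg_rank = (i + j) / 2 + 1
            for model, _ in ordered[i:j + 1]:
                ranks[model].append(avg_rank)
            i = j + 1

    summary = []
    for model, model_ranks in ranks.items():
        summary.append({
            "model": model,
            "mean_rank": sum(model_ranks) / len(model_ranks),
            # A tie for first yields rank 1.5, so it is not counted as a win.
            "wins": sum(r == 1 for r in model_ranks),
            "top3": sum(r <= 3 for r in model_ranks),
        })
    # Lower mean rank is better, so sort ascending.
    summary.sort(key=lambda row: row["mean_rank"])
    return summary


# Invented toy scores for three hypothetical models on three tasks.
toy_scores = {
    "task_a": {"A": 0.9, "B": 0.8, "C": 0.7},
    "task_b": {"A": 0.6, "B": 0.9, "C": 0.7},
    "task_c": {"A": 0.8, "B": 0.5, "C": 0.7},
}

summary = rank_models(toy_scores)
for row in summary:
    print(f'{row["model"]}: mean rank {row["mean_rank"]:.2f}, '
          f'wins {row["wins"]}, top-3 {row["top3"]}')
```

Mean rank rewards consistency across tasks, while the win and top-3 counts reward peak performance, which is why a model can have many wins yet a middling mean rank.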
Motivation · Why ImmunoBench?

A Critical Blind Spot

Current models focus on H&E and miss the IHC signals that real-world clinical decisions depend on. ImmunoBench establishes IHC-centered evaluation as a necessary dimension of model assessment.

Task-Dependent Landscape

No single model dominates. Models excel at local IHC signals, but complex clinical endpoints remain challenging and call for better integration of spatial and molecular context.

An Open Ecosystem

More than a benchmark, ImmunoBench is a durable resource offering curated datasets, standardized protocols, and a dynamic leaderboard to drive future model development.

Key Findings · Benchmark Insights

Performance Boundaries

Models excel at spatial patterns but struggle with sparse signals, outcome prediction, and cross-center generalization.

Architectural Impact

Patch-level models dominate local recognition. IHC-aware and multimodal pretraining provide selective, not universal, benefits.

Multi-Stain Integration

Simply combining H&E and IHC is insufficient. Ensembles reveal that current models encode complementary but incomplete information.