A reporting layer over evaluation infrastructure.

Evaluation Cards is a collection of reported model–benchmark results, organised under a five-level rollout hierarchy and annotated with four interpretive signals computed over the joined record.

Corpus snapshot · June 17, 2026
Models · 5,794 · Tracked across reporting sources
Reported results · 102,530 · (model, benchmark, metric) triples
Reporting organizations · 30 · Distinct evaluator initiatives in this corpus
Model developers · 820 · Distinct model-publishing organizations
Benchmark families · 58 · Top of the rollout hierarchy
Single benchmarks · 633 · 699 slices, 753 metrics

Interpretive signals

Four signals are computed over each (model, benchmark, metric-path) record and aggregated to the corpus level. Per-record instances appear on every model and benchmark page.

Benchmark families

Five-level rollout hierarchy

Every score resolves to an explicit path through this hierarchy, so aggregate claims drill down to the evidence supporting them.
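
As an illustration of how a path-keyed record can support drill-down, the following is a minimal sketch and not the Evaluation Cards implementation. The `Result` class, the `drill_down` and `aggregate` helpers, the placeholder level names, the use of a mean as the aggregate, and all sample scores are assumptions made for the example.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Result:
    """One reported score, keyed by its path through the hierarchy (hypothetical shape)."""
    model: str
    path: tuple[str, ...]  # one entry per hierarchy level, top level first
    score: float


def drill_down(results, prefix):
    """Return every result whose path starts with the given prefix."""
    prefix = tuple(prefix)
    return [r for r in results if r.path[: len(prefix)] == prefix]


def aggregate(results, prefix):
    """Return (mean score, supporting records) for a prefix; (None, []) if nothing matches.

    Averaging is an assumption for this sketch; the source does not say how
    scores are aggregated.
    """
    evidence = drill_down(results, prefix)
    if not evidence:
        return None, []
    return mean(r.score for r in evidence), evidence


# Made-up sample data. The five path entries are placeholders, not the
# hierarchy's real level names.
results = [
    Result("model-a", ("family-x", "group-1", "bench-1", "slice-1", "accuracy"), 0.5),
    Result("model-a", ("family-x", "group-1", "bench-2", "slice-1", "accuracy"), 1.0),
    Result("model-b", ("family-y", "group-2", "bench-3", "slice-1", "accuracy"), 0.25),
]

family_score, family_evidence = aggregate(results, ("family-x",))
for record in family_evidence:
    print(record.model, "/".join(record.path), record.score)
```

Because each aggregate is returned together with the records under its prefix, a family-level number can always be traced back to the individual scores behind it.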