Signals v1.0

A reporting layer over evaluation infrastructure.

Eval Cards is a registry of reported model–benchmark results. Results are organised under a five-level rollout hierarchy, and four interpretive signals are computed over the joined record.

Corpus snapshot · May 5, 2026
Models: 5,498 · Tracked across reporting sources
Reported results: 101,843 · (model, benchmark, metric) triples
Reporting organizations: 30 · Distinct evaluator initiatives in this corpus
Model developers: 824 · Distinct model-publishing organizations
Benchmark families: 62 · Top of the rollout hierarchy
Single benchmarks: 635 · 733 slices · 761 metrics

Five-level rollout hierarchy

Level  Name              Count  Examples
01     Family            62     SWE-bench family, MMLU family
02     Composite         10     Open LLM Leaderboard v2, HELM Instruct
03     Single benchmark  635    GSM8K, IFEval, MMLU-Pro
04     Slice             733    algebra (within MATH), level-5, multi-turn
05     Metric            761    pass@1, accuracy, F1

Every score resolves to an explicit path through this hierarchy, so aggregate claims drill down to the evidence supporting them.
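
To make the idea of an explicit path concrete, here is a minimal sketch of how one score's position in the hierarchy might be represented. The class name `MetricPath`, its field names, and the choice to treat the composite and slice levels as optional are all assumptions for illustration, not the registry's actual schema. Placing MMLU-Pro under the MMLU family is likewise only an illustrative pairing.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MetricPath:
    """Hypothetical record of one path through the five-level hierarchy."""

    family: str                  # 01 Family
    composite: Optional[str]     # 02 Composite (assumed optional)
    benchmark: str               # 03 Single benchmark
    slice_name: Optional[str]    # 04 Slice (assumed optional)
    metric: str                  # 05 Metric

    def levels(self) -> list[str]:
        """Return the levels that are present, ordered top to bottom."""
        parts = [self.family, self.composite, self.benchmark,
                 self.slice_name, self.metric]
        return [p for p in parts if p is not None]

    def __str__(self) -> str:
        return " / ".join(self.levels())


# Illustrative path: no composite and no slice at this score.
path = MetricPath(
    family="MMLU family",
    composite=None,
    benchmark="MMLU-Pro",
    slice_name=None,
    metric="accuracy",
)
print(path)
```

Under this sketch, an aggregate claim at the family level can be traced downward by filtering records on successive prefixes of `levels()`.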

Interpretive signals

Four signals are computed over each (model, benchmark, metric-path) record and aggregated to the corpus level. Per-record instances appear on every model and benchmark page.

Reproducibility

3% of reported scores have a complete setup recorded; the rest cannot be independently re-run.

97% have at least one undocumented field. Most often missing: max tokens (96%), temperature (94%).

Asks: Can someone else run this evaluation and get the same number?
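
One way the reproducibility figures above could be derived is sketched below: a record counts as complete only if every required setup field is present, and a per-field missing rate is tallied alongside. The list of required fields is an assumption; only max tokens and temperature are named in the text, and `prompt_template` is a hypothetical third field added for illustration. The sample records are made up.

```python
# Assumed set of required setup fields; only the first two are named in the text.
REQUIRED_SETUP_FIELDS = ("max_tokens", "temperature", "prompt_template")


def missing_fields(setup: dict) -> list[str]:
    """Return the required fields that are absent or explicitly None."""
    return [f for f in REQUIRED_SETUP_FIELDS if setup.get(f) is None]


def reproducibility_summary(setups: list[dict]) -> tuple[float, dict[str, float]]:
    """Return (share of complete setups, per-field missing rate)."""
    n = len(setups)
    if n == 0:
        return 0.0, {}
    complete = sum(1 for s in setups if not missing_fields(s))
    missing_rate = {
        f: sum(1 for s in setups if s.get(f) is None) / n
        for f in REQUIRED_SETUP_FIELDS
    }
    return complete / n, missing_rate


# Made-up records; a temperature of 0.0 is a documented value, not a missing one.
records = [
    {"max_tokens": 512, "temperature": 0.0, "prompt_template": "zero-shot"},
    {"max_tokens": None, "temperature": 0.0, "prompt_template": "zero-shot"},
    {"temperature": None, "prompt_template": "5-shot"},
    {"max_tokens": None, "temperature": None, "prompt_template": None},
]

share_complete, missing = reproducibility_summary(records)
print(f"complete: {share_complete:.0%}")
for field, rate in missing.items():
    print(f"missing {field}: {rate:.0%}")
```

Checking `is None` rather than truthiness matters here, because a legitimately recorded value such as a temperature of 0.0 would otherwise be counted as undocumented.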

Completeness

72% mean across 50,461 reported score triples.

Observed range: 11% to 93%.

Asks: Is the benchmark itself documented well enough to interpret a score on it?

Provenance

4% of reported score triples have reports from more than one party.

Of 50,461 unique triples, 95% are third-party and 5% are first-party.

Asks: Who reported this score, and have others reproduced it?
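
The multi-party share above can be sketched as a grouping step: collect the distinct reporting parties per (model, benchmark, metric) triple, then count the triples with more than one. The function name, the tuple layout of a report, and the sample data are assumptions for illustration only.

```python
from collections import defaultdict


def multi_party_share(reports) -> float:
    """Share of unique (model, benchmark, metric) triples reported by 2+ parties.

    `reports` is an iterable of (model, benchmark, metric, reporter) tuples;
    this layout is a hypothetical one chosen for the sketch.
    """
    parties_by_triple = defaultdict(set)
    for model, benchmark, metric, reporter in reports:
        parties_by_triple[(model, benchmark, metric)].add(reporter)
    if not parties_by_triple:
        return 0.0
    multi = sum(1 for parties in parties_by_triple.values() if len(parties) > 1)
    return multi / len(parties_by_triple)


# Made-up reports: four unique triples, one of them covered by two parties.
reports = [
    ("model-a", "GSM8K", "accuracy", "org-1"),
    ("model-a", "GSM8K", "accuracy", "org-2"),
    ("model-a", "IFEval", "accuracy", "org-1"),
    ("model-a", "IFEval", "accuracy", "org-1"),  # repeat from the same party
    ("model-b", "GSM8K", "accuracy", "org-2"),
    ("model-b", "MMLU-Pro", "accuracy", "org-3"),
]

share = multi_party_share(reports)
print(f"multi-party triples: {share:.0%}")
```

Using a set per triple means a repeated report from the same party does not count as independent confirmation.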

Comparability

73% of setup-eligible groups diverge across variants (266 of 365).

Cross-party divergence: 55%.

Asks: Are scores on the same benchmark actually measuring the same thing?
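
The sketch below shows one possible reading of the divergence rate, and checks the arithmetic behind the headline figure. The text does not define what makes a group setup-eligible or when variants diverge, so both rules here are assumptions: a group is treated as eligible when it has at least two variant scores, and as divergent when the spread between its highest and lowest score exceeds a hypothetical `tolerance`.

```python
def diverges(scores: list[float], tolerance: float = 0.01) -> bool:
    """Assumed rule: a group diverges if its score spread exceeds `tolerance`."""
    return len(scores) >= 2 and (max(scores) - min(scores)) > tolerance


def divergence_rate(groups: dict, tolerance: float = 0.01) -> float:
    """Share of eligible groups (2+ variant scores, an assumption) that diverge."""
    eligible = [s for s in groups.values() if len(s) >= 2]
    if not eligible:
        return 0.0
    return sum(diverges(s, tolerance) for s in eligible) / len(eligible)


# Made-up groups keyed by (model, benchmark, metric); values are variant scores.
groups = {
    ("model-a", "GSM8K", "accuracy"): [0.81, 0.86],           # diverges
    ("model-a", "IFEval", "accuracy"): [0.70, 0.70],          # agrees
    ("model-b", "GSM8K", "accuracy"): [0.55],                 # not eligible
    ("model-b", "MMLU-Pro", "accuracy"): [0.42, 0.47, 0.44],  # diverges
}

rate = divergence_rate(groups)
print(f"toy divergence rate: {rate:.0%}")

# The headline figure is the reported ratio rounded to a whole percent.
corpus_pct = round(100 * 266 / 365)
print(f"266 of 365 = {corpus_pct}%")
```

The choice of tolerance strongly affects the result, so any real threshold would need to be stated next to the reported rate.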

Benchmark families
