Signals v1.0

A reporting layer over evaluation infrastructure.

Eval Cards is a registry of reported model–benchmark results, organised under a five-level rollout hierarchy and four interpretive signals computed over the joined record.


Corpus snapshot · May 5, 2026

5,498

Models

Tracked across reporting sources

101,843

Reported results

(model, benchmark, metric) triples

Reporting organizations

Distinct evaluator initiatives in this corpus

824

Model developers

Distinct model-publishing organizations

Benchmark families

Top of the rollout hierarchy

635

Single benchmarks

733 slices · 761 metrics

Five-level rollout hierarchy

Family

SWE-bench family, MMLU family

Composite

Open LLM Leaderboard v2, HELM Instruct

Single benchmark

635

GSM8K, IFEval, MMLU-Pro

Slice

733

algebra (within MATH), level-5, multi-turn

Metric

761

pass@1, accuracy, F1

Every score resolves to an explicit path through this hierarchy, so aggregate claims drill down to the evidence supporting them.
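One way to picture such a path is as a small immutable record with one field per level. The sketch below is illustrative only: the class name `MetricPath`, its field names, and the choice to let the composite and slice levels be empty are assumptions, not the registry's actual schema, and placing MMLU-Pro under the MMLU family is assumed for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MetricPath:
    """Hypothetical path through the five hierarchy levels."""
    family: str
    composite: Optional[str]   # assumption: a benchmark may sit outside any composite
    benchmark: str
    slice_name: Optional[str]  # assumption: None means the whole benchmark
    metric: str

    def levels(self) -> list[str]:
        """Return the populated levels, ordered from family down to metric."""
        parts = (self.family, self.composite, self.benchmark,
                 self.slice_name, self.metric)
        return [p for p in parts if p is not None]

    def __str__(self) -> str:
        return " / ".join(self.levels())


# Illustrative values only; the family assignment is assumed.
path = MetricPath(
    family="MMLU family",
    composite=None,
    benchmark="MMLU-Pro",
    slice_name=None,
    metric="accuracy",
)
print(path)  # MMLU family / MMLU-Pro / accuracy
```

Because the record is frozen and hashable, it can serve as a dictionary key when scores are grouped for drill-down.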

Interpretive signals

Four signals computed over each (model, benchmark, metric-path) record and aggregated to the corpus level. Per-record instances appear on every model and benchmark page.

Reproducibility

of reported scores have a complete setup recorded — the rest cannot be independently re-run.

97% have at least one undocumented field. Most often missing: max tokens (96%), temperature (94%).

Asks: Can someone else run this evaluation and get the same number?
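A check of this kind could be sketched as follows. The list of required fields is hypothetical: only max tokens and temperature are named on this page, and `prompt_template` is added purely for illustration. The function names and the dictionary layout are likewise assumptions, not the registry's implementation.

```python
# Hypothetical set of setup fields that must be recorded for a re-run.
REQUIRED_FIELDS = ("temperature", "max_tokens", "prompt_template")


def missing_fields(setup: dict) -> list[str]:
    """Return the required fields that are absent or None in one setup."""
    # Compare against None so a legitimate temperature of 0.0 counts as documented.
    return [f for f in REQUIRED_FIELDS if setup.get(f) is None]


def reproducibility_summary(setups: list[dict]) -> dict:
    """Share of complete setups, plus the missing rate for each field."""
    total = len(setups)
    if total == 0:
        return {"complete_share": 0.0, "missing_share": {}}
    complete = sum(1 for s in setups if not missing_fields(s))
    missing_share = {
        f: sum(1 for s in setups if s.get(f) is None) / total
        for f in REQUIRED_FIELDS
    }
    return {"complete_share": complete / total, "missing_share": missing_share}


# Made-up records for illustration; not drawn from the corpus.
setups = [
    {"temperature": 0.0, "max_tokens": 1024, "prompt_template": "zero-shot"},
    {"temperature": 0.7, "prompt_template": "5-shot"},
    {"prompt_template": "zero-shot"},
    {},
]
summary = reproducibility_summary(setups)
print(summary["complete_share"])               # 0.25
print(summary["missing_share"]["max_tokens"])  # 0.75
```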

Completeness

72%

mean across 50,461 reported score triples.

Observed range: 11% to 93%.

Asks: Is the benchmark itself documented well enough to interpret a score on it?

Provenance

of reported score triples have reports from more than one party.

95% third-party, 5% first-party of 50,461 unique triples.
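The two provenance figures could be derived roughly as in the sketch below. It rests on assumptions the page does not confirm: each report is a `(model, benchmark, metric, reporter, developer)` tuple, a triple counts as third-party when at least one of its reports comes from someone other than the model's developer, and all names are hypothetical.

```python
from collections import defaultdict


def provenance_summary(reports) -> dict:
    """Summarise who reported each (model, benchmark, metric) triple.

    reports: iterable of (model, benchmark, metric, reporter, developer) tuples.
    """
    reporters_by_triple = defaultdict(set)
    third_party_triples = set()
    for model, benchmark, metric, reporter, developer in reports:
        triple = (model, benchmark, metric)
        reporters_by_triple[triple].add(reporter)
        # Assumption: a single non-developer report makes the triple third-party.
        if reporter != developer:
            third_party_triples.add(triple)

    total = len(reporters_by_triple)
    if total == 0:
        return {"multi_party_share": 0.0, "third_party_share": 0.0}
    multi = sum(1 for r in reporters_by_triple.values() if len(r) > 1)
    return {
        "multi_party_share": multi / total,
        "third_party_share": len(third_party_triples) / total,
    }


# Made-up reports for illustration; not drawn from the corpus.
reports = [
    ("model-a", "GSM8K", "accuracy", "lab-x", "lab-x"),
    ("model-a", "GSM8K", "accuracy", "eval-org", "lab-x"),
    ("model-a", "IFEval", "accuracy", "eval-org", "lab-x"),
    ("model-b", "GSM8K", "accuracy", "lab-y", "lab-y"),
    ("model-b", "MMLU-Pro", "accuracy", "eval-org", "lab-y"),
]
result = provenance_summary(reports)
print(result["multi_party_share"])  # 0.25
print(result["third_party_share"])  # 0.75
```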