A reporting layer over evaluation infrastructure.
Eval Cards is a registry of reported model–benchmark results, organised under a five-level rollout hierarchy and four interpretive signals computed over the joined record.
Every score resolves to an explicit path through this hierarchy, so every aggregate claim can be drilled down to the evidence that supports it.
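As a minimal sketch of what "every score resolves to an explicit path" could look like in code (the field names, path format, and example values here are illustrative assumptions, not the registry's actual schema):

```python
# Illustrative sketch: a reported score record whose metric path
# resolves through five hierarchy levels. Field names, the path
# delimiter, and the example values are assumptions for illustration.
from dataclasses import dataclass

@dataclass(frozen=True)
class ReportedScore:
    model: str
    benchmark: str
    metric_path: str   # e.g. "family/benchmark/split/metric/variant"
    value: float
    reporter: str      # "first-party" or "third-party"

def resolve_path(score: ReportedScore) -> list[str]:
    """Split the metric path into its five hierarchy levels."""
    levels = score.metric_path.split("/")
    assert len(levels) == 5, "every score resolves through five levels"
    return levels

s = ReportedScore(
    model="model-x",
    benchmark="ARC-AGI",
    metric_path="ARC-AGI/v2/public/accuracy/default",
    value=0.42,
    reporter="third-party",
)
```

Here `resolve_path(s)` yields the five-level chain, so an aggregate number over many such records can always be traced back to the individual paths it was computed from.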
Interpretive signals
Four signals computed over each (model, benchmark, metric-path) record and aggregated to the corpus level. Per-record instances appear on every model and benchmark page.
3% of reported scores have a complete setup recorded; the rest cannot be independently re-run. 97% have at least one undocumented field. The fields most often missing are max tokens (96%) and temperature (94%).
Mean across 50,461 reported score triples; observed range: 11% to 93%.
of reported score triples have reports from more than one party. Of the 50,461 unique triples, 95% of reports are third-party and 5% first-party.
73% of setup-eligible groups (266 of 365) diverge across variants. Cross-party divergence: 55%.
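A minimal sketch of how two of these corpus-level signals might be computed from reported score triples. The grouping key, the divergence tolerance, and the toy records are assumptions for illustration, not the site's exact methodology:

```python
# Sketch of two corpus-level signals over (model, benchmark,
# metric-path) triples. The tolerance and records are illustrative
# assumptions, not the registry's actual methodology.
from collections import defaultdict

# toy records: (model, benchmark, metric_path, party, value)
records = [
    ("m1", "b1", "p1", "first-party", 0.80),
    ("m1", "b1", "p1", "third-party", 0.71),
    ("m2", "b1", "p1", "third-party", 0.65),
]

# group reports by their (model, benchmark, metric-path) triple
groups = defaultdict(list)
for model, bench, path, party, value in records:
    groups[(model, bench, path)].append((party, value))

def multi_party_share(groups):
    """Share of triples with reports from more than one distinct party."""
    multi = sum(1 for g in groups.values() if len({p for p, _ in g}) > 1)
    return multi / len(groups)

def divergent_share(groups, tol=0.05):
    """Share of multiply-reported triples whose values spread beyond tol."""
    eligible = [g for g in groups.values() if len(g) > 1]
    diverged = [g for g in eligible
                if max(v for _, v in g) - min(v for _, v in g) > tol]
    return len(diverged) / len(eligible)
```

In this toy corpus one of the two triples has both a first-party and a third-party report, and that pair's values spread by more than the 0.05 tolerance, so it counts as divergent.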
Benchmark families (62 total)

Mercor ACE: 1 reported benchmark
AgentHarm: 1 reported benchmark
Mercor APEX-Agents: 2 reported benchmarks
Mercor APEX-v1: 1 reported benchmark
ARC-AGI: 6 reported benchmarks
Artificial Analysis: 15 reported benchmarks