A reporting layer over evaluation infrastructure.
Evaluation Cards is a collection of reported model–benchmark results, organised under a five-level rollout hierarchy and annotated with four interpretive signals computed over the joined record.
Interpretive signals
Four signals are computed over each (model, benchmark, metric-path) record and aggregated to the corpus level. Per-record instances appear on every model and benchmark page.
of reported scores have a complete setup recorded. The rest cannot be independently re-run.
97% have at least one undocumented field. Most often missing: max tokens (96%), temperature (94%).
mean across 52,672 reported score triples.
Observed range: 7% to 93%.
of reported score triples have reports from more than one party.
95% third-party, 5% first-party of 52,672 unique triples.
of setup-eligible groups diverge across variants (212 of 318).
Cross-party divergence: 53%.
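As a rough illustration of how corpus-level figures like those above could be derived from per-record data, the sketch below computes a completeness share, per-field missing rates, and a divergence share. It is not the site's actual pipeline: the record layout, the `SETUP_FIELDS` list (only max tokens and temperature are named in the text; `prompt_template` is a placeholder), and both function names are assumptions, and the sample records are made up. Only the 212 and 318 in the last line come from the page.

```python
from collections import Counter

# Hypothetical setup fields. Only max_tokens and temperature are named
# in the text; prompt_template is a placeholder for illustration.
SETUP_FIELDS = ("max_tokens", "temperature", "prompt_template")


def completeness(records):
    """Return (share of records with every setup field recorded,
    share of records missing each field)."""
    missing = Counter()
    complete = 0
    for rec in records:
        # A field counts as undocumented if it is absent or None.
        absent = [f for f in SETUP_FIELDS if rec.get(f) is None]
        missing.update(absent)
        if not absent:
            complete += 1
    n = len(records)
    if n == 0:
        return 0.0, {}
    return complete / n, {f: missing[f] / n for f in SETUP_FIELDS}


def divergence_share(diverging, eligible):
    """Share of setup-eligible groups that diverge across variants."""
    return diverging / eligible if eligible else 0.0


# Made-up sample records, not real corpus data.
records = [
    {"max_tokens": 1024, "temperature": 0.0, "prompt_template": "v1"},
    {"max_tokens": None, "temperature": None, "prompt_template": "v1"},
    {"max_tokens": None, "temperature": 0.7, "prompt_template": None},
    {"temperature": 0.0},
]

share, by_field = completeness(records)
print(share)     # 0.25
print(by_field)  # {'max_tokens': 0.75, 'temperature': 0.25, 'prompt_template': 0.5}

# Counts taken from the page: 212 of 318 setup-eligible groups diverge.
print(round(divergence_share(212, 318), 3))  # 0.667
```

Treating an absent key and an explicit `None` the same way is a design choice of this sketch; a real pipeline might distinguish "not reported" from "reported as default".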
Benchmark families
All 58 →
Mercor ACE
1 reported benchmark across this family.
AgentHarm
1 reported benchmark across this family.
Mercor APEX
2 reported benchmarks across this family.
ARC-AGI family
3 reported benchmarks across this family.
Artificial Analysis
16 reported benchmarks across this family.
BFCL
1 reported benchmark across this family.
Every score resolves to an explicit path through this hierarchy, so aggregate claims drill down to the evidence supporting them.