About · Working paper v0.4

A reporting layer for AI evaluation.

Eval Cards is a structured registry of how AI models are evaluated — and, just as importantly, of what is left undocumented. It composes existing evaluation infrastructure into a single audience-agnostic reading surface. It is a research artifact of the EvalEval Coalition, a community of academic and industrial labs working on broader-impact evaluation of AI systems.

Benchmark scores are routinely reported without the context required to interpret them: prompts, decoding parameters, evaluator identity, reproduction artifacts, scope of validity. Eval Cards treats every published evaluation as a claim, and every absent field as a claim not made. Neither is an error — the distinction is what makes the public record useful.

The card format is audience-agnostic. A researcher and a policy analyst look at different fields on the same record. Reader modes (Research · Policy) surface the fields most load-bearing for each audience; the underlying data is shared.

What it is built on

Auto-BenchmarkCards

A schema for benchmark-level metadata — what a benchmark measures, its splits, intended use, validity scope, and known limitations. Each benchmark family in this registry has an Auto-BenchmarkCard at the family root and a Policy Note compressed for plain-language reading.
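As a rough illustration of the kind of benchmark-level metadata an Auto-BenchmarkCard carries, the sketch below models the fields named above as a Python dataclass. The field names and types are assumptions for illustration, not the published schema.

```python
from dataclasses import dataclass, field

@dataclass
class BenchmarkCard:
    """Illustrative benchmark-level metadata record (hypothetical field names)."""
    family: str                    # benchmark family this card sits at the root of
    measures: str                  # what the benchmark claims to measure
    splits: list[str] = field(default_factory=list)   # evaluation splits
    intended_use: str = ""         # scope the benchmark was designed for
    validity_scope: str = ""       # where its scores can be interpreted
    known_limitations: list[str] = field(default_factory=list)
    policy_note: str = ""          # plain-language compression for policy readers
```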

Every Eval Ever

A run-level corpus of public evaluation results — (model, benchmark, metric-path, value, source) tuples extracted from papers, model cards and leaderboards. Provides the raw rows the registry canonicalises and joins.
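A minimal sketch of what one such run-level row might look like, assuming a simple named-tuple shape; the field names mirror the tuple above but are not the corpus's actual export format, and the example values are placeholders, not real results.

```python
from typing import NamedTuple

class EvalRow(NamedTuple):
    """One (model, benchmark, metric-path, value, source) row, pre-canonicalisation."""
    model: str        # model name as it appears in the source document
    benchmark: str    # benchmark name/version as reported
    metric_path: str  # e.g. "family/composite/benchmark/slice/metric"
    value: float      # reported score
    source: str       # paper, model card, or leaderboard the number came from

# Hypothetical example row; identifiers are placeholders.
row = EvalRow("example-model-7b", "example-benchmark-v2",
              "example-family/example-composite/example-benchmark-v2/overall/accuracy",
              0.712, "https://example.org/technical-report")
```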

IBM Risk Atlas alignment

Risk-domain annotations on benchmarks (capability, robustness, safety, agentic risk, fairness) so policy readers can locate which deployment-relevant property a number speaks to.

Five-level hierarchy

Family → Composite → Single benchmark → Slice → Metric. Every score resolves to an explicit path, so aggregate claims drill down to the evidence supporting them.
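One way to make that resolution concrete, as a sketch: treat a metric path as five slash-separated segments, one per level. The separator and level names here are assumptions for illustration; the registry's actual encoding may differ.

```python
LEVELS = ("family", "composite", "benchmark", "slice", "metric")

def parse_metric_path(path: str) -> dict[str, str]:
    """Split a 'family/composite/benchmark/slice/metric' path into named levels."""
    parts = path.split("/")
    if len(parts) != len(LEVELS):
        raise ValueError(f"expected {len(LEVELS)} segments, got {len(parts)}: {path!r}")
    return dict(zip(LEVELS, parts))

# An aggregate claim at the composite level drills down by selecting every
# record whose path shares the same "family/composite/" prefix.
```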

Two reader modes, one record

The same evaluation record renders differently depending on the question the reader brings to it. Toggle in the topbar; the URL, the data and the citations are unchanged.

Research

Methodology read

Setup variants, n-shot, decoding parameters, evaluator identity, confidence intervals, and the specific schema fields missing for reproduction are foregrounded on every metric row.

Policy

Plain-language read

Policy Notes (measures · caveat · intended for), risk-domain annotations, first/third-party evaluator tags, and disclosure-gap flags are foregrounded; metric configuration is compressed.
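A hedged sketch of the mode toggle as pure field selection, assuming hypothetical field names: the record never changes, only which of its fields are brought to the front.

```python
# Hypothetical field names; illustrative only, not the registry's schema keys.
FOREGROUNDED = {
    "research": ["setup_variant", "n_shot", "decoding_params",
                 "evaluator_identity", "confidence_interval", "missing_fields"],
    "policy":   ["policy_note", "risk_domain", "evaluator_party",
                 "disclosure_gaps"],
}

def render(record: dict, mode: str) -> dict:
    """Return the foregrounded view of a record; the underlying data is untouched."""
    return {k: record.get(k) for k in FOREGROUNDED[mode]}
```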

Four interpretive signals

Computed over each (model, benchmark, metric-path) record and aggregated to the corpus. Per-record instances appear on every model and benchmark page; corpus rollups appear on the home page. A minimal sketch of one per-record computation follows the list.

  1. Reproducibility

    Can a third party run this evaluation and obtain a comparable number? Tracks setup-variant disclosure, prompt and decoding parameters, harness version, seed, and code/artifact availability.

  2. Completeness

    Does the record meet the standard report card for this class of model? Tracks coverage across capability, robustness, safety and fairness benchmarks expected for the model's claimed use.

  3. Provenance & risk

    Who produced this number, and which deployment-relevant property does it speak to? Tracks evaluator identity (first-party / third-party), source citation, and IBM Risk Atlas-aligned risk domain.

  4. Comparability

    Can two scores under the same benchmark be put side-by-side? Tracks slice, metric variant, and unit harmonisation; flags rows that cannot be ranked together.
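As a minimal sketch, one of these signals (reproducibility) could be computed per record as the fraction of reproduction-relevant fields that are disclosed. The field names and the scoring rule are assumptions for illustration; the registry's actual definition may weight or gate fields differently.

```python
REPRO_FIELDS = ("setup_variant", "prompt", "decoding_params",
                "harness_version", "seed", "code_artifact")

def reproducibility(record: dict) -> float:
    """Fraction of reproduction-relevant fields that are disclosed (non-empty)."""
    disclosed = sum(1 for f in REPRO_FIELDS if record.get(f) not in (None, ""))
    return disclosed / len(REPRO_FIELDS)

# Corpus rollup: the mean of this value over all (model, benchmark, metric-path)
# records in a snapshot, with missing fields counted as missing, never imputed.
```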

Methodology

  1. M.1 · Canonicalisation

    Heterogeneous score reports — papers, model cards, leaderboards, blog posts — are normalised to (model, benchmark, slice, metric, value, source) tuples. Model name aliases and benchmark version aliases are resolved against a curated mapping (a minimal sketch of this step follows the list).

  2. M.2 · Source attribution

    Each record cites the document of record with a line reference. Where multiple sources report the same configuration, the developer's primary source is preferred and discrepancies are flagged.

  3. M.3 · Evaluator identity

    Two categories only: first-party (the model developer) and third-party (an independent evaluator). The two are tagged distinctly and never silently merged; if both have reported on a (model, benchmark) pair, both rows appear separately.

  4. M.4 · No imputation

    Empty cells are empty. The registry never estimates, infers, or cross-fills missing values. Disclosure gaps are surfaced as such.

  5. M.5 · Snapshot discipline

    Each release is a dated snapshot. Numbers are not back-edited; corrections add a new version with provenance preserved.
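To make M.1 through M.4 concrete, here is a hedged sketch of canonicalising one raw score report. The alias tables and field names are placeholders for illustration; the curated mappings and the actual row schema live upstream, not in this snippet.

```python
# Placeholder alias tables; the real curated mappings are maintained upstream.
MODEL_ALIASES = {"Example Model v1.0": "example-model-7b"}
BENCHMARK_ALIASES = {"ExampleBench (v2)": "example-benchmark-v2"}

def canonicalise(raw: dict) -> dict:
    """Normalise one heterogeneous score report into a registry row (M.1)."""
    return {
        "model": MODEL_ALIASES.get(raw["model"], raw["model"]),
        "benchmark": BENCHMARK_ALIASES.get(raw["benchmark"], raw["benchmark"]),
        "slice": raw.get("slice"),          # stays None if undisclosed (M.4: no imputation)
        "metric": raw["metric"],
        "value": raw["value"],
        "source": raw["source"],            # document of record with line reference (M.2)
        "evaluator": raw.get("evaluator"),  # "first-party" or "third-party", never merged (M.3)
    }
```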

Principles

  1. We do not impute.

    If a developer did not publish a score, the cell is empty. We do not estimate, infer, or cross-fill.

  2. Every number cites its source.

    Each reported score resolves to a specific document — paper, model card, blog post — with a line reference.

  3. Evaluator identity matters.

    First-party and third-party results are visually distinct and never silently merged. When both have reported on the same (model, benchmark) pair, both rows are kept side by side.

  4. Gaps are data.

    Undisclosed fields appear alongside disclosed ones. Silence about a safety benchmark is itself information.

  5. Aggregates resolve to evidence.

    Every corpus-level claim drills down to the (model, benchmark, metric-path) records that support it. No black-box scores.

  6. Corrections are welcome.

    Each record links to a correction path. The registry is a living artifact; coverage improves as developers publish.

What this registry does not do

  • Produce a single capability ranking. Metrics across benchmarks are heterogeneous and not commensurable; rolling them into one score throws away the information that makes evaluation useful.
  • Evaluate models. Eval Cards reports on what others have already evaluated. New runs go through the upstream Every Eval Ever pipeline, not this surface.
  • Endorse a benchmark. Inclusion in the registry is a statement about disclosure prevalence, not benchmark quality. Policy Notes describe limitations; reading them is part of using the registry.
  • Replace model cards or system cards. Eval Cards complements them — it is the cross-model, cross-benchmark reading surface that individual cards alone cannot provide.

Citation & corrections

Cite as

EvalEval Coalition. (2026). Eval Cards: a reporting layer for AI evaluation (Working paper v0.4, snapshot 18 Apr 2026). evalcards.evalevalai.com

Submit a correction

Each record links to a correction path. Disclosure gaps close as developers and third parties publish; we accept patches against any (model, benchmark, metric-path) tuple with a citation.

corrections@evalevalai.com