About

A reporting layer for AI evaluations.

Evaluation Cards is a structured record of how AI models are evaluated and, just as importantly, of what is left undocumented. It composes existing evaluation infrastructure into a single audience-agnostic reading surface. It is a research artifact of the EvalEval Coalition, a cross-sector, interdisciplinary community of 500+ individuals working on broader-impact evaluation of AI systems.

Benchmark scores are routinely reported without the context required to interpret them: prompts, decoding parameters, evaluator identity, reproduction artifacts, scope of validity. Evaluation Cards treats every published evaluation as a claim, and every absent field as a claim not made. Neither is an error — the distinction is what makes the public record useful.

Principles

  1. We do not impute.

    If a developer did not publish a score, the cell is empty. We do not estimate, infer, or cross-fill.

  2. Every number cites its source.

    Each reported score resolves to a specific document (paper, model card, or blog post) with a line reference.

  3. Evaluator identity matters.

    First-party and third-party results are visually distinct and never silently merged. When both have reported on the same (model, benchmark) pair, both rows are kept side by side.

  4. Gaps are data.

    Undisclosed fields appear alongside disclosed ones. Silence about a safety benchmark is itself information.

  5. Aggregates resolve to evidence.

    Every corpus-level claim drills down to the (model, benchmark, metric-path) records that support it. No black-box scores.

  6. Corrections are welcome.

    Each record links a correction path. Evaluation Cards is a living artifact; coverage improves as developers publish.
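
To make the first five principles concrete, here is a minimal Python sketch of how such records could be represented. It is an illustration only, not the actual Evaluation Cards schema: the class name, field names, helper functions, and the sample values are all hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ScoreRecord:
    """One (model, benchmark, metric-path) cell. All names here are hypothetical."""
    model: str
    benchmark: str
    metric_path: str
    evaluator: str          # "first_party" or "third_party"
    score: Optional[float]  # None means "not published"; it is never estimated
    source: Optional[str]   # document plus line reference; None when nothing was published


def rows_for(records, model, benchmark):
    """Return every row for a (model, benchmark) pair without merging evaluators."""
    return [r for r in records if r.model == model and r.benchmark == benchmark]


def disclosure_rate(records, benchmark):
    """Aggregate that stays traceable: the rate plus the records behind it."""
    rows = [r for r in records if r.benchmark == benchmark]
    disclosed = [r for r in rows if r.score is not None]
    gaps = [r for r in rows if r.score is None]
    rate = len(disclosed) / len(rows) if rows else None
    return rate, disclosed, gaps


# Made-up placeholder data, not real results.
records = [
    ScoreRecord("model-a", "bench-x", "accuracy", "first_party", 0.71, "model card, line 12"),
    ScoreRecord("model-a", "bench-x", "accuracy", "third_party", 0.66, "paper, line 240"),
    ScoreRecord("model-b", "bench-x", "accuracy", "first_party", None, None),
]

side_by_side = rows_for(records, "model-a", "bench-x")   # both evaluators kept
rate, disclosed, gaps = disclosure_rate(records, "bench-x")
```

In this sketch the missing score for `model-b` stays `None` rather than being filled in, and the aggregate returns its supporting records alongside the number, so a reader can always drill down from the rate to the rows.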

How to contribute

Evaluation Cards is a living, community artifact — its coverage and usefulness grow as people report, upload, use, and cite it. Here's what helps most, depending on who you are.

Model developers
  • Report your model's results to Every Eval Ever so they show up here in context.
  • Already on EEE? Cross-post them to Hugging Face so your scores appear on the model page with a backlink.
  • Document the run-level details that raise your signals — temperature and max tokens, the harness, and (for agentic evaluations) the eval plan and limits.
  • See a wrong or missing number for your model? Flag it in the Space discussions or via each record's correction path.
Evaluation developers
  • Upload your benchmark's results to Every Eval Ever so others can find, run, and reuse them.
  • Fill in your benchmark's metadata — goals, construct, scoring rubric, intended uses, and limitations — to raise its completeness score.
  • Report schema gaps or data issues on the EEE issue tracker.
Researchers
  • Use Evaluation Cards in your model-, evaluation-, or field-level analysis — and cite the paper when you build on it.
  • Report third-party results you've run to Every Eval Ever — independent numbers are first-class here.
  • Flag discrepancies or suggest methodology improvements on the issue tracker or in the discussions.
  • Spread the word — share it with collaborators and on socials.
Policymakers
  • Consult Evaluation Cards as an evidence base — what's documented, who reported it, and how comparable it is.
  • Cite the paper in reports and briefings, and point colleagues to the site.
  • Tell us what evidence you need for decisions — suggest features on the public roadmap or via the feedback form.
  • Spread the word so more of the field reports legibly.
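
For evaluation developers wondering which metadata is still missing, a rough self-check might look like the sketch below. The field names mirror the metadata listed under Evaluation developers, but the scoring rule (fraction of fields filled) is an assumption made for illustration; the source does not say how Evaluation Cards computes completeness.

```python
# Field names follow the metadata listed above; the scoring rule is an assumption.
METADATA_FIELDS = ("goals", "construct", "scoring_rubric", "intended_uses", "limitations")


def missing_fields(metadata):
    """Return the listed metadata fields that are absent or blank."""
    return [f for f in METADATA_FIELDS if not str(metadata.get(f) or "").strip()]


def completeness(metadata):
    """Fraction of the listed fields that are filled in (illustrative only)."""
    filled = len(METADATA_FIELDS) - len(missing_fields(metadata))
    return filled / len(METADATA_FIELDS)


# Hypothetical benchmark metadata with two of five fields filled in.
card = {
    "goals": "Measure grade-school math reasoning",
    "construct": "",
    "limitations": "English only",
}

todo = missing_fields(card)
score = completeness(card)
```

Listing the missing fields, rather than only reporting a number, keeps the gaps themselves visible.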

Spotted an error? A wrong or missing number anywhere in the corpus can be flagged through the feedback form with a source — corrections are versioned, and coverage improves as developers and third parties publish.

Not sure where something fits? The public roadmap, the feedback form, the EEE issue tracker, and the Space discussions are always open.

How to cite

If you find this effort useful, please consider citing our paper and sharing our work on socials.

Reference

Ghosh, A., Reuel, A., Chim, J., Kennedy, W. M., et al. (2026). Evaluation Cards: An Interpretive Layer for AI Evaluation Reporting. arXiv:2606.09809.

BibTeX · Evaluation Cards
@article{ghosh2026evaluationcards,
  title        = {Evaluation Cards: An Interpretive Layer for AI Evaluation Reporting},
  author       = {Ghosh, Avijit and Reuel, Anka and Chim, Jenny and Kennedy, Wm. Matthew and Yadav, Srishti and Mickel, Jennifer and Long, Yanan and Tran, Andrew and Kornilova, Anastassia and Stachura, Damian and Klyman, Kevin and Friedrich, Felix and Sania, Jeba and Lamparth, Max and Batzner, Jan and Mishra, Anoop and Habba, Eliya and Hao, Yixiong and Heath, Nathan and Rismani, Shalaleh and Gohar, Usman and Loehr, Andrea and Manheim, David and Dhar, Ruchira and Nelaturu, Sree Harsha and Sinha, Aarush and Choshen, Leshem and Sharma, Drishti and Khire, Ishan and Saha, Amit and Sahoo, Subramanyam and Hardy, Michael and Riegler, Michael Alexander and Manghnani, Kabir and Lin, Michelle and Jiang, Yanan and Huang, Yilin and Yehudai, Asaf and Ji, Jessica and Hofmann, Aris and Akhtar, Mubashara and Moniz, Nuno and Jernite, Yacine and Biderman, Stella and Talat, Zeerak and Koyejo, Sanmi and Kochenderfer, Mykel and Solaiman, Irene},
  journal      = {arXiv preprint arXiv:2606.09809},
  year         = {2026},
  url          = {https://arxiv.org/abs/2606.09809}
}

Every Eval Ever (EEE) is a sister EvalEval project and one of the data sources that powers Evaluation Cards — please show it some love and cite it too. 💜

BibTeX · Every Eval Ever
@misc{evaleval2026everyevalever,
  title   = {Every Eval Ever: Toward a Common Language for AI Eval Reporting},
  author  = {Jan Batzner and Leshem Choshen and Avijit Ghosh and Sree Harsha Nelaturu and Anastassia Kornilova and Damian Stachura and Yifan Mai and Asaf Yehudai and Anka Reuel and Irene Solaiman and Stella Biderman},
  year    = {2026},
  month   = {February},
  url     = {https://evalevalai.com/infrastructure/2026/02/17/everyevalever-launch/},
  note    = {Blog Post, EvalEval Coalition}
}