About

Building a shared infrastructure for informative, transparent, and comparable AI evaluations.

Overview

Evaluations are the backbone of progress in AI, yet the ways they are documented and shared have not kept pace with the field’s growth. Today, evaluations are produced by a growing mix of first- and third-party actors, using diverse methods, formats, and assumptions. As a result, it is increasingly difficult to understand what evaluations exist, how they are conducted, or what they ultimately tell us about an AI model or system.

We envision a world in which AI evaluations are informative, transparent, and comparable by default. In this world, developers, researchers, policymakers, and downstream users can quickly understand how an AI system has been evaluated.

The Eval Cards Proposal

Design Information

Documenting what an evaluation measures and how its results should be interpreted, covering task definition and validity considerations.

EEE Schema

The "Every Eval Ever" standardized reporting schema for inference- and execution-level details (temperature, tokens, etc.).

Central Platform

A shared repository linking design info with run data, allowing exploration by model or evaluation.

Motivation

Evaluations come in a variety of forms and formats depending on the organization conducting them. Today, the lack of standardization in evaluation design information and run metadata limits the impact of evaluations: results are neither readily comparable nor readily available.

Moreover, evaluation results remain scattered across repositories, websites, tables, and papers, making it difficult to determine which evaluations of a given AI system have been conducted.

Why Eval Cards?

Just as model cards have catalyzed common documentation practices for AI systems, Eval Cards aim to establish a norm for structured reporting of AI evaluations themselves.

By standardizing how evaluation design information and run-level metadata are reported, Eval Cards make apples-to-apples comparison possible and reduce duplicated infrastructure work for evaluation research.

Current State

Eval Cards are actively under development by the EvalEval Coalition. We have completed the following milestones:

  • Developed a draft version of the EEE schema
  • Designed a GUI mockup for the central platform
  • Designed in-platform explanatory tooltips for design details
  • Integrated with the Eval Factsheets repository

We are currently soliciting community feedback on all components through mid-January 2026.

Next Steps

Release: February 2026

In the lead-up to this release, we are:

  • Actively engaging with model developers, independent evaluation organizations, and research groups to solicit feedback and encourage early adoption.
  • Continuing to develop the Eval Cards platform as a central, publicly accessible repository where evaluations can be submitted, discovered, and compared.

Following the initial release, we will maintain and evolve the Eval Cards format in consultation with the research and practitioner communities.

The EvalEval Coalition

A Global Research Community

We are a community of 400+ researchers and practitioners developing rigorous AI evaluation methods and the infrastructure needed to deploy them at scale for real-world impact.

Get Involved

Groups interested in collaborating with us on Eval Cards are invited to submit an expression of interest.