Building a shared infrastructure for informative, transparent, and comparable AI evaluations.
Evaluations are the backbone of progress in AI, yet the ways they are documented and shared have not kept pace with the field’s growth. Today, evaluations are produced by a growing mix of first- and third-party actors, using diverse methods, formats, and assumptions. As a result, it is increasingly difficult to understand what evaluations exist, how they are conducted, or what they ultimately tell us about an AI model or system.
We envision a world in which AI evaluations are informative, transparent, and comparable by default. In this world, developers, researchers, policymakers, and downstream users can quickly understand how an AI system has been evaluated.
Documenting what an evaluation measures and how its results should be interpreted, covering task definition and validity considerations.
The "Every Eval Ever" standardized reporting schema for inference- and execution-level details (temperature, tokens, etc.).
A shared repository linking design information with run data, allowing exploration by model or by evaluation (a minimal sketch of how these pieces could fit together follows below).
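To make this concrete, here is a minimal sketch, in Python, of how design-level documentation and run-level metadata could be linked in a single record. All class and field names are illustrative assumptions for this sketch, not the coalition's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class EvalDesign:
    """Design-level documentation: what the evaluation measures and how to interpret it."""
    eval_id: str                      # identifier shared with run records (illustrative)
    task_definition: str              # what the task asks of the model
    interpretation_notes: str         # how results should, and should not, be read
    validity_considerations: list[str] = field(default_factory=list)


@dataclass
class EvalRun:
    """Run-level metadata: inference- and execution-level details for a single run."""
    eval_id: str                      # links the run back to its design record
    model_id: str                     # model or system under evaluation
    temperature: float
    max_tokens: int
    num_samples: int
    scoring_method: str               # e.g. "exact_match" or "llm_judge"
    score: float


# A shared repository could index such records by model_id or eval_id,
# supporting exploration by model or by evaluation.
design = EvalDesign(
    eval_id="example-benchmark",
    task_definition="Multiple-choice questions over a fixed prompt template.",
    interpretation_notes="Accuracy reflects performance under these settings only.",
    validity_considerations=["possible training-data contamination"],
)
run = EvalRun(
    eval_id="example-benchmark",
    model_id="example-model",
    temperature=0.0,
    max_tokens=1024,
    num_samples=500,
    scoring_method="exact_match",
    score=0.71,
)
```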
Evaluations come in a variety of forms and formats depending on the organization conducting them. Today, the lack of standardization in evaluation design information and evaluation run metadata limits the impact of evaluations: results are neither readily comparable nor readily available.
Moreover, results remain scattered across repositories, websites, tables, and papers, making it difficult to see which evaluations of a given AI system have already been conducted.
Just as model cards have catalyzed common documentation practices for AI systems, Eval Cards aim to establish a norm for structured reporting of AI evaluations themselves.
By standardizing how evaluation design information and run-level metadata are reported, Eval Cards make apples-to-apples comparison possible and reduce duplicated infrastructure work for evaluation research.
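As an illustration, the hypothetical snippet below (field names and values are assumptions for this sketch, not real results or the coalition's API) shows the kind of apples-to-apples grouping that standardized run metadata makes straightforward: runs are only compared when they share an evaluation and the same inference settings.

```python
from collections import defaultdict

# Purely illustrative run records in a standardized format; the values are
# placeholders for this sketch, not real evaluation results.
runs = [
    {"eval_id": "example-benchmark", "model_id": "model-a",
     "temperature": 0.0, "max_tokens": 1024, "score": 0.71},
    {"eval_id": "example-benchmark", "model_id": "model-b",
     "temperature": 0.0, "max_tokens": 1024, "score": 0.64},
    {"eval_id": "example-benchmark", "model_id": "model-c",
     "temperature": 0.7, "max_tokens": 256, "score": 0.69},
]

# Group runs so that only results produced under the same evaluation and the
# same inference settings are compared with one another.
groups = defaultdict(list)
for r in runs:
    groups[(r["eval_id"], r["temperature"], r["max_tokens"])].append(r)

# Rank models within each comparable group.
for settings, group in groups.items():
    ranked = sorted(group, key=lambda r: r["score"], reverse=True)
    print(settings, [(r["model_id"], r["score"]) for r in ranked])
```

In this toy example, the third run falls into its own group, reflecting that a score produced under a different temperature and token budget is not directly comparable to the others.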
Eval Cards are actively under development by the EvalEval coalition. We have completed the following milestones:
In the lead-up to this release, we are:
Following the initial release, we will maintain and evolve the Eval Cards format in consultation with the research and practitioner communities.
We are a community of 400+ researchers and practitioners developing rigorous AI evaluation methods and the infrastructure needed to deploy them at scale for real-world impact.