OpenEval - Toward the open science of AI evaluation

About

Toward the science of AI evaluation

AI evaluation suffers from opacity and inconsistency, and we argue that item-level benchmark data is essential to the science of AI evaluation.

📣 Read our position paper

Existing problems

AI evaluation results are scattered across benchmarks and repositories in incompatible formats, requiring significant manual effort and custom parsing for any cumulative, meta, or item-level analysis and auditing.

Our approach

We propose OpenEval, an item-centered repository where every evaluation instance, including item content, responses, and statistics, is captured in a self-contained entry while connected to the broader experimental context.

From Benchmarks to Science

A consistent, fine-grained data format is fundamental to scientific investigation of AI evaluation, including benchmark design and construction, test validation, and downstream analysis and meta-research.

Open contribution, shared merit

OpenEval supports data contribution for any evaluation suite or model in a few simple steps. Growing with every submission, it serves as a living foundation for community-wide AI evaluation research.

Schema

The item-centered schema

OpenEval organizes results at the item level under a unified schema, making cross-benchmark analysis and community contribution straightforward.

Each entry in OpenEval is built around an item — the atomic unit of evaluation.

item_id: Unique identifier for the item.
item_metadata: Ingestion timestamp, contributor info, and benchmark provenance.
item_content[]: Original item content from the source benchmark, which is divided into input[] and references[].
responses[]: Response(s) to the item (see the next tab).

Given the item, each response records the test conditions and outcomes for a single request.

response_id: Unique identifier for the response, prefixed with the associated item_id.
model: Respondent model (see the next tab).
item_adaptation: How the item was adapted for the evaluation run, including the actual request_input[], few-shot demonstrations, and test environment resources.
response_content[]: The output returned by the model.
scores[]: Metric score(s) of the response (see the last tab).

Each model records the respondent model's identity and associated test settings.

name: Name of the respondent model.
size: Parameter count or size label.
model_adaptation: How the model was adapted for the evaluation run, including the system instruction, generation config, and available tools.

Each metric extracts capability evidence from a response and produces a score.

name: Name of the evaluation metric.
models[]: Judge model(s) used, if the metric is model-based.
extra_artifacts[]: Additional artifact(s) used for scoring, e.g., evaluation rubrics.

{
  "item_id": "wildbench_20260220T224823Z_0",  # AUTO
  "item_metadata": {
    "ingestion_time": "20260220T224823Z",  # AUTO
    "contributor": {
      "name": "Anonymous",  # OPTIONAL
      "email": "",  # OPTIONAL
      "affiliation": "Human-Centered Eval",  # OPTIONAL
    },
    "source": {
      "benchmark_name": "WildBench",
      "benchmark_version": "v2",
      "paper_url": "https://iclr.cc/virtual/2025/poster/29940",  # OPTIONAL
      "dataset_url": "https://huggingface.co/datasets/allenai/WildBench/viewer/v2",  # OPTIONAL
      "benchmark_tags": [ "real-world", "Instruction Following", "multi-turn chat" ]  # OPTIONAL
    }
  },
  "item_content": {
    "input": [
      { "role": "user", "content": "answer all of the questions you can (put the question and then the answer under it): ..." },
      { "role": "assistant", "content": "The function that models the data in the table is:\n\nF(x) = 200 * (1/2)^x" },
      { "role": "user", "content": "no answer all of the questions given" }
    ],
    "references": []
  },
  "responses": [
    {
      # See the next tab for a response
    },
    ...
  ],
  "schema_version": "v0.1.0"  # AUTO
}

{
  "response_id": "wildbench_20260220T224823Z_0_gemini-3-pro-preview_0",  # AUTO
  "model": {
    # See the next tab for a model
  },
  "item_adaptation": {
    "request_input": [
      { "role": "user", "content": "answer all of the questions you can (put the question and then the answer under it): ..." },
      { "role": "assistant", "content": "The function that models the data in the table is:\n\nF(x) = 200 * (1/2)^x" },
      { "role": "user", "content": "no answer all of the questions given" }
    ],
    "demonstrations": [],  # OPTIONAL
    "external_resources": []
  },
  "response_content": [
    {
      "text": "Here are the answers to the questions provided...",
      "logprob": 0.0,
      "tokens": [],
      "finish_reason": { "reason": "stop" },
      "thinking": {
        "text": "**Evaluating the Problems**\n\nI've meticulously assessed the problems..."
      }
    }
  ],
  "scores": [
    {
      "metric": {
        # See the last tab for a metric
      },
      "value": 0.7777777777777778
    },
    ...
  ]
}

{
  "name": "Gemini 3 Pro Preview",
  "size": "",  # OPTIONAL
  "model_adaptation": {
    "system_instruction": "",
    "generation_parameters": {
      "temperature": 0.0,
      "do_sample": false,
      "top_k": 1,
      "top_p": 1.0,
      "max_tokens": 12000,
      ...
    },
    "tools": []
  }
}

{
  "name": "wildbench_score_rescaled",
  "models": [ "gpt-4o" ],
  "extra_artifacts": [
    {
      "type": "checklist",
      "content": [
        "Does the AI output provide answers to all the questions listed in the user's query?",
        "Are the answers provided by the AI accurate and correctly calculated based on the information given in the questions?",
        "Does the AI output maintain clarity and coherence in presenting the answers to each question?",
        ...
      ]
    }
  ]
}
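To illustrate the kind of item-level analysis this nesting allows, here is a minimal Python sketch that flattens an entry into one row per (response, score) pair and averages a metric per item. It assumes only the field names shown in the examples above; the helper names flatten_entry and mean_score_per_item, the toy entry, and its scores are hypothetical and not part of OpenEval.

```python
from collections import defaultdict
from statistics import mean


def flatten_entry(entry):
    """Yield one flat row per (response, score) pair of an OpenEval-style entry."""
    for response in entry.get("responses", []):
        model_name = response.get("model", {}).get("name", "")
        for score in response.get("scores", []):
            yield {
                "item_id": entry["item_id"],
                "response_id": response["response_id"],
                "model": model_name,
                "metric": score["metric"]["name"],
                "value": score["value"],
            }


def mean_score_per_item(entries, metric_name):
    """Average one metric over all responses to each item."""
    values = defaultdict(list)
    for entry in entries:
        for row in flatten_entry(entry):
            if row["metric"] == metric_name:
                values[row["item_id"]].append(row["value"])
    return {item_id: mean(vals) for item_id, vals in values.items()}


# Toy entry with made-up scores; only the field names follow the examples above.
toy_entry = {
    "item_id": "toybench_0",
    "responses": [
        {
            "response_id": "toybench_0_model-a_0",
            "model": {"name": "Model A"},
            "scores": [{"metric": {"name": "accuracy"}, "value": 1.0}],
        },
        {
            "response_id": "toybench_0_model-b_0",
            "model": {"name": "Model B"},
            "scores": [{"metric": {"name": "accuracy"}, "value": 0.0}],
        },
    ],
}

rows = list(flatten_entry(toy_entry))
item_means = mean_score_per_item([toy_entry], "accuracy")
print(item_means)  # {'toybench_0': 0.5}
```

Because every row carries its item_id, the same rows can be regrouped by model or by metric without touching the source files.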

Features

What makes OpenEval different

Every design decision prioritizes transparency, reproducibility, scalability, and community participation.

🧩

Item-Centricity

Results are archived as unique items with associated fields, facilitating customized reassembly and analysis.

🌐

Broad Coverage

A growing repository with 155k+ items and 10M responses, covering diverse models, tasks, and constructs.

🗂️

Unified Schema

One JSON schema comprehensively standardizes results across different formats and benchmarks.

📷

Faithful Snapshots

Each item captures a full experimental context, including adaptations and environments, for reproducibility.

🔍

Traceable Archive

Every item is timestamped and linked to its original source with metadata for in-depth exploration and auditing.

🔓

Open Access

All data is publicly available and freely reusable. Anyone can conveniently share their own evaluation results.

Coverage

Benchmarks in OpenEval

OpenEval currently indexes results from 24 benchmarks, spanning standard NLP tasks, emerging LLM capabilities, and interdisciplinary evaluations.

EmoBench, MMLU-Pro, IFEval, SALAD-Bench, OpenToM, XSTest, XSum, GPQA, MoralBench, WildBench, TruthfulQA, Hi-ToM, BBQ, CNNDM, Omni-MATH, CulturalBench, Do-Not-Answer

General Capability, Reasoning, Domain Knowledge, Language Ability, Safety, Social Intelligence

Models in OpenEval

OpenEval covers a growing range of open and proprietary models, with approximately 70 models per benchmark on average.

GPT-5.4 pro, Llama 4 Maverick, Mixtral 8x22B, Mistral Large, Llama 3.1 405B Instruct, Claude Sonnet 4.5, Gemini 3 Pro, DeepSeek-V3, GPT-5.4 nano, Gemma 3 27B, Llama 4 Scout, GPT-5.3 Chat, Gemini 2.5 Flash, Claude Haiku 4.5, Mistral 7B, Qwen3-Next-80B-A3B-Thinking, gpt-oss-120b, Kimi K2 Instruct, Claude Opus 4, Grok 4

OpenAI, Anthropic, Google, Meta, Mistral AI, Other

Contribute

All your evaluation results are welcome

Sharing your evaluation results takes just a few steps — all guidelines and resources are available in our GitHub repository.

1

Format your results

Convert your evaluation results into the OpenEval schema. You may refer to the templates, examples, and converters for HELM and OpenLLM results in our GitHub repository.

2

Validate your submission

Run the provided validator to confirm all required fields are present and correctly formatted before submission.

3

Open a pull request

Open a pull request in our GitHub repository with your .json files and stay informed throughout the review process.

4

See your results go live

After a brief review, your results are shown in OpenEval and made immediately available to the broader research community.

5

We are here to help

Still have questions or need guidance? Reach out to the team via GitHub issues or email — we are happy to assist you through the contribution process.
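For a sense of what the validation step checks, the following Python sketch runs a few structural checks on an entry. It is not the validator from the GitHub repository: the required-key lists are assumptions read off the example entries, check_entry is a hypothetical helper, and the sketch assumes the AUTO fields (item_id, ingestion_time, response_id) are already filled in.

```python
from datetime import datetime

# Assumed required keys, read off the example entries; the official validator
# in the GitHub repository is the authority on what is actually required.
REQUIRED_ITEM_KEYS = ("item_id", "item_metadata", "item_content", "responses")
REQUIRED_RESPONSE_KEYS = (
    "response_id", "model", "item_adaptation", "response_content", "scores",
)


def check_entry(entry):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    for key in REQUIRED_ITEM_KEYS:
        if key not in entry:
            problems.append(f"missing item field: {key}")

    # The examples use compact UTC timestamps such as 20260220T224823Z.
    stamp = entry.get("item_metadata", {}).get("ingestion_time")
    if stamp is not None:
        try:
            datetime.strptime(stamp, "%Y%m%dT%H%M%SZ")
        except (ValueError, TypeError):
            problems.append(f"unparseable ingestion_time: {stamp}")

    item_id = entry.get("item_id")
    for index, response in enumerate(entry.get("responses", [])):
        for key in REQUIRED_RESPONSE_KEYS:
            if key not in response:
                problems.append(f"response {index}: missing field: {key}")
        # The schema describes response_id as prefixed with the item_id.
        response_id = str(response.get("response_id", ""))
        if item_id and not response_id.startswith(f"{item_id}_"):
            problems.append(f"response {index}: response_id lacks item_id prefix")
    return problems


# Toy entries for illustration only.
good = {
    "item_id": "toybench_20260220T224823Z_0",
    "item_metadata": {"ingestion_time": "20260220T224823Z"},
    "item_content": {"input": [], "references": []},
    "responses": [
        {
            "response_id": "toybench_20260220T224823Z_0_model-a_0",
            "model": {"name": "Model A"},
            "item_adaptation": {"request_input": []},
            "response_content": [],
            "scores": [],
        }
    ],
}
bad = {
    "item_id": "toybench_1",
    "item_metadata": {"ingestion_time": "yesterday"},
    "responses": [{"response_id": "other_0"}],
}

good_problems = check_entry(good)
bad_problems = check_entry(bad)
for problem in bad_problems:
    print(problem)
```

Collecting every problem in a list, rather than stopping at the first one, lets a contributor fix a file in a single pass before opening a pull request.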

Item by item, toward the open science of AI evaluation.