OpenEval - Toward the open science of AI evaluation

About

Toward the science of AI evaluation

AI evaluation suffers from opacity and inconsistency, and we argue that item-level benchmark data is essential to the science of AI evaluation.

Existing problems

AI evaluation results are scattered across benchmarks and repositories in incompatible formats, requiring significant manual effort and custom parsing for any cumulative, meta, or item-level analysis and auditing.

Our approach

We propose OpenEval, an item-centered repository where every evaluation instance, including item content, responses, and statistics, is captured in a self-contained entry while connected to the broader experimental context.

From Benchmarks to Science

A consistent, fine-grained data format is fundamental to scientific investigation of AI evaluation, including benchmark design and construction, test validation, and downstream analysis and meta-research.

Open contribution, shared merit

OpenEval supports data contribution for any evaluation suite or model in a few simple steps. Growing with every submission, it serves as a living foundation for community-wide AI evaluation research.

Schema

The item-centered schema

OpenEval organizes results at the item level under a unified schema, making cross-benchmark analysis and community contribution straightforward.

Each entry in OpenEval is built around an item — the atomic unit of evaluation.

item_id: Unique identifier for the item.
item_metadata: Ingestion timestamp, contributor info, and benchmark provenance.
item_content[]: Original item content from the benchmark (question, dialogue, prompt, etc.).
responses[]: Response(s) to the item (see the response fields below).

Given the item, each response records the test conditions and outcomes for a single model run.

response_id: Unique identifier for the response.
model: Respondent model (see the model fields below).
item_adaptation: How the item was adapted for the evaluation run, including the actual model input, few-shot demonstrations, and test environment resources.
response_content[]: The output produced by the model.
scores[]: Metric score(s) of the response (see the metric fields below).

Each model records the respondent model's identity and associated test settings.

name: Name of the respondent model.
size: Parameter count or size label.
model_adaptation: How the model was adapted for the evaluation run, including the system instruction, generation config, and available tools.

Each metric extracts capability evidence from a response and produces a score.

name: Name of the evaluation metric.
references[]: Reference(s) used, if the metric is reference-based.
models[]: Judge model(s) used, if the metric is model-based.
extra_artifacts[]: Additional artifact(s) used for scoring, e.g., rubrics.
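
As a rough illustration of how the four records nest, the sketch below expresses them as Python TypedDict classes. The field names come from the lists above and the metric/value wrapper inside scores follows the example entries on this page; the class names, the concrete value types, and the toy values in the example are assumptions made for illustration, not part of the OpenEval schema itself.

```python
from typing import Any, Optional, TypedDict, Union


class Metric(TypedDict):
    name: str
    references: list[Any]                  # used by reference-based metrics
    models: list[str]                      # judge model(s) for model-based metrics
    extra_artifacts: list[dict[str, Any]]  # e.g. rubrics or checklists


class Score(TypedDict):
    metric: Metric
    value: float


class Model(TypedDict):
    name: str
    size: Optional[Union[int, str]]        # parameter count or size label (assumed types)
    model_adaptation: dict[str, Any]


class Response(TypedDict):
    response_id: str
    model: Model
    item_adaptation: dict[str, Any]
    response_content: list[dict[str, Any]]
    scores: list[Score]


class Item(TypedDict):
    item_id: str
    item_metadata: dict[str, Any]
    item_content: list[dict[str, Any]]
    responses: list[Response]


# A tiny entry with placeholder values, only to show the nesting.
example: Item = {
    "item_id": "toy_0",
    "item_metadata": {},
    "item_content": [{"role": "user", "content": "2 + 2 = ?"}],
    "responses": [
        {
            "response_id": "toy_0_model-a_0",
            "model": {"name": "model-a", "size": None, "model_adaptation": {}},
            "item_adaptation": {},
            "response_content": [{"text": "4"}],
            "scores": [
                {
                    "metric": {
                        "name": "exact_match",
                        "references": ["4"],
                        "models": [],
                        "extra_artifacts": [],
                    },
                    "value": 1.0,
                }
            ],
        }
    ],
}
```

TypedDict annotations are checked by static type checkers only; they do not validate entries at runtime.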

{
  "item_id": "wildbench_20260220T224823Z_0",  # AUTO
  "item_metadata": {
    "ingestion_time": "20260220T224823Z",  # AUTO
    "contributor": {
      "name": "Anonymous",  # OPTIONAL
      "email": "",  # OPTIONAL
      "affiliation": "HCEval"  # OPTIONAL
    },
    "source": {
      "benchmark_name": "WildBench",
      "benchmark_version": "v2",
      "benchmark_url": "https://huggingface.co/datasets/allenai/WildBench/viewer/v2",  # OPTIONAL
      "benchmark_tags": [ "real-world", "Instruction Following", "multi-turn chat" ]  # OPTIONAL
    }
  },
  "item_content": [
    { "role": "user", "content": "answer all of the questions you can (put the question and then the answer under it): ..." },
    { "role": "assistant", "content": "The function that models the data in the table is:\n\nF(x) = 200 * (1/2)^x" },
    { "role": "user", "content": "no answer all of the questions given" }
  ],
  "responses": [
    {
      # See the response example below
    },
    ...
  ],
  "schema_version": "v0.1.0"  # AUTO
}

{
  "response_id": "wildbench_20260220T224823Z_0_gemini-3-pro-preview_0",  # AUTO
  "model": {
    # See the model example below
  },
  "item_adaptation": {
    "input_content": [
      { "role": "user", "content": "answer all of the questions you can (put the question and then the answer under it): ..." },
      { "role": "assistant", "content": "The function that models the data in the table is:\n\nF(x) = 200 * (1/2)^x" },
      { "role": "user", "content": "no answer all of the questions given" }
    ],
    "demonstrations": [],  # OPTIONAL
    "external_resources": []  # OPTIONAL
  },
  "response_content": [
    {
      "text": "Here are the answers to the questions provided...",
      "logprob": 0.0,
      "tokens": [],
      "finish_reason": { "reason": "stop" },
      "thinking": { "text": "**Evaluating the Problems**\n\nI've meticulously assessed the problems..." }
    }
  ],
  "scores": [
    {
      "metric": {
        # See the metric example below
      },
      "value": 0.7777777777777778
    },
    ...
  ]
}

{
  "name": "Gemini 3 Pro Preview",
  "size": null,  # OPTIONAL
  "model_adaptation": {
    "system_instruction": "",
    "generation_parameters": {
      "temperature": 0.0,
      "top_k": 1,
      "top_p": 1,
      "max_tokens": 12000,
      ...
    },
    "tools": []
  }
}

{
  "name": "wildbench_score_rescaled",
  "references": [],
  "models": [ "gpt-4o" ],
  "extra_artifacts": [
    {
      "type": "checklist",
      "content": [
        "Does the AI output provide answers to all the questions listed in the user's query?",
        "Are the answers provided by the AI accurate and correctly calculated based on the information given in the questions?",
        "Does the AI output maintain clarity and coherence in presenting the answers to each question?",
        ...
      ]
    }
  ]
}
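
To make the nesting in these examples concrete, the following sketch flattens one entry into one row per (response, score) pair. It assumes the entry is stored as plain JSON without the # annotations, and that the ingestion timestamp is in UTC (suggested by the trailing Z, but not stated on this page). The helper names flatten_item and parse_ingestion_time are hypothetical and not part of any OpenEval tooling.

```python
import json
from datetime import datetime, timezone

# Trimmed copy of the example entry above; fields not needed here are left out.
ITEM_JSON = """
{
  "item_id": "wildbench_20260220T224823Z_0",
  "item_metadata": {
    "ingestion_time": "20260220T224823Z",
    "source": {"benchmark_name": "WildBench", "benchmark_version": "v2"}
  },
  "responses": [
    {
      "response_id": "wildbench_20260220T224823Z_0_gemini-3-pro-preview_0",
      "model": {"name": "Gemini 3 Pro Preview"},
      "scores": [
        {"metric": {"name": "wildbench_score_rescaled"}, "value": 0.7777777777777778}
      ]
    }
  ]
}
"""


def parse_ingestion_time(stamp: str) -> datetime:
    """Parse a compact timestamp such as '20260220T224823Z' (UTC is assumed)."""
    return datetime.strptime(stamp, "%Y%m%dT%H%M%SZ").replace(tzinfo=timezone.utc)


def flatten_item(item: dict) -> list[dict]:
    """Return one flat row per (response, score) pair of an item entry."""
    metadata = item["item_metadata"]
    rows = []
    for response in item.get("responses", []):
        for score in response.get("scores", []):
            rows.append({
                "item_id": item["item_id"],
                "benchmark": metadata["source"]["benchmark_name"],
                "ingested_at": parse_ingestion_time(metadata["ingestion_time"]),
                "response_id": response["response_id"],
                "model": response["model"]["name"],
                "metric": score["metric"]["name"],
                "value": score["value"],
            })
    return rows


rows = flatten_item(json.loads(ITEM_JSON))
for row in rows:
    print(row)
```

Rows in this long format can be loaded into a dataframe or a SQL table without further reshaping.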

Features

What makes OpenEval different

Every design decision prioritizes transparency, reproducibility, scalability, and community participation.

🧩

Item-Centricity

Results are archived as unique items with associated fields, facilitating customized reassembly and analysis.

🌐

Broad Coverage

A growing repository with 225k+ items and 8M+ responses, covering diverse models, tasks, and constructs.

🗂️

Unified Schema

One JSON schema comprehensively standardizes results across different formats and benchmarks.

📷

Faithful Snapshots

Each item captures a full experimental context, including adaptations and environments, for reproducibility.

🔍

Traceable Archive

Every item is timestamped and linked to its original source with metadata for in-depth exploration and auditing.

🔓

Open Access

All data is publicly available and freely reusable. Anyone can conveniently share their own evaluation results.
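
One way to picture the "customized reassembly" that item-centric storage allows: the same entries can be regrouped by model (a leaderboard-style view) or by item (an item-difficulty view). The sketch below does both with the standard library. The entries, model names, metric name, and score values are toy placeholders invented for illustration, and iter_scores and mean_scores are hypothetical helpers, not OpenEval functions.

```python
from collections import defaultdict
from statistics import mean

# Toy entries with placeholder names and made-up scores, for illustration only.
items = [
    {
        "item_id": "toy_0",
        "responses": [
            {"model": {"name": "model-a"},
             "scores": [{"metric": {"name": "accuracy"}, "value": 1.0}]},
            {"model": {"name": "model-b"},
             "scores": [{"metric": {"name": "accuracy"}, "value": 1.0}]},
        ],
    },
    {
        "item_id": "toy_1",
        "responses": [
            {"model": {"name": "model-a"},
             "scores": [{"metric": {"name": "accuracy"}, "value": 0.0}]},
            {"model": {"name": "model-b"},
             "scores": [{"metric": {"name": "accuracy"}, "value": 1.0}]},
        ],
    },
]


def iter_scores(entries, metric_name):
    """Yield (item_id, model_name, value) for every score of the given metric."""
    for item in entries:
        for response in item["responses"]:
            for score in response["scores"]:
                if score["metric"]["name"] == metric_name:
                    yield item["item_id"], response["model"]["name"], score["value"]


def mean_scores(entries, metric_name, by):
    """Average a metric per 'item' or per 'model'."""
    index = {"item": 0, "model": 1}[by]
    groups = defaultdict(list)
    for record in iter_scores(entries, metric_name):
        groups[record[index]].append(record[2])
    return {key: mean(values) for key, values in groups.items()}


per_model = mean_scores(items, "accuracy", by="model")  # leaderboard-style view
per_item = mean_scores(items, "accuracy", by="item")    # item-difficulty view
print(per_model)
print(per_item)
```

Because both views are computed from the same item-level records, neither requires re-running an evaluation or re-parsing a benchmark-specific format.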

Coverage

Benchmarks in OpenEval

OpenEval currently indexes results from 64 benchmarks, spanning standard NLP tasks, emerging LLM capabilities, and interdisciplinary evaluations.

EmoBench, MMLU-Pro, BIG-Bench-Hard, LegalBench, IFEval, SALAD-Bench, OpenToM, GPQA, BLiMP, WildBench, MATH, TruthfulQA, MedQA, SafetyBench, Hi-ToM, SQuAD, MMLU, MuSR, cultural-trends, MS MARCO, BBQ, WorldValuesBench, RealToxicityPrompts, Omni-MATH, CulturalBench, Do-Not-Answer, HumanEval

General Capability, Reasoning, Domain Expertise, Language Ability, Safety, Social Intelligence

Models in OpenEval

OpenEval covers a growing range of open and proprietary models, with the number of models per benchmark ranging from dozens to thousands.

Llama 2 70B, Mistral 7B, Mixtral 8x7B, Mistral Large, Llama 3.1 Instruct Turbo, Claude 3.5 Haiku, DeepSeek-V3, o3, Claude 3.7 Sonnet, Gemma 3 27B, Llama 4 Scout, o4-mini, Gemini 2.5 Flash-Lite, Qwen3-Next 80B A3B Thinking, gpt-oss-120b, GPT-5, Kimi K2 Instruct, IBM Granite 4.0 Micro, Claude 4 Opus, Claude 4.5 Sonnet, Gemini 3 Pro, Grok 4

OpenAI, Anthropic, Google, Meta, Mistral AI, Other

Contribute

All your evaluation results are welcome

Sharing your evaluation results takes just a few steps — all guidelines and resources are available in our GitHub repository.

1

Format your results

Convert your evaluation results into the OpenEval schema. You may refer to the templates, examples, and converters for HELM and OpenLLM results in our GitHub repository.

2

Validate your submission

Run the provided validator to confirm all required fields are present and correctly formatted before submission.

3

Open a pull request

Open a pull request in our GitHub repository with your .json files and stay informed throughout the review process.

4

See your results go live

After a brief review, your results are shown in OpenEval and made immediately available to the broader research community.

5

We are here to help

Still have questions or need guidance? Reach out to the team via GitHub issues or email — we are happy to assist you through the contribution process.
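
For a feel of what the formatting and validation steps above involve, here is a minimal sketch that converts one result from a hypothetical flat format into an OpenEval-shaped entry and then checks it for missing fields. It is not the converter or validator shipped in the GitHub repository: the flat record layout, the function names, and the lists of required fields are all assumptions. Fields marked AUTO in the schema examples (item_id, response_id, ingestion_time, schema_version) are deliberately left out.

```python
# Assumed required fields; the real validator in the repository may differ.
REQUIRED_ITEM_FIELDS = ("item_metadata", "item_content", "responses")
REQUIRED_RESPONSE_FIELDS = ("model", "item_adaptation", "response_content", "scores")


def to_openeval_item(record: dict) -> dict:
    """Convert one flat result record (hypothetical layout) into an item entry."""
    prompt = [{"role": "user", "content": record["prompt"]}]
    return {
        "item_metadata": {"source": {"benchmark_name": record["benchmark"]}},
        "item_content": prompt,
        "responses": [
            {
                "model": {"name": record["model"], "model_adaptation": {}},
                "item_adaptation": {"input_content": prompt},
                "response_content": [{"text": record["output"]}],
                "scores": [
                    {"metric": {"name": record["metric"]}, "value": record["score"]}
                ],
            }
        ],
    }


def find_problems(item: dict) -> list[str]:
    """Return human-readable messages for missing fields (empty list if none)."""
    problems = []
    for field in REQUIRED_ITEM_FIELDS:
        if field not in item:
            problems.append(f"item: missing '{field}'")
    for i, response in enumerate(item.get("responses", [])):
        for field in REQUIRED_RESPONSE_FIELDS:
            if field not in response:
                problems.append(f"responses[{i}]: missing '{field}'")
    return problems


# Placeholder record with invented values, for illustration only.
flat_record = {
    "benchmark": "toy-bench",
    "prompt": "2 + 2 = ?",
    "model": "model-a",
    "output": "4",
    "metric": "exact_match",
    "score": 1.0,
}

item = to_openeval_item(flat_record)
print(find_problems(item))
```

Running a check of this kind locally before opening a pull request can catch structural mistakes early, though the repository's own validator remains the reference.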

Item by item, toward the open science of AI evaluation.