OpenEval is an open-source, item-level repository that builds the foundation for AI evaluation science, advancing open science toward more accessible, transparent, and rigorous model assessment.
AI evaluation suffers from opacity and inconsistency, and we argue that item-level benchmark data is essential to the science of AI evaluation.
AI evaluation results are scattered across benchmarks and repositories in incompatible formats, requiring significant manual effort and custom parsing for any cumulative, meta, or item-level analysis and auditing.
We propose OpenEval, an item-centered repository where every evaluation instance, including item content, responses, and statistics, is captured in a self-contained entry while connected to the broader experimental context.
A consistent, fine-grained data format is fundamental to scientific investigation of AI evaluation, including benchmark design and construction, test validation, and downstream analysis and meta-research.
OpenEval supports data contribution for any evaluation suite or model in a few simple steps. Growing with every submission, it serves as a living foundation for community-wide AI evaluation research.
OpenEval organizes results at the item level under a unified schema, making cross-benchmark analysis and community contribution straightforward.
Each entry in OpenEval is built around an item — the atomic unit of evaluation.
For a given item, each response records the test conditions and outcome of a single model run.
Each model records the respondent model's identity and associated test settings.
Each metric extracts capability evidence from a response and produces a score.
Every design decision prioritizes transparency, reproducibility, scalability, and community participation.
Results are archived as unique items with associated fields, facilitating customized reassembly and analysis.
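To make the entry structure concrete, here is a minimal sketch of what a single item-level record might look like. The field names below are illustrative assumptions, not the actual OpenEval schema; see the templates in the GitHub repository for the real format.

```python
import json

# Hypothetical item-level entry. Field names ("item", "responses", etc.)
# are illustrative assumptions, not the official OpenEval schema.
entry = {
    "item": {
        "id": "example-benchmark/algebra/0",        # assumed identifier format
        "content": "What is 7 * 8?",
        "source": "https://example.org/benchmark",  # placeholder source link
        "timestamp": "2024-01-01T00:00:00Z",
    },
    "responses": [
        {
            "model": {"name": "example-model-7b",
                      "settings": {"temperature": 0.0}},
            "output": "56",
            "metrics": {"exact_match": 1.0},
        }
    ],
}

# Each entry is self-contained, so it round-trips as a standalone JSON record.
assert json.loads(json.dumps(entry)) == entry
```

Because every record carries its item, responses, and scores together, cross-benchmark analyses can load entries independently and reassemble them along any axis (model, task, metric).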
A growing repository with 225k+ items and 8M+ responses, covering diverse models, tasks, and constructs.
A single JSON schema standardizes results across heterogeneous formats and benchmarks.
Each item captures a full experimental context, including adaptations and environments, for reproducibility.
Every item is timestamped and linked to its original source with metadata for in-depth exploration and auditing.
All data is publicly available and freely reusable. Anyone can conveniently share their own evaluation results.
OpenEval currently indexes results from 64 benchmarks, spanning standard NLP tasks, emerging LLM capabilities, and interdisciplinary evaluations.
OpenEval covers a growing range of open and proprietary models, with the number of models per benchmark ranging from dozens to thousands.
Sharing your evaluation results takes just a few steps — all guidelines and resources are available in our GitHub repository.
Convert your evaluation results into the OpenEval schema. You may refer to the templates, examples, and converters for HELM and OpenLLM results in our GitHub repository.
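As a rough illustration of what such a conversion involves, the sketch below maps a generic per-example results row into an item-centered record. Both the input row format and the output field names are assumptions for illustration; consult the converters in the GitHub repository for the actual schema.

```python
# Hypothetical converter: one per-example results row -> one item-centered
# record. Input format and output field names are illustrative assumptions,
# not the official OpenEval schema.
def convert_row(row: dict, model_name: str, benchmark: str) -> dict:
    return {
        "item": {
            "id": f"{benchmark}/{row['example_id']}",
            "content": row["prompt"],
        },
        "responses": [
            {
                "model": {"name": model_name},
                "output": row["completion"],
                "metrics": {"score": row["score"]},
            }
        ],
    }

row = {"example_id": 7, "prompt": "2+2=?", "completion": "4", "score": 1.0}
record = convert_row(row, model_name="example-model", benchmark="toy-bench")
assert record["item"]["id"] == "toy-bench/7"
```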
Run the provided validator to confirm all required fields are present and correctly formatted before submission.
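The official validator lives in the GitHub repository; as a sketch of the kind of check it performs, the function below verifies that assumed required fields are present. The field names are illustrative, so use the provided validator for real submissions.

```python
# Minimal validation sketch: check that assumed required fields exist.
# Field names are illustrative assumptions, not the official schema.
REQUIRED_ITEM_FIELDS = {"id", "content"}
REQUIRED_RESPONSE_FIELDS = {"model", "output", "metrics"}

def validate(record: dict) -> list[str]:
    """Return a list of human-readable errors; empty means the record passes."""
    errors = []
    item = record.get("item", {})
    for field in sorted(REQUIRED_ITEM_FIELDS - item.keys()):
        errors.append(f"item missing field: {field}")
    for i, resp in enumerate(record.get("responses", [])):
        for field in sorted(REQUIRED_RESPONSE_FIELDS - resp.keys()):
            errors.append(f"responses[{i}] missing field: {field}")
    return errors

good = {"item": {"id": "x/0", "content": "?"},
        "responses": [{"model": {}, "output": "", "metrics": {}}]}
bad = {"item": {"id": "x/1"}, "responses": [{"output": ""}]}
assert validate(good) == []
assert validate(bad)  # reports the missing fields
```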
Open a pull request in our GitHub repository with your .json files, and stay informed throughout the review process.
After a brief review, your results appear in OpenEval and become immediately available to the broader research community.
Still have questions or need guidance? Reach out to the team via GitHub issues or email — we are happy to assist you through the contribution process.