Our Mission

Evaluation benchmarks shape AI: what gets built, what gets improved, and what the world adopts.

OUP introduce L2-Bench, a first-of-its-kind benchmark specifically designed for second‑language (L2) education, to help establish the standard for what good looks like in AI for language learning and assessment globally.

L2-Bench will allow anyone to evaluate AI systems for supporting language learning across diverse education scenarios grounded in learning theory, enabling educators to make informed decisions about AI tools. OUP will use it to rigorously validate our own use of LLMs in our products, ensuring that whatever LLMs we do employ are appropriately evaluated for use specifically in English language teaching.

We have validated L2-Bench with over 200 education practitioners representing the dynamics of global pedagogy. We intend to share our peer-reviewed research methodology and dataset, developed in collaboration with researchers from the University of Oxford, with the education community in early 2026.

For Educators

Practical AI capability assessment for specific teaching scenarios.

For Institutions

Informed decision-making when selecting AI tools.

For the Field

Accelerated development of effective AI-powered educational systems.

L2-Bench Contributions

Competency-based

A "learning experience designer in second language education" construct, used to create tasks, spanning 12 core competencies and 31 sub-competencies derived from established language teaching frameworks.

Validated Dataset

Over 1,000 rubric-scored task-response ("Q&A") pairs curated in collaboration with pedagogical experts and validated by over 200 global practitioners to ensure alignment with authentic education contexts.

Open Leaderboard

Transparent rankings of frontier and top open-source AI models, reported with statistical uncertainty quantification, to help the community track AI capabilities in language education.

LLM-as-a-Judge Scoring

Integrates recent methodologies for state-of-the-art AI evaluation with automated scoring systems calibrated against expert practitioner scoring.

Context-specific Methodology

Reproducible methodology allowing systematic task and rubric creation for context-specific evaluations across diverse education scenarios.

Peer-reviewed Research

Research papers on L2-Bench methods, validation, and results, co-authored with the University of Oxford and submitted for conference publication.

L2-Bench Example

Explore a demo of how an AI model response is evaluated against a rubric of binary criteria (yes/no) for a task built around the "Lesson Planning" competency.


L2-Bench Competencies

Explore the 12 core competencies and their sub-competencies that define an effective "learning experience designer in second language education". This term encompasses the range of roles that intentionally design the conditions that shape how people learn: teachers, materials developers (content or assessment creators), learning designers, and teacher trainers. The construct aims to capture what is needed to support language learning effectively.

Competency Hierarchy

Outer: 12 Competencies
Inner: 31 Sub-competencies

L2-Bench Tasks and Rubrics

From competency construct to task rubrics: explore a demo of how we systematically build tasks around a single competency, with granular assessment that stays connected to the broader "consensus" in language pedagogy.

L2-Bench Scoring

How we evaluate AI responses to over 1,000 tasks at scale: for each response to a task, a calibrated autoscorer (LLM-as-a-Judge) checks every criterion in the task rubric and determines whether it is present in the response (Yes/No). Points are applied only if the criterion is present. This keeps scoring simple and consistent, and enables reliable automated scoring of open-ended responses from multiple AI models across 1,000+ tasks.
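
To make the binary-rubric idea concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the L2-Bench implementation: the names Criterion, score_response, and keyword_judge are hypothetical, the point values are made up, and normalising the total to a 0-1 range is our assumption. The keyword_judge function is only a stand-in so the example runs; in practice the Yes/No decision would come from the calibrated LLM-as-a-Judge autoscorer.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Criterion:
    """One binary rubric criterion (hypothetical structure)."""
    description: str
    points: float


# A judge answers Yes (True) or No (False) for one criterion and one response.
Judge = Callable[[str, Criterion], bool]


def score_response(response: str, rubric: Sequence[Criterion], judge: Judge) -> float:
    """Award a criterion's points only when the judge says it is present.

    Returns earned points divided by possible points (assumed normalisation).
    """
    earned = sum(c.points for c in rubric if judge(response, c))
    possible = sum(c.points for c in rubric)
    return earned / possible if possible else 0.0


def keyword_judge(response: str, criterion: Criterion) -> bool:
    """Placeholder judge: a substring check standing in for an LLM call."""
    return criterion.description.lower() in response.lower()


# Illustrative rubric for a lesson-planning task (criteria are invented).
rubric = [
    Criterion("learning objective", 2.0),
    Criterion("warm-up", 1.0),
    Criterion("assessment", 1.0),
]
response = "The plan states a learning objective and opens with a warm-up activity."

score = score_response(response, rubric, keyword_judge)
print(score)  # 0.75: two of three criteria present, 3 of 4 points earned
```

Because each criterion is a separate Yes/No decision, the same rubric can be applied unchanged to responses from different models, and each judgement can be compared directly with an expert's judgement on the same criterion.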

Benchmark Your AI for Language Education

Stay tuned to join the community of researchers, educators, and institutions ready to evaluate AI for language education on L2-Bench.