L2-Bench
Evaluating AI Capabilities in Language Education
Empowering researchers, institutions, and practitioner communities to make informed decisions about AI for language education.
Our team of pedagogy experts, data scientists and AI researchers have developed the first AI evaluation benchmark specifically designed for second‑language (L2) education, in collaboration with the University of Oxford.
Our Mission
Evaluation benchmarks shape AI—what gets built, what gets improved, and what the world adopts.
OUP introduce L2-Bench, a first-of-its-kind benchmark specifically designed for second‑language (L2) education, to help establish the standard for what good looks like in AI for language learning and assessment globally.
L2-Bench will allow anyone to evaluate AI systems for supporting language learning across diverse education scenarios grounded in learning theory, enabling educators to make informed decisions about AI tools. OUP will use it to rigorously validate our own use of LLMs in our products, ensuring that whatever LLMs we do employ are appropriately evaluated for use specifically in English language teaching.
We have validated L2-Bench with over 200 education practitioners representing the dynamics of global pedagogy, and we intend to share our peer-reviewed research methodology and dataset, developed in collaboration with researchers from the University of Oxford, with the education community in early 2026.
For Educators
Practical AI capability assessment for specific teaching scenarios.
For Institutions
Informed decision-making when selecting AI tools.
For the Field
Accelerated development of effective AI-powered educational systems.
L2-Bench Contributions
Competency-based
A "learning experience designer in second language education" construct, spanning 12 core competencies and 31 sub-competencies derived from established language teaching frameworks, that is used to create tasks.
Validated Dataset
Over 1,000 rubric-scored task-response ("Q&A") pairs curated in collaboration with pedagogical experts and validated by over 200 global practitioners to ensure alignment with authentic education contexts.
Open Leaderboard
Transparent rankings of frontier and top open-source AI models, reported with statistical uncertainty quantification, to help the community track AI capabilities in language education.
LLM-as-a-Judge Scoring
Integrates recent methodologies for state-of-the-art AI evaluations with automated scoring systems calibrated by expert practitioner scoring.
Context-specific Methodology
Reproducible methodology allowing systematic task and rubric creation for context-specific evaluations across diverse education scenarios.
Peer-reviewed Research
Research papers on L2-Bench methods, validation and results, co-authored with the University of Oxford and submitted for conference publication.
L2-Bench Example
Explore a demo of how an AI model response is evaluated against a rubric of binary criteria (yes/no) for a task built around the "Lesson Planning" competency.
L2-Bench Competencies
Explore the 12 core competencies and their sub-competencies that define an effective "learning experience designer in second language education". We use this term to encompass the range of roles that intentionally design the conditions that shape how people learn: teachers, materials developers (content or assessment creators), learning designers, and teacher trainers. The construct aims to capture what is needed to support language learning effectively.
Competency Hierarchy
L2-Bench Tasks and Rubrics
From competency construct to task rubrics — explore a demo of how we systematically build tasks around a single competency with granular assessment that maintains connection to broader "consensus" in language pedagogy.
L2-Bench Scoring
How we evaluate AI responses to over 1,000 tasks at scale: for each response to a task, a calibrated autoscorer (LLM-as-a-Judge) determines, for every criterion in the task rubric, whether that criterion is present in the response (Yes/No). Points are applied only if the criterion is present. This keeps scoring simple and consistent, and enables reliable automated scoring of open-ended responses from multiple AI models across 1,000+ tasks.
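The binary-rubric scoring described here can be illustrated with a minimal Python sketch. It is not the L2-Bench implementation: the `Criterion` structure, the `score_response` helper, the criterion names, and the point values are all hypothetical. It also assumes that every criterion carries positive points and that the judge's Yes/No verdicts are already available as booleans.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One binary rubric criterion (hypothetical structure, not the L2-Bench schema)."""
    name: str
    points: int


def score_response(rubric, verdicts):
    """Add up points for the criteria the judge marked as present.

    `verdicts` maps a criterion name to True (Yes) or False (No).
    A criterion missing from `verdicts` is treated as not present.
    Returns a tuple (earned points, maximum points).
    """
    earned = sum(c.points for c in rubric if verdicts.get(c.name, False))
    maximum = sum(c.points for c in rubric)
    return earned, maximum


# Illustrative rubric for a lesson-planning task (names and points are made up).
rubric = [
    Criterion("states_learning_objective", 2),
    Criterion("includes_practice_activity", 2),
    Criterion("specifies_learner_level", 1),
]

# Illustrative judge output: one Yes/No verdict per criterion.
verdicts = {
    "states_learning_objective": True,
    "includes_practice_activity": False,
    "specifies_learner_level": True,
}

earned, maximum = score_response(rubric, verdicts)
print(f"{earned}/{maximum}")  # prints "3/5"
```

Because each verdict is a plain Yes/No, the arithmetic stays trivial and auditable: any disagreement between the autoscorer and a human rater can be traced to a single criterion.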
Benchmark Your AI for Language Education
Stay tuned to join the community of researchers, educators, and institutions ready to evaluate AI for language education on L2-Bench.