Envisioning is an emerging technology research institute and advisory.


MMLU (Massive Multitask Language Understanding)

A benchmark testing language models across 57 diverse academic and professional subjects.

Year: 2021
Generality: 589

MMLU is a large-scale evaluation benchmark designed to measure the breadth and depth of knowledge encoded in language models. Introduced by Hendrycks et al. in 2021, it consists of approximately 15,000 multiple-choice questions spanning 57 subjects — ranging from elementary mathematics and world history to professional law, medicine, and ethics. Unlike narrow benchmarks that probe a single capability, MMLU was explicitly designed to stress-test whether models can generalize across the full spectrum of human knowledge domains, making it one of the most comprehensive evaluations available at the time of its release.

The benchmark works by presenting a model with a question and four answer choices, then scoring accuracy per subject along with an aggregate average across all subjects. Questions are sourced from real academic and professional exam materials, ensuring they reflect genuine human standards of competence rather than artificially constructed tasks. The difficulty ranges from high-school to expert-level content, allowing researchers to distinguish surface-level pattern matching from deeper conceptual understanding. Models are typically evaluated in a few-shot setting, where a small number of worked example questions are provided in the prompt before the target question.
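The few-shot protocol described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the official evaluation harness: the header string follows the convention popularized by the original MMLU release, but the question dictionaries and the helper names (`format_question`, `build_prompt`, `accuracy`) are hypothetical.

```python
# Sketch of MMLU-style few-shot evaluation. Questions here are hypothetical
# stand-ins; the real benchmark ships per-subject question files.

CHOICES = ["A", "B", "C", "D"]

def format_question(q):
    """Render one question with its four lettered options."""
    lines = [q["question"]]
    lines += [f"{label}. {opt}" for label, opt in zip(CHOICES, q["options"])]
    lines.append("Answer:")
    return "\n".join(lines)

def build_prompt(dev_examples, test_question, subject):
    """Few-shot prompt: solved dev examples precede the unanswered target."""
    header = (f"The following are multiple choice questions "
              f"(with answers) about {subject}.\n\n")
    shots = "\n\n".join(
        format_question(ex) + f" {CHOICES[ex['answer']]}" for ex in dev_examples
    )
    return header + shots + "\n\n" + format_question(test_question)

def accuracy(predictions, questions):
    """Fraction of predicted letters matching the gold answer index."""
    correct = sum(p == CHOICES[q["answer"]] for p, q in zip(predictions, questions))
    return correct / len(questions)
```

In practice the model's next-token output (or the highest-probability letter among A–D) is taken as its prediction, and per-subject accuracies are averaged into the headline score.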

MMLU matters because it shifted the evaluation conversation from narrow task performance to general-purpose knowledge and reasoning. Early language model benchmarks often saturated quickly — models would reach near-human performance within months — but MMLU's breadth made it far more durable as a meaningful signal of capability. It became a standard reference point for comparing models like GPT-4, Claude, Gemini, and LLaMA, and its results are routinely cited in both academic papers and industry model releases.

Despite its influence, MMLU has known limitations. Critics note that high accuracy can sometimes be achieved through statistical shortcuts rather than genuine understanding, and that multiple-choice format constrains the kinds of reasoning being tested. Contamination concerns — where test questions appear in training data — have also complicated score interpretation. These limitations have spurred the development of successor benchmarks, but MMLU remains a foundational reference in the language model evaluation landscape.

Related

MLLMs (Multimodal Large Language Models)

AI systems that understand and generate content across text, images, audio, and more.

Generality: 794

LLM (Large Language Model)

Massive neural networks trained on text to understand and generate human language.

Generality: 905

Benchmark

A standardized test used to measure and compare AI model performance.

Generality: 796

LVLMs (Large Vision Language Models)

Large AI models that jointly understand and reason over images and text.

Generality: 694

L2M (Large Memory Model)

A decoder-only Transformer with addressable auxiliary memory enabling reasoning far beyond its attention window.

Generality: 189

LRM (Large Reasoning Models)

Large-scale neural systems explicitly optimized for multi-step, structured reasoning tasks.

Generality: 384