
Envisioning is an emerging technology research institute and advisory.


Multi-Agent Reinforcement Learning

Reinforcement learning that trains agents in environments with multiple simultaneously interacting participants

Year: 2016 | Generality: 750 | Added: May 19, 2026

Multi-agent reinforcement learning (MARL) is a form of reinforcement learning in which multiple agents learn concurrently in the same environment, each pursuing its own objective while affecting the states and learning signals of the others.

In MARL, the environment is inherently non-stationary from any single agent's perspective: the optimal policy for one agent changes as the other agents adapt their behavior. This creates dynamics of cooperation, competition, and negotiation that do not arise in single-agent settings.
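
The non-stationarity described above can be illustrated with a small sketch. It assumes a hypothetical two-agent, two-action coordination game; the payoff matrix `PAYOFF_A`, the two opponent policies, and the helper names are illustrative choices, not part of any particular MARL library. The point is only that agent A's best response flips once agent B's policy shifts.

```python
from typing import List, Sequence

# Hypothetical payoff matrix for agent A in a 2x2 coordination game.
# Rows are A's actions, columns are B's actions (assumption for illustration).
PAYOFF_A: List[List[float]] = [
    [1.0, 0.0],
    [0.0, 1.0],
]


def expected_payoffs(payoff: Sequence[Sequence[float]],
                     opponent_policy: Sequence[float]) -> List[float]:
    """Expected payoff of each of A's actions against a fixed mixed policy of B."""
    return [sum(p * q for p, q in zip(row, opponent_policy)) for row in payoff]


def best_response(payoff: Sequence[Sequence[float]],
                  opponent_policy: Sequence[float]) -> int:
    """Index of the action with the highest expected payoff."""
    values = expected_payoffs(payoff, opponent_policy)
    return max(range(len(values)), key=values.__getitem__)


# Early in training, B mostly plays action 0.
early_b = [0.9, 0.1]
# Later, after B has adapted, it mostly plays action 1.
late_b = [0.2, 0.8]

early_choice = best_response(PAYOFF_A, early_b)
late_choice = best_response(PAYOFF_A, late_b)

# A's best action differs between the two phases even though the
# game itself (the payoff matrix) never changed.
print(early_choice, late_choice)
```

From agent A's point of view nothing about the rules changed, yet the policy that was optimal early on is no longer optimal later. A single-agent learner that treats B as part of a fixed environment would be chasing a moving target.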

MARL provides a scalable mechanism for generating diverse interaction data in multi-agent simulations. As the number of participants increases, the joint interaction space grows combinatorially, so passively collected demonstrations cover a shrinking fraction of the meaningful interactions.
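
A rough way to see the growth described above is to count joint actions. The sketch below assumes, purely for illustration, that every agent has the same number of discrete actions and that each demonstration covers at most one distinct joint action; the function names and the figures of 5 actions and 10,000 demonstrations are hypothetical.

```python
def joint_action_count(num_agents: int, actions_per_agent: int) -> int:
    """Number of distinct joint actions when every agent picks one action."""
    return actions_per_agent ** num_agents


def coverage_upper_bound(num_agents: int,
                         actions_per_agent: int,
                         num_demonstrations: int) -> float:
    """Upper bound on the fraction of joint actions a fixed dataset can cover.

    Assumes each demonstration contributes at most one new joint action,
    so the true coverage can only be lower than this bound.
    """
    total = joint_action_count(num_agents, actions_per_agent)
    return min(1.0, num_demonstrations / total)


# With a fixed budget of demonstrations, the achievable coverage
# drops quickly as agents are added.
for n in (2, 4, 8, 16):
    total = joint_action_count(n, 5)
    bound = coverage_upper_bound(n, 5, 10_000)
    print(f"agents={n:2d} joint_actions={total} coverage<={bound:.2e}")
```

This counts only one-step joint actions. Sequences of interactions over time grow faster still, which is why the text treats learned interaction, rather than a fixed dataset, as the scalable source of data.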

Agents and world models can co-evolve through MARL, continuously pushing one another into increasingly difficult regimes and generating training data from emergent failure modes.