Envisioning is an emerging technology research institute and advisory.

WHAM (World and Human Action Model)

A framework for estimating 3D human motion and pose from video using world context.

Year: 2023
Generality: 380

WHAM (World and Human Action Model) is a machine learning framework designed to recover accurate 3D human body motion from monocular video by jointly modeling human pose, shape, and global trajectory within a coherent world coordinate system. Unlike earlier approaches that estimated body pose in camera-relative coordinates, WHAM explicitly accounts for the camera's motion and grounds human movement in a global reference frame, producing more physically plausible and temporally consistent reconstructions. This makes it particularly valuable for applications in sports analytics, animation, robotics, and any domain requiring realistic human motion capture without specialized hardware.
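The difference between camera-relative and world coordinates can be illustrated with a minimal Python sketch on a 2D ground plane. It assumes the camera's yaw and position are already known for every frame; the function names and the toy numbers are hypothetical and are not taken from WHAM itself.

```python
import math

def world_to_camera(p_world, cam_yaw, cam_pos):
    """Express a world-frame ground-plane point in the camera's frame."""
    dx, dy = p_world[0] - cam_pos[0], p_world[1] - cam_pos[1]
    c, s = math.cos(cam_yaw), math.sin(cam_yaw)
    # Apply the inverse (transposed) rotation to the offset from the camera.
    return (c * dx + s * dy, -s * dx + c * dy)

def camera_to_world(p_cam, cam_yaw, cam_pos):
    """Map a camera-frame ground-plane point back into the world frame."""
    c, s = math.cos(cam_yaw), math.sin(cam_yaw)
    x, y = p_cam
    # Rotate by the camera yaw, then translate by the camera position.
    return (c * x - s * y + cam_pos[0], s * x + c * y + cam_pos[1])

# Hypothetical scene: the subject stands still while the camera moves and turns.
subject_world = (2.0, 1.0)
camera_poses = [(0.0, (0.0, 0.0)), (0.3, (0.5, 0.0)), (0.6, (1.0, 0.2))]

# What a camera-relative estimator would see: the subject appears to move.
observed_cam = [world_to_camera(subject_world, yaw, pos) for yaw, pos in camera_poses]

# Accounting for camera motion recovers a stationary world-frame trajectory.
recovered_world = [
    camera_to_world(p, yaw, pos) for p, (yaw, pos) in zip(observed_cam, camera_poses)
]
```

In camera coordinates the stationary subject traces a changing path, while the world-frame positions stay fixed. This is the kind of ambiguity that grounding motion in a global reference frame is meant to remove.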

At its core, WHAM combines a motion encoder trained on large-scale motion capture datasets with a visual feature extractor that processes image evidence frame by frame. A recurrent architecture fuses these components and maintains temporal context, so the model produces smooth, coherent motion sequences rather than per-frame estimates that flicker or drift. A key innovation is the integration of camera motion estimation, which decouples the movement of the subject from the movement of the camera. This distinction is critical when analyzing footage captured with a moving device such as a handheld phone or drone.
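Why carrying state across frames reduces flicker can be seen in a toy example. The sketch below is not WHAM's learned recurrent network; it substitutes a one-parameter linear recurrence (exponential smoothing) and a simple second-difference jitter measure, both chosen here purely for illustration. The input values are hypothetical.

```python
def recurrent_smooth(per_frame, alpha=0.3):
    """Linear recurrence: state_t = (1 - alpha) * state_{t-1} + alpha * x_t."""
    state = per_frame[0]
    smoothed = [state]
    for x in per_frame[1:]:
        state = (1 - alpha) * state + alpha * x
        smoothed.append(state)
    return smoothed

def jitter(seq):
    """Mean absolute second difference, a rough measure of frame-to-frame flicker."""
    diffs = [abs(seq[i + 1] - 2 * seq[i] + seq[i - 1]) for i in range(1, len(seq) - 1)]
    return sum(diffs) / len(diffs)

# Hypothetical per-frame estimates of one joint coordinate:
# a slow steady motion plus noise that alternates sign on every frame.
per_frame = [0.1 * t + (0.05 if t % 2 else -0.05) for t in range(20)]
smoothed = recurrent_smooth(per_frame)

print(f"per-frame jitter: {jitter(per_frame):.3f}")
print(f"recurrent jitter: {jitter(smoothed):.3f}")
```

A fixed recurrence like this trades flicker for lag behind the true motion. A learned recurrent model can, in principle, use its motion prior to reduce flicker without simply delaying the signal.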

WHAM's significance lies in its ability to bridge the gap between controlled laboratory motion capture and real-world video analysis. Traditional motion capture requires subjects to wear marker suits in instrumented environments, making large-scale data collection expensive and impractical. WHAM enables high-quality 3D motion recovery from ordinary video, lowering the barrier to entry for downstream tasks. The framework was introduced by researchers at Carnegie Mellon University and the Max Planck Institute for Intelligent Systems. It demonstrated strong performance on standard benchmarks including 3DPW and RICH, outperforming prior methods on metrics measuring global trajectory accuracy.
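As a rough illustration of what a global trajectory metric measures, the sketch below computes the mean distance between predicted and ground-truth root positions in world coordinates and normalizes it by the length of the ground-truth path. This is a simplified, hypothetical metric for exposition, not the exact definition used by the 3DPW or RICH evaluation protocols, and the trajectories are invented.

```python
import math

def mean_root_error(pred, gt):
    """Mean Euclidean distance between predicted and ground-truth root positions."""
    if len(pred) != len(gt):
        raise ValueError("trajectories must have the same number of frames")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)

def path_length(traj):
    """Total distance travelled along a trajectory."""
    return sum(math.dist(a, b) for a, b in zip(traj, traj[1:]))

# Hypothetical world-frame root positions in meters: the subject walks 3 m along x,
# while the prediction drifts sideways by 0.1 m per frame.
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
pred = [(0.0, 0.0, 0.0), (1.0, 0.1, 0.0), (2.0, 0.2, 0.0), (3.0, 0.3, 0.0)]

error = mean_root_error(pred, gt)        # absolute error in meters
relative = error / path_length(gt)       # error as a fraction of distance travelled
```

Normalizing by path length makes errors comparable between short and long sequences, which matters because drift tends to accumulate over time.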

As video-based human motion understanding becomes increasingly central to embodied AI, virtual reality content creation, and human-computer interaction, frameworks like WHAM represent an important step toward robust, scalable perception of human movement in unconstrained environments. Its combination of data-driven priors with geometric reasoning sets a template for future work in markerless motion capture.

Related

General World Model
An AI system that builds broad internal representations of the world to predict and act.
Generality: 745

LAM (Large Action Model)
AI systems that interpret human intent and execute actions directly within digital applications.
Generality: 337

SAM (Segment Anything Model)
A promptable foundation model that segments any object in any image.
Generality: 687

VLA (Vision-Language-Action Model)
A multimodal architecture combining visual perception, language understanding, and action policies for embodied agents.
Generality: 620

Image-to-Video Model
AI system that animates static images by synthesizing realistic motion and temporal dynamics.
Generality: 521

Video-to-3D Reconstruction
AI technique that converts 2D video footage into detailed three-dimensional digital models.
Generality: 550