A self-supervised architecture that predicts representations in embedding space rather than pixel space.
Joint Embedding Predictive Architecture (JEPA) is a self-supervised learning framework in which a model learns by predicting abstract representations of data rather than reconstructing raw inputs. Proposed by Yann LeCun as a cornerstone of his vision for human-level AI, JEPA encodes two related views or segments of an input — such as different patches of an image or different time steps in a sequence — into a shared embedding space. A predictor network then learns to map one encoded representation to the other; the key insight is that prediction happens entirely in latent space, not in pixel or token space.
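The structure above can be sketched in a few lines. This is a toy illustration, not Meta's implementation: the dimensions, the random linear "encoders", and the single-matrix "predictor" are all hypothetical stand-ins (real JEPA models use deep networks such as Vision Transformers), but the flow — encode both views, predict one embedding from the other, compute the loss in embedding space — is the defining pattern.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions: inputs of size D_in, embeddings of size D_emb.
D_in, D_emb = 16, 8

# Stand-in "encoders" and "predictor" as random linear maps.
W_context = rng.normal(size=(D_in, D_emb))  # context (online) encoder weights
W_target = rng.normal(size=(D_in, D_emb))   # target encoder weights
W_pred = rng.normal(size=(D_emb, D_emb))    # predictor weights

def encode(x, W):
    return np.tanh(x @ W)  # simple nonlinear embedding

# Two related views of the same input, e.g. two patches of one image.
view_context = rng.normal(size=(D_in,))
view_target = rng.normal(size=(D_in,))

z_context = encode(view_context, W_context)  # embed the visible view
z_target = encode(view_target, W_target)     # embed the masked/target view
z_predicted = z_context @ W_pred             # predict the target *embedding*

# The loss compares embeddings, never raw pixels.
loss = np.mean((z_predicted - z_target) ** 2)
print(loss)
```

In training, gradients from this loss would update the context encoder and predictor, while the target encoder is held out of the gradient path, as described below.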
This design choice is deliberate and consequential. Generative models that reconstruct raw inputs must account for every irrelevant detail — the exact texture of a surface, the precise color of a background — which can distract from learning semantically meaningful structure. By predicting in embedding space, JEPA sidesteps this problem: the model is free to discard low-level noise and focus on higher-level patterns that are actually predictive. A stop-gradient or exponential moving average target encoder (similar to techniques used in BYOL and DINO) is typically used to stabilize training and prevent representational collapse.
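The exponential moving average (EMA) target encoder mentioned above can be sketched concretely. The parameter vectors and the simulated gradient step here are hypothetical placeholders; the decay value 0.996 is a typical choice in BYOL-style training, not a JEPA-specific constant. The point is that the target encoder's weights drift slowly toward the online encoder's weights and never receive gradients themselves, which is what prevents the trivial collapsed solution where all inputs map to the same embedding.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical online- and target-encoder parameters, flattened to vectors.
theta_online = rng.normal(size=(32,))
theta_target = theta_online.copy()

tau = 0.996  # EMA decay; values near 1 make the target move slowly

def ema_update(theta_target, theta_online, tau):
    # Target weights are a slow-moving average of the online weights.
    # No gradient ever flows into the target encoder (the stop-gradient role).
    return tau * theta_target + (1.0 - tau) * theta_online

for step in range(100):
    # In real training, theta_online is updated by gradient descent here;
    # we simulate that with small random perturbations.
    theta_online = theta_online + 0.01 * rng.normal(size=theta_online.shape)
    theta_target = ema_update(theta_target, theta_online, tau)

# The target smoothly tracks the online encoder rather than copying it.
drift = float(np.linalg.norm(theta_target - theta_online))
print(drift)
```

Because the target changes slowly, the prediction objective remains stable from step to step, which is why this trick (shared with BYOL and DINO) stabilizes training.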
Image-JEPA (I-JEPA), introduced by Meta AI in 2023, demonstrated that this approach could learn strong visual representations without relying on hand-crafted data augmentations, outperforming many contrastive and generative baselines on downstream tasks. Video-JEPA (V-JEPA) extended the framework to temporal prediction across video frames. More broadly, JEPA represents a philosophical departure from the dominant paradigm of large generative models: rather than learning to produce outputs, it learns to predict the world's structure in a compact, abstract form — a property LeCun argues is essential for building efficient, generalizable world models that can support planning and reasoning.