Envisioning is an emerging technology research institute and advisory.
JEST (Joint Example Selection for Multimodal Contrastive Learning)

A multimodal learning method that improves representation quality by strategically selecting training pairs.

Year: 2024
Generality: 0.42

JEST, or Joint Example Selection for Multimodal Contrastive Learning, is a training methodology designed to improve how models learn shared representations across different data modalities—most commonly images and text. Standard multimodal contrastive learning approaches like CLIP train on massive datasets by pulling semantically matched pairs (e.g., an image and its caption) closer together in embedding space while pushing mismatched pairs apart. JEST extends this paradigm by introducing a principled mechanism for selecting which examples to train on, rather than treating all available data as equally valuable.
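The pairwise objective described above can be sketched numerically. Below is a minimal NumPy illustration of a symmetric CLIP-style contrastive loss, where matched image/text pairs sit on the diagonal of the similarity matrix; the function name and temperature value are illustrative, not taken from the source.

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of embeddings.
    Row i of img_emb and row i of txt_emb are assumed to be a
    semantically matched pair."""
    # L2-normalize so dot products are cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature        # (B, B) similarity matrix
    labels = np.arange(len(logits))           # matched pairs on the diagonal

    def xent(l):
        # Row-wise softmax cross-entropy against the diagonal targets.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(l)), labels].mean()

    # Average the image-to-text and text-to-image directions.
    return (xent(logits) + xent(logits.T)) / 2
```

Matched pairs are pulled together because they sit on the diagonal the cross-entropy rewards; every off-diagonal pairing in the batch acts as a negative and is pushed apart.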

The core insight behind JEST is that not all training pairs contribute equally to representation quality. Noisy, ambiguous, or redundant pairs can slow convergence and degrade the learned embedding space. JEST addresses this by scoring candidate batches of examples using a reference model—typically a smaller, pretrained scorer—and preferentially sampling batches that are both learnable and semantically coherent. This joint selection operates at the batch level rather than the individual example level, allowing the method to account for inter-example relationships and diversity within each training step. The result is a data-efficient training regime that extracts more signal from the same underlying dataset.
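As a rough sketch of the scoring idea only (the published method scores whole batches jointly, accounting for inter-example interactions in the contrastive loss matrix, rather than ranking examples independently), one can rank candidates by a learnability score: the gap between the learner's loss and the reference model's loss. All names and values here are illustrative assumptions.

```python
import numpy as np

def select_learnable_batch(learner_loss, reference_loss, batch_size):
    """Pick the most 'learnable' sub-batch from a larger candidate pool.

    learner_loss / reference_loss: per-candidate contrastive losses under
    the model being trained and a smaller pretrained reference model.
    A high score means the pair is hard for the learner but easy for the
    reference, filtering out both noisy pairs (hard for both models) and
    already-mastered pairs (easy for both)."""
    learnability = np.asarray(learner_loss) - np.asarray(reference_loss)
    # Keep the top-`batch_size` candidates by learnability score.
    top = np.argsort(learnability)[::-1][:batch_size]
    return np.sort(top)
```

This per-example top-k is a deliberate simplification: the batch-level selection described above would instead evaluate candidate batches as wholes, so that diversity within the batch also influences the score.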

In practice, JEST has demonstrated significant improvements in training efficiency and downstream task performance on benchmarks involving image-text retrieval, zero-shot classification, and cross-modal understanding. By concentrating gradient updates on high-quality, informative batches, models trained with JEST can match or exceed the performance of standard contrastive approaches while using substantially less compute or data. This makes it particularly relevant as the field grapples with the cost and quality challenges of web-scale multimodal pretraining.

JEST represents a broader trend in machine learning toward treating data curation and selection as first-class training decisions rather than afterthoughts. Instead of simply scaling data volume, methods like JEST ask which data should be used and when—a question that has become increasingly central to building capable and efficient foundation models across modalities.

Related

Joint Embedding Architecture

A neural network design that maps multiple data modalities into a shared representational space.

Generality: 0.65
Contrastive Learning

A self-supervised technique that learns representations by comparing similar and dissimilar data pairs.

Generality: 0.68
CLIP (Contrastive Language–Image Pre-training)

OpenAI model that learns visual concepts by aligning images with natural language descriptions.

Generality: 0.72
JEPA (Joint Embedding Predictive Architecture)

A self-supervised architecture that predicts representations in embedding space rather than pixel space.

Generality: 0.55
Non-Contrastive Learning

Self-supervised representation learning that requires no negative example pairs.

Generality: 0.62
Multimodal

AI systems that process and integrate multiple data types like text, images, and audio.

Generality: 0.72