Multimodal Foundation Models

AI models that process text, images, audio, and video in a unified representation space

Multimodal foundation models process and understand information across multiple modalities—text, images, audio, video, and potentially actions or internal states—within a unified representation space. These models learn to map different types of information into shared embedding spaces, enabling them to understand relationships between modalities, translate between them, and reason about concepts that span multiple sensory channels.
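One common recipe for learning such a shared space is contrastive alignment in the style of CLIP: each modality gets its own encoder, both project into the same embedding dimension, and matched pairs are pulled together while mismatched pairs are pushed apart. The sketch below is illustrative only; the linear "encoders," dimensions, and random placeholder features are assumptions chosen for brevity, not the architecture of any model named on this page.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedEmbeddingSpace(nn.Module):
    """Toy two-tower model: separate per-modality encoders projecting
    into one shared, L2-normalized embedding space."""

    def __init__(self, image_dim=2048, text_dim=768, embed_dim=512):
        super().__init__()
        # Linear layers stand in for full vision and text encoders.
        self.image_proj = nn.Linear(image_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)
        # Learnable temperature for the contrastive loss (stored in log space).
        self.log_temp = nn.Parameter(torch.tensor(2.659))

    def forward(self, image_feats, text_feats):
        img = F.normalize(self.image_proj(image_feats), dim=-1)
        txt = F.normalize(self.text_proj(text_feats), dim=-1)
        return img, txt

def contrastive_loss(img, txt, log_temp):
    # Cosine similarities between every image and every text in the batch;
    # matched pairs sit on the diagonal (InfoNCE-style objective).
    logits = log_temp.exp() * img @ txt.t()
    targets = torch.arange(img.size(0))
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

model = SharedEmbeddingSpace()
image_feats = torch.randn(8, 2048)  # placeholder encoder outputs
text_feats = torch.randn(8, 768)
img_emb, txt_emb = model(image_feats, text_feats)
loss = contrastive_loss(img_emb, txt_emb, model.log_temp)
```

Once trained at scale, nearest-neighbor lookups in a space like this are what make cross-modal retrieval and translation between modalities possible.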

This innovation addresses a core limitation of single-modality AI systems, which can process only one type of information and therefore struggle with tasks that require relating content across modalities. By learning unified representations, multimodal models can recognize that a picture of a cat, the word "cat," and the sound of meowing all refer to the same concept, enabling richer understanding and interaction. Models such as GPT-4V, Claude, and Gemini demonstrate these capabilities by processing text and images together.
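As a concrete illustration of that shared "cat" concept, a pretrained image-text embedding model can score how well an image matches different captions without any task-specific training. The snippet below uses the openly available CLIP checkpoint on Hugging Face as a small stand-in for the larger systems named above; the image path is a placeholder.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat.jpg")  # placeholder path to any local photo
captions = ["a photo of a cat", "a photo of a dog", "a diagram of a circuit board"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Higher probability means the caption sits closer to the image in the
# shared embedding space; for a cat photo, the first caption should win.
probs = outputs.logits_per_image.softmax(dim=-1)
for caption, p in zip(captions, probs[0]):
    print(f"{p:.2f}  {caption}")
```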

The technology is fundamental to creating AI agents that can interact with the real world, which inherently involves multiple sensory modalities. As AI systems become more capable and are deployed in applications requiring real-world interaction, from robotics to virtual assistants to content creation, multimodal understanding becomes essential. Research continues to explore how to incorporate further modalities such as audio, video, actions, and potentially even internal states or proprioception, moving toward more comprehensive world understanding.

TRL: 7/9 (Operational)
Impact: 5/5
Investment: 5/5
Category: Software

Related Organizations

  • Google DeepMind · United Kingdom · Research Lab · Developer · 100%
    Developers of the Gemini family of models, which are trained from the start to be multimodal across text, images, video, and audio.
  • OpenAI · United States · Company · Developer · 100%
    Creator of GPT-4o, a natively multimodal model capable of reasoning across audio, vision, and text in real-time.
  • Twelve Labs · United States · Startup · Developer · 95%
    Building foundation models specifically for video understanding, enabling semantic search and summarization of video content.
  • NVIDIA · United States · Company · Developer · 90%
    Developing foundation models for robotics (Project GR00T) and vision-language models like VILA.
  • Reka AI · United States · Startup · Developer · 90%
    Founded by former DeepMind and Meta researchers to build state-of-the-art multimodal language models (Reka Core, Flash, Edge).
  • 01.AI · China · Startup · Developer · 85%
    Founded by Kai-Fu Lee, developing the Yi series of open-source models, including Yi-VL (Vision Language).
  • Alibaba Cloud · China · Company · Developer · 85%
    Cloud computing arm of Alibaba Group.
  • Luma AI · United States · Startup · Developer · 85%
    Creators of Dream Machine, a high-quality video generation model, and 3D capture technology.
  • Salesforce Research · United States · Research Lab · Developer · 80%
    Developers of the BLIP (Bootstrapping Language-Image Pre-training) family of models and XGen.

Supporting Evidence

Evidence data is not available for this technology yet.

Connections

  • Modular Robotics Platforms · Hardware
    Robots built from swappable components and sensors for rapid task reconfiguration
    TRL: 6/9 · Impact: 4/5 · Investment: 4/5
