A large-scale pretrained model providing general-purpose capabilities across diverse robotic tasks.
A Robotics Foundation Model (RFM) is a large-scale, pretrained neural network designed to encode broad knowledge about perception, motor control, manipulation, and environment interaction in a form that can be adapted to a wide range of robotic applications. Drawing direct inspiration from foundation models in natural language processing and computer vision, RFMs aim to serve as reusable backbones that downstream robotic systems can fine-tune or prompt rather than train from scratch. This paradigm shift reflects growing recognition that the data-efficiency and generalization problems plaguing robotics may be partially addressed by scaling up pretraining across diverse embodied experiences.
RFMs typically learn from heterogeneous data sources — including robot teleoperation demonstrations, simulation rollouts, video of human activity, and sensor logs — to build representations that transfer across robot morphologies and task domains. Architecturally, many RFMs adopt transformer-based designs, sometimes incorporating multimodal inputs such as camera images, depth maps, proprioceptive signals, and natural language instructions. Models like RT-2, Octo, and OpenVLA exemplify this approach, using vision-language pretraining to ground robotic policies in semantic understanding of the physical world.
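To make the multimodal design concrete, the sketch below projects image, proprioceptive, and language inputs into a shared token space and pools them with a single attention layer before decoding an action. Every dimension, projection matrix, and the `policy` function itself are hypothetical stand-ins chosen for illustration; this is a minimal toy, not the actual architecture of RT-2, Octo, or OpenVLA.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration; real RFMs are orders of magnitude larger.
D = 32          # shared token embedding size
N_ACTIONS = 7   # e.g. 6-DoF end-effector delta + gripper command

# Per-modality encoders stubbed as random linear projections into one token
# space (in practice: ViT patch embeddings, a language model, small MLPs).
W_img  = rng.normal(size=(64, D))   # flattened image features -> tokens
W_prop = rng.normal(size=(8, D))    # proprioceptive state -> one token
W_lang = rng.normal(size=(16, D))   # instruction embedding -> one token

def tokenize(image_feat, proprio, instr_emb):
    """Project each modality into the shared token space and concatenate."""
    img_tok  = image_feat @ W_img          # (T_img, D)
    prop_tok = (proprio @ W_prop)[None]    # (1, D)
    lang_tok = (instr_emb @ W_lang)[None]  # (1, D)
    return np.concatenate([img_tok, prop_tok, lang_tok], axis=0)

def self_attention(tokens):
    """One unparameterized attention layer (queries = keys = values)."""
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[1])
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)
    return attn @ tokens

W_action = rng.normal(size=(D, N_ACTIONS))        # action decoding head

def policy(image_feat, proprio, instr_emb):
    """Map one multimodal observation to a continuous action vector."""
    tokens = tokenize(image_feat, proprio, instr_emb)
    pooled = self_attention(tokens).mean(axis=0)  # pool over all tokens
    return pooled @ W_action                      # (N_ACTIONS,)

action = policy(rng.normal(size=(4, 64)),  # 4 image "patch" feature vectors
                rng.normal(size=8),        # joint angles etc.
                rng.normal(size=16))       # language instruction embedding
print(action.shape)
```

The key design point the toy preserves is that all modalities become tokens in one sequence, so the same attention mechanism can relate, say, a language instruction to the relevant image region.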
The appeal of RFMs lies in their potential to dramatically reduce the cost of deploying robots in new settings. Rather than collecting thousands of task-specific demonstrations for every new environment, practitioners can fine-tune a pretrained RFM with relatively few examples. This mirrors the impact that BERT and GPT had on NLP, where a single pretrained model became the starting point for hundreds of downstream applications. For robotics, the stakes are particularly high because data collection is physically expensive and safety-critical.
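The few-shot adaptation workflow described above can be sketched as follows: a frozen stand-in "backbone" supplies features, and only a small linear action head is fit to a handful of demonstrations. The random feature map, synthetic demo data, and learning rate are all invented for illustration and do not reflect any particular RFM's fine-tuning recipe.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for a frozen pretrained backbone: a fixed random feature map.
# In practice this would be the RFM's encoder, with weights held constant.
W_backbone = rng.normal(size=(20, 64)) / np.sqrt(20)

def backbone(obs):
    """Frozen pretrained encoder; its weights are never updated."""
    return np.tanh(obs @ W_backbone)

# A handful of task-specific demonstrations: (observation, action) pairs.
obs_demo = rng.normal(size=(16, 20))          # 16 demos, 20-dim observations
true_head = rng.normal(size=(64, 4))          # unknown target mapping
act_demo = backbone(obs_demo) @ true_head     # synthetic "expert" actions

# Fine-tune only a small linear action head on top of the frozen features.
head = np.zeros((64, 4))
lr = 0.1
for _ in range(500):
    feats = backbone(obs_demo)                # (16, 64), backbone untouched
    pred = feats @ head
    grad = feats.T @ (pred - act_demo) / len(obs_demo)
    head -= lr * grad

mse = np.mean((backbone(obs_demo) @ head - act_demo) ** 2)
print(f"demo-fit MSE: {mse:.2e}")
```

Because only the head's 64x4 parameters are trained, 16 demonstrations suffice to fit this toy task; the same logic motivates freezing most of a real RFM and adapting a small set of weights (or using parameter-efficient methods) when demonstrations are scarce.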
Despite their promise, RFMs face challenges that do not arise as sharply in purely digital domains. Physical embodiment introduces distribution shift between simulation and the real world, variability in hardware, and the need for real-time inference under strict latency constraints. Evaluating generalization remains difficult because robotic tasks are harder to benchmark at scale than text or image tasks. Nevertheless, RFMs represent one of the most active and consequential research frontiers in modern AI, with significant investment from both academic labs and industry.