Definition
A foundation model is a large neural network trained on broad, diverse data at scale, then adapted to specific downstream tasks through fine-tuning, prompting, or in-context learning. The term was coined in 2021 by Stanford's Center for Research on Foundation Models (CRFM), part of the Institute for Human-Centered AI (HAI), to describe models like GPT, CLIP, and DALL-E that serve as a common base for many applications. In robotics, foundation models aim to provide general visual, semantic, and physical understanding that transfers across robot embodiments, tasks, and environments.
The promise of foundation models for robotics is to escape the single-task paradigm. Traditional robot policies are trained from scratch for each new task, each new robot, and each new environment. A foundation model, by contrast, encodes broad knowledge — what objects look like, how they behave physically, what language instructions mean — that can be quickly adapted to new situations with minimal additional data. This mirrors the success of foundation models in NLP and computer vision, where a single pre-trained model powers hundreds of downstream applications.
In practice, robotics foundation models take several forms: Vision-Language-Action models (VLAs) that directly output robot actions, world models that predict future states for planning, and vision-language models used as perception backbones or reward labelers for reinforcement learning. Each represents a different way to inject broad pre-trained knowledge into the robot learning pipeline.
Types of Foundation Models in Robotics
- Vision-Language-Action models (VLAs) — Accept images and language instructions and output robot actions. RT-2, OpenVLA, and π0 are the leading examples. These are the most direct application of foundation models to robot control. See the dedicated VLA & VLM glossary entry for details.
- Generalist policies — Trained on cross-embodiment robot data to provide a base policy that can be fine-tuned to specific robots. Octo (800K episodes, 9 robot types) and RT-X are examples. They may or may not include language conditioning.
- World models — Predict the future state of the environment given the current state and a candidate action. Used for model-based planning and for generating synthetic training data. UniSim and Genie are early examples. World models for robotics must capture physical dynamics (gravity, friction, contact), which remains challenging.
- Vision-language backbones — Pre-trained models like CLIP, SigLIP, and DINOv2 used as frozen or fine-tuned feature extractors in robot perception pipelines. They provide rich visual representations without being robot-specific. Most VLAs use one of these as their vision encoder.
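The planning role of a world model can be illustrated with a toy random-shooting planner. In the sketch below, a hand-written 1-D dynamics function stands in for a learned model, and all function names are illustrative:

```python
import numpy as np

def toy_world_model(state, action):
    """Stand-in for a learned dynamics model: a point mass on a line,
    nudged by a bounded action. A real robotics world model would
    predict future images or latent states instead."""
    return state + 0.1 * np.clip(action, -1.0, 1.0)

def plan_first_action(state, goal, rng, horizon=5, n_samples=256):
    """Random-shooting planning: sample candidate action sequences,
    roll each through the world model, keep the sequence with the
    lowest cumulative distance to the goal, return its first action."""
    candidates = rng.uniform(-1.0, 1.0, size=(n_samples, horizon))
    best_cost, best_action = np.inf, 0.0
    for seq in candidates:
        s, cost = state, 0.0
        for a in seq:
            s = toy_world_model(s, a)
            cost += abs(s - goal)
        if cost < best_cost:
            best_cost, best_action = cost, seq[0]
    return best_action

# Receding-horizon control: execute one action, observe, replan.
rng = np.random.default_rng(0)
state, goal = 0.0, 1.0
for _ in range(30):
    state = toy_world_model(state, plan_first_action(state, goal, rng))
```

Executing only the first action of the best sequence and replanning at every step is receding-horizon (model-predictive) control; real systems replace the toy dynamics with the learned world model and often use a smarter sampler than uniform shooting.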
How They Differ from Task-Specific Policies
A task-specific policy (e.g., an ACT model trained to fold towels) learns everything from 50–200 demonstrations of that specific task. It has no knowledge of other tasks, other objects, or other robots. If the task changes even slightly (different towel, different table height), the policy may fail and need retraining.
A foundation model starts with knowledge from millions of internet images, billions of text tokens, and potentially hundreds of thousands of robot demonstrations across many embodiments. When fine-tuned on 50–200 demonstrations of towel folding, it retains its broader understanding: it knows what towels look like from many angles, understands the instruction "fold the towel in half," and has seen similar deformable-object manipulation on other robots. This background knowledge enables generalization to new towel colors, sizes, and table configurations without additional data.
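This fine-tuning step is often done with parameter-efficient methods such as LoRA (OpenVLA, for example, supports LoRA fine-tuning), which freeze the large pre-trained weights and train only small low-rank adapters. A minimal numpy sketch of the idea, with illustrative dimensions:

```python
import numpy as np

# Illustrative sizes; a real VLA layer is wider (e.g. 4096).
d, r, alpha = 1024, 8, 16          # hidden width, adapter rank, scaling

rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(d, d))   # frozen pre-trained weight
A = rng.normal(scale=0.01, size=(r, d))   # trainable down-projection
B = np.zeros((d, r))                      # trainable up-projection, zero-init

def adapted_forward(x):
    """Forward pass with a LoRA-style low-rank update on a frozen base."""
    return x @ W.T + (alpha / r) * ((x @ A.T) @ B.T)

# Only A and B are updated during fine-tuning.
trainable_fraction = (A.size + B.size) / W.size
print(f"{trainable_fraction:.4f}")  # 2*r/d = 16/1024 → 0.0156
```

Because B starts at zero, the adapted model initially behaves exactly like the pre-trained base, and fine-tuning only perturbs it through a tiny fraction of the parameters — which is what makes adaptation feasible on modest hardware.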
The trade-off is clear: foundation models are larger (3–55B parameters vs. 1–50M for task-specific), slower at inference (5–15 Hz vs. 50–200 Hz), and require more compute for fine-tuning (4–8 A100 GPUs vs. a single RTX 4090). For well-defined, high-frequency tasks in controlled environments, task-specific policies remain the practical choice. For diverse, language-conditioned tasks in unstructured environments, foundation models offer a compelling path.
Data Requirements
Pre-training data: Foundation models for robotics require two types of pre-training data. Internet-scale image-text data (billions of pairs, from datasets like LAION and WebLI) provides visual and semantic understanding. Cross-embodiment robot data (hundreds of thousands of episodes from datasets like Open X-Embodiment) provides physical interaction understanding. Collecting the robot data is the bottleneck — the Open X-Embodiment dataset represents a community-wide effort spanning dozens of labs and years of collection.
Fine-tuning data: Adapting a foundation model to a new robot or task requires 100–1,000 teleoperated demonstrations. This is more than a task-specific policy needs (50–200), but the resulting model is far more generalizable. The demonstrations must include language annotations describing the task.
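A single timestep of such a demonstration might be structured as below. The field names here are hypothetical, loosely modeled on LeRobot-style episode layouts; the point is that every step carries observations, the teleoperator's action, and the language annotation VLA training requires:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class DemonstrationStep:
    """One timestep of a teleoperated demonstration (illustrative schema)."""
    image: np.ndarray   # camera frame, e.g. (H, W, 3) uint8
    proprio: np.ndarray # joint positions / gripper state
    action: np.ndarray  # teleoperator command at this step
    language: str       # task annotation, required for VLA fine-tuning

step = DemonstrationStep(
    image=np.zeros((224, 224, 3), dtype=np.uint8),
    proprio=np.zeros(7),
    action=np.zeros(7),
    language="fold the towel in half",
)
```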
Why internet-scale data matters: A model trained only on robot data has a narrow view of the world: it knows what objects look like from the robot's camera angles, in the robot's workspace. A model that has also seen millions of internet images knows what a "red cup" looks like from every angle, in every lighting condition, in every context. This visual grounding is what enables zero-shot generalization to novel objects.
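The mechanism behind this zero-shot grounding is CLIP-style similarity scoring between image and text embeddings. A toy numpy sketch with stand-in embedding vectors (a real system would use the outputs of actual CLIP or SigLIP encoders):

```python
import numpy as np

def cosine_scores(image_emb, text_embs):
    """CLIP-style open-vocabulary scoring: cosine similarity between
    one image embedding and a set of text-prompt embeddings."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return txt @ img

# Stand-in embeddings; real ones come from the pre-trained encoders.
rng = np.random.default_rng(0)
prompts = ["a red cup", "a blue towel", "a green block"]
text_embs = rng.normal(size=(3, 64))
image_emb = text_embs[0] + 0.1 * rng.normal(size=64)  # image of a red cup

scores = cosine_scores(image_emb, text_embs)
print(prompts[int(np.argmax(scores))])  # → a red cup
```

Because the encoders were trained on millions of image-text pairs, a novel object the robot has never manipulated can still be localized by matching against a text prompt.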
Current Limitations
Physical grounding: Foundation models pre-trained on internet data understand visual appearance and language but lack deep physical intuition. They know what a glass looks like but may not predict that it will shatter if gripped too hard. Physical understanding must come from robot interaction data, which is orders of magnitude scarcer than internet data.
Inference latency: Models with 3–55B parameters run at 5–15 Hz on current hardware, compared to 50–200 Hz for lightweight task-specific policies. This limits their use in tasks requiring fast reactive control (catching, high-speed assembly). Model distillation and specialized inference hardware are active research areas.
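One common mitigation is action chunking: each slow policy call emits a short sequence of future actions, which a fast control loop plays back between inferences. A small simulation of the arithmetic (the rates and chunk sizes below are illustrative):

```python
def control_loop_with_chunks(policy_hz, control_hz, chunk_size, seconds=1.0):
    """Count the fraction of control ticks that receive an action when
    each (slow) policy call returns a chunk of `chunk_size` actions."""
    ticks = int(seconds * control_hz)
    ticks_per_inference = int(control_hz / policy_hz)
    available, served, next_inference = 0, 0, 0
    for t in range(ticks):
        if t >= next_inference:          # slow policy call completes
            available += chunk_size
            next_inference = t + ticks_per_inference
        if available > 0:                # fast loop consumes one action
            available -= 1
            served += 1
    return served / ticks

# A 7 Hz policy serving a 50 Hz loop: chunks of 8 cover every tick,
# chunks of 4 leave the loop starved part of the time.
print(control_loop_with_chunks(7, 50, chunk_size=8))
print(control_loop_with_chunks(7, 50, chunk_size=4))
```

The trade-off is that actions deep in a chunk are computed from stale observations, so chunking buys throughput at the cost of reactivity — which is why it helps with quasi-static manipulation but not with catching.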
Embodiment gap: A model trained on data from robot type A does not automatically work on robot type B with different kinematics, cameras, and action spaces. Cross-embodiment training helps but does not eliminate this gap entirely. Fine-tuning on the target embodiment remains necessary.
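Part of bridging this gap is purely mechanical: cross-embodiment models typically train in a normalized action space, with per-robot statistics used to decode outputs for each embodiment (Octo and OpenVLA use schemes similar in spirit; exact details vary by model). A sketch with hypothetical statistics:

```python
import numpy as np

def normalize_actions(actions, stats):
    """Map raw robot actions into a shared normalized space using
    per-embodiment dataset statistics."""
    return (actions - stats["mean"]) / stats["std"]

def denormalize_actions(norm_actions, stats):
    """Map normalized model outputs back into a target robot's
    native action space."""
    return norm_actions * stats["std"] + stats["mean"]

# Hypothetical per-robot statistics computed from each robot's dataset.
robot_a = {"mean": np.array([0.0, 0.0]), "std": np.array([0.05, 0.5])}
robot_b = {"mean": np.array([0.1, 0.0]), "std": np.array([0.20, 1.0])}

# The same normalized "intent" decodes to each robot's scale.
norm = normalize_actions(np.array([0.05, 0.5]), robot_a)
print(denormalize_actions(norm, robot_b))  # → [0.3 1. ]
```

Note this only aligns action scales and offsets; it does not resolve differences in kinematics, camera placement, or gripper morphology, which is why fine-tuning on the target embodiment remains necessary.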
Key Papers
- Bommasani, R. et al. (2021). "On the Opportunities and Risks of Foundation Models." Stanford HAI. The landmark report that defined the foundation model concept and analyzed its implications across domains, including robotics.
- Brohan, A. et al. (2023). "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control." CoRL 2023. The first large-scale demonstration that internet pre-training transfers to robot control through a VLA architecture.
- Open X-Embodiment Collaboration (2024). "Open X-Embodiment: Robotic Learning Datasets and RT-X Models." ICRA 2024. Created the data foundation for cross-embodiment robotics foundation models by aggregating 970K episodes across 22 robot types.
Related Terms
- VLA & VLM — The most common foundation model architecture for robot control
- Policy Learning — Foundation models are one approach to learning robot policies
- Sim-to-Real Transfer — Foundation models can reduce the sim-to-real gap through pre-trained visual representations
- Action Chunking (ACT) — A task-specific alternative to foundation model approaches
- Diffusion Policy — Used as the action head in some foundation model architectures
Deploy Foundation Models at SVRC
Silicon Valley Robotics Center provides multi-GPU clusters for foundation model fine-tuning, large-scale teleoperation data collection campaigns, and expert guidance on when to use a foundation model versus a lightweight task-specific policy. Our data platform manages datasets in Open X-Embodiment and LeRobot formats for seamless model training.