A comprehensive list of excellent research papers, models, datasets, and other resources on Vision-Language-Action (VLA) models in robotics. The relevant survey paper will be released soon. Contributions are welcome!
- 1. What are VLA Models in Robotics?
- 2. Survey papers
- 3. Key VLA Models and Research Papers
- 4. Datasets and Benchmarks
- 5. Challenges and Future Directions
- 6. Related Awesome Lists
-
Definition: Vision-Language-Action (VLA) models are a class of multimodal AI systems specifically designed for robotics and Embodied AI. They integrate visual perception (from cameras/sensors), natural language understanding (from text or voice commands), and action generation (physical movements or digital tasks) into a unified framework. Unlike traditional robotic systems that often treat perception, planning, and control as separate modules, VLAs aim for end-to-end or tightly integrated processing, similar to how the human brain processes these modalities simultaneously. The term "VLA" gained prominence with the introduction of the RT-2 model. Generally, a VLA is defined as any model capable of processing multimodal inputs (vision, language) to generate robotic actions for completing embodied tasks.
-
Core Concepts: The basic idea is to leverage the powerful capabilities of large models (LLMs and VLMs) pre-trained on internet-scale data and apply them to robot control. This involves "grounding" language instructions and visual perception information in the physical world to generate appropriate robot actions. The goal is to achieve greater versatility, dexterity, generalization ability, and robustness compared to traditional methods or early reinforcement learning approaches, enabling robots to work effectively in complex, unstructured environments like homes.
-
Key Components:
- Vision Encoder: Processes raw visual input (images, videos, sometimes 3D data), using architectures like ViT, CLIP encoders, DINOv2, SigLIP to extract meaningful features (object recognition, spatial reasoning).
- Language Understanding: Employs LLM components (such as Llama, PaLM, GPT variants) to process natural language instructions, map commands to context, and perform reasoning.
- Action Decoder/Policy: Generates robot actions based on integrated visual and language understanding (e.g., end-effector pose, joint velocities, gripper commands, base movements). This is a significant differentiator between VLAs and VLMs, involving techniques like action tokenization, diffusion models, or direct regression.
- Alignment Mechanisms: Uses strategies like projection layers and cross-attention to bridge the gap between different modalities, aligning visual, language, and action representations.
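The four components above can be sketched as one toy forward pass. Everything below is an illustrative stand-in (a linear projection for the vision encoder, an embedding table for the language model, additive fusion instead of cross-attention), not any published architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

D = 64        # shared embedding width (real VLAs: thousands)
ACT_DIM = 7   # e.g. 6-DoF end-effector delta + gripper open/close

# Hypothetical stand-ins for the components described above.
W_vis = rng.standard_normal((3 * 32 * 32, D)) * 0.01   # "vision encoder": flatten + project
W_lang = rng.standard_normal((100, D)) * 0.01          # "language model": embedding table
W_act = rng.standard_normal((D, ACT_DIM)) * 0.01       # "action decoder" head

def vla_forward(image, token_ids):
    """Fuse visual and language features, then regress a continuous action."""
    vis_feat = image.reshape(-1) @ W_vis         # one visual feature vector
    lang_feat = W_lang[token_ids].mean(axis=0)   # pooled instruction embedding
    fused = np.tanh(vis_feat + lang_feat)        # stand-in for the alignment mechanism
    return fused @ W_act                         # continuous action vector

image = rng.random((3, 32, 32))                  # RGB observation
tokens = np.array([5, 17, 42])                   # tokenized instruction
action = vla_forward(image, tokens)              # shape (7,)
```

Real systems replace each stand-in with a large pre-trained module (ViT/SigLIP, an LLM, a diffusion or autoregressive action head), but the data flow is the same.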
-
Relationship to VLMs and Embodied AI: VLAs are a specialized category within the field of Embodied AI. They extend Vision-Language Models (VLMs) by explicitly incorporating action generation capabilities. VLMs primarily focus on understanding and generating text based on visual input, while VLAs leverage this understanding to interact with the physical world.
-
Evolution from VLM Adaptation to Integrated Systems: Early VLA research focused mainly on adapting existing VLMs by fine-tuning them to output action tokens (e.g., the initial concept of RT-2). The field is now moving toward more integrated architectures in which the action-generation components are more sophisticated and co-designed (e.g., diffusion policies, specialized action modules, and hierarchical systems like Helix or NaVILA). This shift means the definition of VLA is evolving from merely fine-tuning VLMs to designing dedicated VLA architectures that better address the unique requirements of robot action generation, while still leveraging the capabilities of VLMs.
-
[2025] A Survey on Efficient Vision-Language-Action Models [paper]
-
[2025] Vision-Language-Action Models for Robotics: A Review Towards Real-World Applications [paper]
-
[2025] Pure Vision Language Action (VLA) Models: A Comprehensive Survey [paper]
-
[2025] Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey [paper] [project]
-
[2025] A Survey on Vision-Language-Action Models: An Action Tokenization Perspective [paper]
-
[2025] Foundation Model Driven Robotics: A Comprehensive Review [paper]
-
[2025] A Survey on Vision-Language-Action Models for Autonomous Driving [paper] [project]
-
[2025] Parallels Between VLA Model Post-Training and Human Motor Learning: Progress, Challenges, and Trends [paper] [project]
-
[2025] A Survey on Vision-Language-Action Models for Embodied AI. [paper]
-
[2025] Foundation Models in Robotics: Applications, Challenges, and the Future [paper] [project]
-
[2025] Vision Language Action Models in Robotic Manipulation: A Systematic Review [paper]
-
[2025] Vision-Language-Action Models: Concepts, Progress, Applications and Challenges [paper]
-
[2025] OpenHelix: A Short Survey, Empirical Analysis, and Open-Source Dual-System VLA Model for Robotic Manipulation [paper][project]
-
[2025] Exploring Embodied Multimodal Large Models: Development, Datasets, and Future Directions [paper]
-
[2025] Multimodal Fusion and Vision-Language Models: A Survey for Robot Vision [paper] [project]
-
[2025] Generative Artificial Intelligence in Robotic Manipulation: A Survey [paper] [project]
-
[2025] Neural Brain: A Neuroscience-inspired Framework for Embodied Agents [paper] [project]
-
[2024] Aligning Cyber Space with Physical World: A Comprehensive Survey on Embodied AI. [paper]
-
[2024] A Survey on Robotics with Foundation Models: toward Embodied AI. [paper]
-
[2024] What Foundation Models can Bring for Robot Learning in Manipulation: A Survey. [paper]
-
[2024] Towards Generalist Robot Learning from Internet Video: A Survey. [paper]
-
[2024] Large Multimodal Agents: A Survey. [paper]
-
[2024] A Survey on Integration of Large Language Models with Intelligent Robots. [paper]
-
[2024] Vision-Language Models for Vision Tasks: A Survey. [paper]
-
[2024] A Survey of Embodied Learning for Object-Centric Robotic Manipulation [paper]
-
[2024] Vision-language navigation: a survey and taxonomy [paper]
-
[2023] Toward general-purpose robots via foundation models: A survey and meta-analysis. [paper]
-
[2023] Robot learning in the era of foundation models: A survey. [paper]
This section is the heart of the resource, listing specific VLA models and influential research papers. Papers are first categorized by major application area, then by key technical contributions. A paper/model may appear in multiple subsections if it is relevant to several categories.
| Model Name | Key Contribution / Features | Base VLM / Architecture | Action Generation Method | Paper / Project / Code |
|---|---|---|---|---|
| RT-1 | First large-scale Transformer robot model; demonstrates scalability on multi-task real-world data; action discretization | Transformer (EfficientNet-B3 vision) | Action binning + token output | arXiv / Project / Code |
| RT-2 | Transfers web knowledge of VLMs to robot control; jointly fine-tunes a VLM to output action tokens; shows emergent generalization | PaLI-X / PaLM-E (Transformer) | Action binning + token output | arXiv / Project |
| PaLM-E | Embodied multimodal language model; injects continuous sensor data (images, state) into a pre-trained LLM; usable for sequential manipulation planning, VQA, etc. | PaLM (Transformer) | Outputs subgoals or action descriptions | ICML / Project |
| OpenVLA | Open-source 7B-parameter VLA; based on Llama 2; trained on the OpenX dataset; outperforms RT-2-X; strong generalization and PEFT support | Llama 2 (DINOv2 + SigLIP vision) | Action binning + token output | arXiv / Project / Code / HF |
| Helix | General-purpose VLA for humanoid robots; hierarchical architecture (System 1/2); full-body control; multi-robot collaboration; onboard deployment | Custom VLM (System 2) + visuomotor policy (System 1) | Continuous action output (System 1) | Paper / Project |
| π0 (Pi-Zero) | General-purpose VLA; uses flow matching to generate continuous action trajectories (50 Hz); cross-platform training (7 platforms, 68 tasks) | PaliGemma (Transformer) + action expert | Flow matching | arXiv / Project / Code / HF |
| Octo | General-purpose robot model; trained on the OpenX dataset; flexible input/output conditioning; often used as a baseline | Transformer (ViT) | Action binning + token output / diffusion head | arXiv / Project / Code |
| SayCan | Grounds LLM planning in robot affordances; uses an LLM to score skill relevance plus a value function to score executability | PaLM (Transformer) + value function | Selects pre-defined skills (high-level planner) | arXiv / Project / Code |
| NaVILA | Two-stage framework for legged-robot VLN; high-level VLA outputs mid-level language actions, low-level visuomotor policy executes them | InternVL-Chat-V1.5 (VLM) + locomotion policy (RL) | Mid-level language action output (VLA) | arXiv / Project |
| VLAS | First end-to-end VLA with direct integration of speech commands; based on LLaVA; three-stage fine-tuning for voice commands; supports personalized tasks (Voice RAG) | LLaVA (Transformer) + speech encoder | Action binning + token output | arXiv |
| CoT-VLA | Incorporates explicit visual chain-of-thought reasoning; predicts future goal images before generating actions; hybrid attention mechanism | Llama 2 (ViT vision) | Action binning + token output (after predicting visual goals) | arXiv / Project |
| TinyVLA | Compact, fast, and data-efficient VLA; requires no pre-training; pairs a small VLM with a diffusion-policy decoder | MobileVLM V2 / Moondream2 + diffusion policy decoder | Diffusion policy | arXiv / Project |
| CogACT | Componentized VLA architecture; specialized action module (Diffusion Action Transformer) conditioned on VLM output; significantly outperforms OpenVLA / RT-2-X | InternVL-Chat-V1.5 (VLM) + Diffusion Action Transformer | Diffusion policy | arXiv / Project |
| TLA | Tactile-Language-Action (TLA) model; grounds sequential tactile feedback via cross-modal language grounding for robust policy generation in contact-rich scenarios | Qwen2-VL + Qwen2 7B (LoRA) | Token output | arXiv / Project |
| OpenVLA-OFT | Optimized Fine-Tuning (OFT) recipe for OpenVLA; continuous actions with parallel decoding improve inference speed and task success | Llama 2 (DINOv2 + SigLIP vision) | L1 regression | arXiv |
| RDT | Robotics Diffusion Transformer (RDT-1B); diffusion foundation model for bimanual manipulation | SigLIP vision + T5 language encoder + Diffusion Transformer | Diffusion policy | arXiv |
| π0.5 | Open-world generalist VLA; combines high-level VLM reasoning with π0 low-level dexterity; web-scale + robot data co-training; excels at long-horizon tasks in unseen environments | PaliGemma-based VLM + π0 action expert | Flow matching (hierarchical) | arXiv / Project |
| GR00T N1 | NVIDIA's open foundation model for humanoid robots; dual-system design (fast System 1 / slow System 2); sim-to-real with Isaac; whole-body control | Custom VLM + diffusion policy | Diffusion policy | arXiv / Code |
| Gemini Robotics | Google DeepMind's multimodal robot foundation model; built on Gemini 2.0; safety-aware; multi-embodiment generalization; world understanding + dexterous control | Gemini 2.0 (multimodal Transformer) | Continuous action output | Report |
| DexVLA | Scales VLA for dexterous manipulation across embodiments; embodiment curriculum learning; diffusion action module; strong on bimanual and multi-finger tasks | VLM + diffusion action module | Diffusion policy | arXiv / Project |
| ABot-M0 | VLA foundation model based on action manifold learning | Qwen3-VL (VLM) + action-manifold-learning DiT | Action manifold learning + flow matching | arXiv / Project |
Focuses on tasks involving interaction with objects, ranging from simple pick-and-place to complex, dexterous, long-horizon activities. This is a major application area for VLA research.
-
[2026] Observing and Controlling Features in Vision-Language-Action Models [paper] (Stanford, Marco Pavone)
-
[2026] Pretrained Vision-Language-Action Models are Surprisingly Resistant to Forgetting in Continual Learning [paper] (UT Austin, Yuke Zhu)
-
[2026] EgoScale: Scaling Dexterous Manipulation with Diverse Egocentric Human Data [paper] (UC Berkeley, UT Austin — Trevor Darrell, Yuke Zhu, Linxi Fan)
-
[2026] FAVLA: A Force-Adaptive Fast-Slow VLA model for Contact-Rich Robotic Manipulation [paper]
-
[2026] Self-Correcting VLA: Online Action Refinement via Sparse World Imagination [paper]
-
[2026] LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies [paper]
-
[2026] DySL-VLA: Efficient Vision-Language-Action Model Inference via Dynamic-Static Layer-Skipping for Robot Manipulation [paper]
-
[2025] GR-RL: Going Dexterous and Precise for Long-Horizon Robotic Manipulation [paper] [project]
-
[2025] EvoVLA: Self-Evolving Vision-Language-Action Model [paper] [project] [code]
-
[2025] OmniVLA: Physically-Grounded Multimodal VLA with Unified Multi-Sensor Perception for Robotic Manipulation [paper]
-
[2025] π0.6: a VLA that Learns from Experience [paper]
-
[2025] HiMoE-VLA: Hierarchical Mixture-of-Experts for Generalist Vision-Language-Action Policies [paper] [code]
-
[2025] Mixture of Horizons in Action Chunking [paper] [project] [code]
-
[2025] VLA-0: Building State-of-the-Art VLAs with Zero Modification [paper] [project] [code]
-
[2025] Wall-OSS: Igniting VLMs toward the Embodied Space [project] [paper] [code]
-
[2025] Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation [paper] [project] [code]
-
[2025] Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies [paper]
-
[2025] UniVLA: Unified Vision-Language-Action Model [paper] [code]
-
[2025] Gemini Robotics On-Device brings AI to local robotic devices [report]
-
[2025] EgoVLA: Learning Vision-Language-Action Models from Egocentric Human Videos [paper][Project]
-
[2025] GeoVLA: Empowering 3D Representations in Vision-Language-Action Models [Project]
-
[2025] DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge [paper][Code][Project]
-
[2025] TriVLA: A Triple-System-Based Unified Vision-Language-Action Model for General Robot Control [paper][Project]
-
[2025] VOTE: Vision-Language-Action Optimization with Trajectory Ensemble Voting [paper]
-
[2025] V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning [paper] [project] [code]
-
[2025] VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning [paper][Code]
-
[2025] CEED-VLA: Consistency Vision-Language-Action Model with Early-Exit Decoding [paper][Project]
-
[2025] Knowledge Insulating Vision-Language-Action Models: Train Fast, Run Fast, Generalize Better [paper] [project]
-
[2025] SwitchVLA: Execution-Aware Task Switching for Vision-Language-Action Models [paper][Project]
-
[2025] Helix: A Vision-Language-Action Model for Generalist Humanoid Control [project]
-
[2025] CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models [paper] [project]
-
[2025] Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models [paper] [project]
-
[2025] Interactive Post-Training for Vision-Language-Action Models (RIPT-VLA) [paper] [project]
-
[2025] UP-VLA: A Unified Understanding and Prediction Model for Embodied Agent [paper]
-
[2025] NORA: A Small Open-Sourced Generalist Vision Language Action Model for Embodied Tasks [paper] [project]
-
[2025] DexVLA: Scaling Vision-Language-Action Models for Dexterous Manipulation Across Embodiments [paper] [project]
-
[2025] Shake-VLA: Shake, Stir, and Pour with a Dual-Arm Robot: A Vision-Language-Action Model for Automated Cocktail Making [paper]
-
[2025] VLA Model-Expert Collaboration: Enhancing Vision-Language-Action Models with Human Corrections via Shared Autonomy [paper] [project]
-
[2025] FAST: Efficient Action Tokenization for Vision-Language-Action Models [paper] [project]
-
[2025] HybridVLA: Integrating Diffusion and Autoregressive Action Prediction for Generalist Robot Control [paper] [project] [code]
-
[2025] Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success [paper] [project(OpenVLA-OFT)]
-
[2025] RoboBrain: A Unified Brain Model for Robotic Manipulation from Abstract to Concrete [paper] [project] [code]
-
[2025] GRAPE: Generalizing Robot Policy via Preference Alignment [paper] [Project][Code]
-
[2025] OTTER: A Vision-Language-Action Model with Text-Aware Visual Feature Extraction [paper] [project]
-
[2025] PointVLA: Injecting the 3D World into Vision-Language-Action Models [paper] [project]
-
[2025] AgiBot World Colosseo: Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems [paper] [project]
-
[2025] DexGraspVLA: A Vision-Language-Action Framework Towards General Dexterous Grasping [paper] [project]
-
[2025] Interleave-VLA: Enhancing Robot Manipulation with Interleaved Image-Text Instructions [paper]
-
[2025] EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation [paper]
-
[2025] VLA-Cache: Towards Efficient Vision-Language-Action Model via Adaptive Token Caching in Robotic Manipulation [paper]
-
[2025] SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Safe Reinforcement Learning [paper] [project]
-
[2025] Accelerating Vision-Language-Action Model Integrated with Action Chunking via Parallel Decoding [paper(PD-VLA)]
-
[2025] Refined Policy Distillation: From VLA Generalists to RL Experts [paper(RPD)]
-
[2025] MoLe-VLA: Dynamic Layer-skipping Vision Language Action Model via Mixture-of-Layers for Efficient Robot Manipulation [paper][project] [Code]
-
[2025] MoManipVLA: Transferring Vision-language-action Models for General Mobile Manipulation [paper] [project]
-
[2025] Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy [paper] [project]
-
[2025] Gemini Robotics: Bringing AI into the Physical World [report]
-
[2025] ConRFT: A Reinforced Fine-tuning Method for VLA Models via Consistency Policy, RSS 2025 [Paper][Project]
-
[2025] RoboGround: Robotic Manipulation with Grounded Vision-Language Priors [paper] [project]
-
[2025] ViSA-Flow: Accelerating Robot Skill Learning via Large-Scale Video Semantic Action Flow [paper] [project]
-
[2025] Interleave-VLA: Enhancing Robot Manipulation with Interleaved Image-Text Instructions [paper] [project]
-
[2025] OpenHelix: A Short Survey, Empirical Analysis, and Open-Source Dual-System VLA Model for Robotic Manipulation [paper] [project]
-
[2025] ReinboT: Amplifying Robot Visual-Language Manipulation with Reinforcement Learning
-
[2025] CrayonRobo: Object-Centric Prompt-Driven Vision-Language-Action Model for Robotic Manipulation [paper]
-
[2025] Learning to Act Anywhere with Task-centric Latent Actions [paper(UniVLA)] [project]
-
[2025] Pixel Motion as Universal Representation for Robot Control [paper] [project]
-
[2024] OpenVLA: An Open-Source Vision-Language-Action Model [paper] [code]
-
[2024] π₀ (Pi-Zero): Our First Generalist Policy [project] [code]
-
[2024] Octo: An Open-Source Generalist Robot Policy [paper] [project] [Code]
-
[2024] RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation [paper]
-
[2024] ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation [paper] [project] [code]
-
[2024] OK-Robot: What Really Matters in Integrating Open-Knowledge Models for Robotics [paper] [project] [code]
-
[2024] 3D-VLA: A 3D Vision-Language-Action Generative World Model [paper] [code]
-
[2024] TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation [paper] [project]
-
[2024] CogACT: Componentized Vision-Language-Action Models for Robotic Control [paper] [project]
-
[2024] RoboMM: All-in-One Multimodal Large Model for Robotic Manipulation [paper][Code]
-
[2024] Diffusion-VLA: Scaling Robot Foundation Models via Unified Diffusion and Autoregression [paper] [project]
-
[2024] HiRT: Enhancing Robotic Control with Hierarchical Robot Transformers [paper]
-
[2024] GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation [paper]
-
[2024] DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution [paper] [code]
-
[2024] RT-Affordance: Affordances are Versatile Intermediate Representations for Robot Manipulation [paper]
-
[2024] Moto: Latent Motion Token as the Bridging Language for Robot Manipulation [paper] [project]
-
[2024] Bi-VLA: Vision-Language-Action Model-Based System for Bimanual Robotic Dexterous Manipulations [paper]
-
[2024] An Embodied Generalist Agent in 3D World [paper]
-
[2024] Towards Synergistic, Generalized, and Efficient Dual-System for Robotic Manipulation [paper][project] [code]
-
[2023] RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control [paper] [project]
-
[2023] PaLM-E: An Embodied Multimodal Language Model [paper] [project]
-
[2023] VIMA: General Robot Manipulation with Multimodal Prompts [paper] [project]
-
[2023] VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models [paper] [project] [code]
-
[2022] RT-1: Robotics Transformer for Real-World Control at Scale [paper] [code]
-
[2022] Do As I Can, Not As I Say: Grounding Language in Robotic Affordances (SayCan) [paper] [project] [code]
-
[2022] Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation [paper] [project] [code]
Focuses on tasks where a robot moves through an environment based on visual input and language instructions. Includes Vision-Language Navigation (VLN) and applications for legged robots.
-
[2026] OpenFrontier: General Navigation with Visual-Language Grounded Frontiers [paper]
-
[2026] PROSPECT: Unified Streaming Vision-Language Navigation via Semantic--Spatial Fusion and Latent Predictive Representation [paper]
-
[2026] History-Conditioned Spatio-Temporal Visual Token Pruning for Efficient Vision-Language Navigation [paper]
-
[2026] AnyCamVLA: Zero-Shot Camera Adaptation for Viewpoint Robust Vision-Language-Action Models [paper] [project]
-
[2025] OctoNav: Towards Generalist Embodied Navigation [paper] [project]
-
[2025] Do Visual Imaginations Improve Vision-and-Language Navigation Agents? [paper] [project]
-
[2025] SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Models [paper] [project]
-
[2025] FlexVLN: Flexible Adaptation for Diverse Vision-and-Language Navigation Tasks [paper]
-
[2025] MoManipVLA: Transferring Vision-language-action Models for General Mobile Manipulation [paper] [project]
-
[2024] NaVILA: Legged Robot Vision-Language-Action Model for Navigation [paper] [project]
-
[2024] QUAR-VLA: Vision-Language-Action Model for Quadruped Robots [paper] [project]
-
[2024] NaviLLM: Towards Learning a Generalist Model for Embodied Navigation [paper] [code]
-
[2024] NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation [paper] [project]
-
[2023] VLN-SIG: Improving Vision-and-Language Navigation by Generating Future-View Image Semantics [paper] [project]
-
[2023] PanoGen: Text-Conditioned Panoramic Environment Generation for Vision-and-Language Navigation [paper] [project]
Focuses on enabling more natural and effective interactions between humans and robots, often using language (text or speech) as the primary interface.
-
[2026] The Robot's Inner Critic: Self-Refinement of Social Behaviors through VLM-based Replanning [paper] [project]
-
[2026] Morphology-Consistent Humanoid Interaction through Robot-Centric Video Synthesis [paper]
-
[2026] Not an Obstacle for Dog, but a Hazard for Human: A Co-Ego Navigation System for Guide Dog Robots [paper]
-
[2025] Unveiling the Potential of Vision-Language-Action Models with Open-Ended Multimodal Instructions [paper] (OE-VLA)
-
[2025] VLAS: Vision-Language-Action Model With Speech Instructions For Customized Robot Manipulation [paper]
-
[2025] Shake-VLA: Vision-Language-Action Model-Based System for Bimanual Robotic Manipulations and Liquid Mixing [paper]
-
[2025] VLA Model-Expert Collaboration for Bi-directional Manipulation Learning [paper]
-
[2025] Helix: A Vision-Language-Action Model for Generalist Humanoid Control [project]
-
[2025] CognitiveDrone: A VLA Model and Evaluation Benchmark for Real-Time Cognitive Task Solving and Reasoning in UAVs [paper][project]
-
[2024] TalkWithMachines: Enhancing Human-Robot Interaction for Interpretable Industrial Robotics Through Large/Vision Language Models [paper][project]
Focuses on using VLA/LLM components for high-level task decomposition, planning, and reasoning, often bridging the gap between complex instructions and low-level actions.
-
[2026] HALO: A Unified Vision-Language-Action Model for Embodied Multimodal Chain-of-Thought Reasoning [paper]
-
[2026] Self-Correcting VLA: Online Action Refinement via Sparse World Imagination [paper]
-
[2026] LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies [paper]
-
[2025] MemER: Scaling Up Memory for Robot Control via Experience Retrieval [paper] [project]
-
[2025] Long-VLA: Unleashing Long-Horizon Capability of Vision Language Action Model for Robot Manipulation [paper] [project]
-
[2025] ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning [paper] [project]
-
[2025] Agentic Robot: A Brain-Inspired Framework for Vision-Language-Action Models in Embodied Agents [paper] [project]
-
[2025] Training Strategies for Efficient Embodied Reasoning (ECoT-Lite) [paper]
-
[2025] OneTwoVLA: A Unified Vision-Language-Action Model with Adaptive Reasoning [paper] [project]
-
[2025] Vision-Language-Action Model with Open-World Embodied Reasoning from Pretrained Knowledge [paper] [project]
-
[2025] Hume: Introducing System-2 Thinking in Visual-Language-Action Model [paper] [project]
-
[2025] Robotic Control via Embodied Chain-of-Thought Reasoning [paper] [project][code]
-
[2025] GR00T N1: An Open Foundation Model for Generalist Humanoid Robots [paper] [Code]
-
[2025] Gemini Robotics: Bringing AI into the Physical World [report]
-
[2025] GRAPE: Generalizing Robot Policy via Preference Alignment [paper]
-
[2025] HAMSTER: Hierarchical Action Models For Open-World Robot Manipulation [paper]
-
[2025] π0.5: A Vision-Language-Action Model with Open-World Generalization [paper] [project]
-
[2025] Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models [paper] [project]
-
[2025] CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models [paper] [project]
-
[2025] Embodied-R: Collaborative Framework for Activating Embodied Spatial Reasoning in Foundation Models via Reinforcement Learning [paper] [project]
-
[2025] RoboBrain: A Unified Brain Model for Robotic Manipulation from Abstract to Concrete [paper] [project] [code]
-
[2025] Probing a Vision-Language-Action Model for Symbolic States and Integration into a Cognitive Architecture [paper]
-
[2024] RoboMamba: Efficient Vision-Language-Action Model for Robotic Reasoning and Manipulation [paper] [project]
-
[2024] Improving Vision-Language-Action Models via Chain-of-Affordance [paper] [project]
-
[2023] PaLM-E: An Embodied Multimodal Language Model [paper] [project]
-
[2023] EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought [paper] [code]
-
[2022] LLM-Planner: Few-Shot Grounded Planning with Large Language Models [paper] [project]
-
[2022] Code as Policies: Language Model Programs for Embodied Control [paper] [project]
-
[2022] Inner Monologue: Embodied Reasoning through Planning with Language Models [paper] [project]
-
[2022] Do As I Can, Not As I Say: Grounding Language in Robotic Affordances (SayCan) [paper] [project] [code]
-
[2026] PhysiFlow: Physics-Aware Humanoid Whole-Body VLA via Multi-Brain Latent Flow Matching and Robust Tracking [paper]
-
[2026] Cognition to Control - Multi-Agent Learning for Human-Humanoid Collaborative Transport [paper]
-
[2025] GR00T N1: An Open Foundation Model for Generalist Humanoid Robots [paper] [Code]
-
[2025] Helix: A Vision-Language-Action Model for Generalist Humanoid Control [project]
-
[2025] Humanoid-VLA: Towards Universal Humanoid Control with Visual Integration [paper]
-
[2025] Adversarial Attacks on Robotic Vision Language Action Models [paper][Code]
-
[2025] Vision-Language-Action Model with Open-World Embodied Reasoning from Pretrained Knowledge [paper][Project](ChatVLA-2)
-
[2025] ChatVLA: Unified Multimodal Understanding and Robot Control with Vision-Language-Action Model [paper][Project]
-
[2025] OpenDriveVLA: Towards End-to-end Autonomous Driving with Large Vision Language Action Model [paper][Project][Code]
-
[2024] OccLLaMA: An Occupancy-Language-Action Generative World Model for Autonomous Driving [paper]
-
[2024] DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models [paper][Project]
-
[2024] EMMA: End-to-End Multimodal Model for Autonomous Driving [paper][Code]
Focuses on the core neural network architectures used in VLA models.
-
Transformer-based: The dominant architecture, leveraging self-attention mechanisms to integrate vision, language, and action sequences. Applications:
- RT-1, RT-2, OpenVLA, Octo, PaLM-E.
-
Diffusion-based: Primarily for the action generation component, utilizing the ability of diffusion models to model complex distributions. Often combined with a Transformer backbone. Applications:
- Diffusion Policy, Octo (can use a diffusion head), 3D Diffuser Actor, SUDD, MDT, RDT-1B, DexVLA, DiVLA, TinyVLA, HybridVLA.
-
Hierarchical / Decoupled: Architectures that separate high-level reasoning/planning (often VLM/LLM-based) from low-level control/execution (which may be a separate policy). Applications:
- Helix, NaVILA, Hi Robot, π0.5, GR00T N1.
-
State-Space Models (SSM): Emerging architectures like Mamba are being explored for their efficiency. Applications:
- RoboMamba.
-
Mixture-of-Experts (MoE / MoLE): Using sparsely activated expert modules for task adaptation or efficiency. Applications:
- MoLe-VLA, HiMoE-VLA, MoRE.
-
Recent Architecture Updates (2026.03):
Focuses on how robot actions are represented (e.g., discrete tokens vs. continuous vectors) and how models generate them. This is a key area differentiating VLAs from VLMs.
-
Action Tokenization / Discretization: Representing continuous actions (e.g., joint angles, end-effector pose) as discrete tokens, often via binning. Used in early/many Transformer-based VLAs like RT-1, RT-2 to fit the language modeling paradigm. May have limitations in precision and high-frequency control.
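A minimal uniform-binning tokenizer makes the idea concrete. The 256-bin count and the normalized [-1, 1] range below are common but model-specific choices, not a fixed standard:

```python
import numpy as np

N_BINS = 256            # common choice; varies per model
LOW, HIGH = -1.0, 1.0   # assumes actions are normalized to [-1, 1]

def tokenize(action):
    """Map each continuous action dimension to a bin index in [0, N_BINS - 1]."""
    clipped = np.clip(action, LOW, HIGH)
    return np.floor((clipped - LOW) / (HIGH - LOW) * (N_BINS - 1) + 0.5).astype(int)

def detokenize(tokens):
    """Map bin indices back to bin-center values (lossy: precision = bin width)."""
    return LOW + tokens / (N_BINS - 1) * (HIGH - LOW)

a = np.array([0.30, -0.75, 0.0])   # e.g. [dx, dy, gripper]
t = tokenize(a)                    # array([166,  32, 128])
recovered = detokenize(t)          # within half a bin width of `a`
```

The round-trip error is bounded by half a bin width (1/255 ≈ 0.004 here), which is exactly the precision limitation mentioned above.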
-
Continuous Action Regression: Directly predicting continuous action vectors. Sometimes used in conjunction with other methods or implemented via specific heads. L1 regression is used in OpenVLA-OFT.
-
Diffusion Policies for Actions: Modeling action generation as a denoising diffusion process. Good at capturing multi-modality and continuous spaces. Applications:
- Diffusion Policy, Octo (diffusion head), SUDD, MDT, RDT-1B, DexVLA, DiVLA, TinyVLA. Can be slow due to iterative sampling.
- Discrete Diffusion VLA
-
Flow Matching: An alternative generative method for continuous actions, used in π0 for efficient, high-frequency (50Hz) trajectory generation.
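A sketch of flow-matching sampling under toy assumptions: the learned velocity network is replaced by a closed-form field whose flow transports Gaussian noise to a fixed target action, and the ODE is integrated with plain Euler steps (π0's actual model and conditioning are far richer):

```python
import numpy as np

TARGET = np.array([0.5, -0.2, 0.1])   # stand-in for the "true" action

def velocity(x, t):
    # Toy stand-in for the learned velocity network v(x, t): the straight-line
    # field that moves the current sample toward the target as t -> 1.
    return (TARGET - x) / (1.0 - t + 1e-6)

def sample_action(n_steps=10, seed=0):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (action) with Euler steps."""
    x = np.random.default_rng(seed).standard_normal(TARGET.shape[0])
    dt = 1.0 / n_steps
    for k in range(n_steps):
        x = x + velocity(x, k * dt) * dt
    return x

action = sample_action()   # lands on TARGET after the short integration
```

Because sampling is a short deterministic integration (10 steps here) rather than a long stochastic denoising chain, flow matching can reach the high control rates (e.g. 50 Hz) cited for π0.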
-
Action Chunking: Predicting multiple future actions in a single step, for efficiency and temporal consistency. Increases action dimensionality and inference time when using AR decoding. Applications:
- OpenVLA-OFT, PD-VLA, Mixture of Horizons.
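A sketch of receding-horizon chunk execution under illustrative assumptions (a random linear map stands in for the trained policy; observations are random placeholders):

```python
import numpy as np

H, ACT_DIM, OBS_DIM = 8, 7, 16
rng = np.random.default_rng(0)
W = rng.standard_normal((OBS_DIM, H * ACT_DIM)) * 0.01   # stand-in "policy"

def predict_chunk(obs):
    """One forward pass yields a chunk of H future actions instead of one."""
    return (obs @ W).reshape(H, ACT_DIM)

def rollout(n_env_steps=20, execute=4):
    """Execute the first `execute` actions of each chunk, then re-plan:
    one expensive forward pass per `execute` environment steps."""
    executed, obs = [], rng.random(OBS_DIM)
    while len(executed) < n_env_steps:
        chunk = predict_chunk(obs)
        for a in chunk[:execute]:
            executed.append(a)
            obs = rng.random(OBS_DIM)   # placeholder for the next observation
    return np.array(executed)[:n_env_steps]

actions = rollout()   # 20 env steps from only 5 forward passes
```

Executing only a prefix of each chunk before re-planning trades the open-loop risk of long chunks against the cost of frequent inference.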
-
Better Decoding Strategy: Techniques to speed up autoregressive decoding of action chunks. Applications:
- PD-VLA (parallel decoding), CEED-VLA (consistency model with early-exit decoding).
-
Specialized Tokenizers: Developing better ways to tokenize continuous action sequences. Applications:
- FAST.
-
Point-based Actions: Using VLMs to predict keypoints or goal locations rather than full trajectories. Applications:
- ReKep (relational keypoint constraints).
-
Mid-Level Language Actions: Generating actions as natural language commands to be consumed by a lower-level policy. Applications:
- NaVILA, Hi Robot.
Focuses on how VLA models are trained and adapted.
-
Imitation Learning (IL) / Behavior Cloning (BC): Dominant paradigm, training VLAs to mimic expert demonstrations (often from teleoperation). Heavily reliant on large-scale, diverse, high-quality datasets. Performance is often limited by the quality of the demonstrations. Applications:
- RT-1, RT-2, OpenVLA (pre-training part), Octo, Diffusion Policy, etc.
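The core BC objective is easy to sketch: fit a policy to expert actions by minimizing a regression loss. The linear policy and synthetic "expert" data below are illustrative assumptions; real VLAs replace the linear map with a large multimodal network and often use token cross-entropy or diffusion losses instead of MSE:

```python
import numpy as np

# Minimal behavior-cloning sketch: fit a linear policy a = W @ obs to expert
# demonstrations by gradient descent on mean squared error.
rng = np.random.default_rng(0)
obs = rng.standard_normal((256, 10))       # synthetic observations (assumed)
W_true = rng.standard_normal((7, 10))      # synthetic "expert" mapping
acts = obs @ W_true.T                      # expert actions to imitate

W = np.zeros((7, 10))                      # policy parameters
lr = 0.05
for _ in range(500):
    pred = obs @ W.T
    grad = 2.0 / len(obs) * (pred - acts).T @ obs   # d(MSE)/dW
    W -= lr * grad

final_mse = float(np.mean((obs @ W.T - acts) ** 2))
```

Because the objective only matches the demonstrations, the fitted policy can never exceed the demonstrator, which is exactly the limitation noted above.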
-
Reinforcement Learning (RL): Used to fine-tune VLAs or train components, allowing models to learn from interaction and potentially exceed demonstrator performance. Challenges include stability and sample efficiency with large models. Applications:
- iRe-VLA (iterative RL/SFT), MoRE (RL objective for MoE VLAs handling mixed data), RPD (RL-based policy distillation), ConRFT (RL fine-tuning with consistency policies), SafeVLA (Constrained RL for safety), RIPT-VLA, VLA-RL, SimpleVLA-RL.
- Robot-R1
- WoVR (World Models as Reliable Simulators for Post-Training VLA Policies with RL)
-
Pre-training & Fine-tuning: Standard approach, involving pre-training on large datasets (web data for VLM backbones, large robot datasets like OpenX for VLAs) and then fine-tuning on specific tasks or robots.
-
Parameter-Efficient Fine-Tuning (PEFT): Techniques like LoRA to efficiently adapt large VLAs without retraining the entire model, crucial for practical deployment and customization. MoRE uses LoRA modules as experts.
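A sketch of the LoRA idea, with shapes and rank chosen purely for illustration: the frozen weight is augmented with a trainable low-rank delta, so the adapted model starts out identical to the pretrained one while training only a small fraction of the parameters:

```python
import numpy as np

# LoRA sketch: freeze the base weight W and learn a low-rank delta B @ A,
# cutting trainable parameters from d_out*d_in to r*(d_in + d_out).
# Dimensions, rank, and scaling are illustrative assumptions.
d_in, d_out, r = 512, 512, 8
rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))       # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01    # trainable down-projection
B = np.zeros((d_out, r))                     # trainable up-projection (zero init)

def lora_forward(x: np.ndarray, alpha: float = 16.0) -> np.ndarray:
    """y = W x + (alpha / r) * B A x; identical to the frozen model at init."""
    return W @ x + (alpha / r) * (B @ (A @ x))
```

Zero-initializing `B` is the standard trick that makes the first fine-tuning step start from the unmodified pretrained behavior.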
-
Distillation: Training smaller, faster models (students) to mimic the behavior of larger, slower models (teachers). Applications:
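Independent of the specific application, the distillation objective itself is compact. A sketch under the assumption of discretized action tokens; the temperature and logit shapes are invented for illustration:

```python
import numpy as np

# Policy-distillation sketch: the student matches the teacher's softened
# action-token distribution via KL divergence. Temperature is an assumption.
def softmax(z, temp=1.0):
    z = z / temp
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, temp=2.0):
    """KL(teacher || student) over action-token distributions."""
    p = softmax(teacher_logits, temp)
    q = softmax(student_logits, temp)
    return float(np.sum(p * (np.log(p) - np.log(q))))
```

The temperature softens the teacher's distribution so the student also learns from the relative ranking of non-argmax action tokens, not just the single best one.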
-
Curriculum Learning: Structuring the learning process, e.g., by embodiment complexity. Applications:
- DexVLA uses embodied curriculum.
-
Learning from Mixed-Quality Data: Using techniques (e.g., RL in MoRE) to learn effectively even when demonstration data is suboptimal or contains failures.
-
Reasoning-Augmented Training / Inference: Enhancing policy quality via explicit intermediate reasoning. Applications:
Focuses on input data types beyond standard RGB images and text used by VLAs, and how they ground these inputs.
-
Integrating Speech: Control via spoken commands, potentially capturing nuances missed by text. Requires handling the speech modality directly or via ASR. Applications:
-
Integrating 3D Vision: Using point clouds, voxels, depth maps, or implicit representations (NeRFs, 3DGS) to provide richer spatial understanding. Applications:
- GeoVLA, 3D-VLA, PerAct, Act3D, RVT, RVT-2, RoboUniView, DP3, 3D Diffuser Actor, LEO, 3D-LLM, LLM-Grounder, SpatialVLA.
- Bridge VLA
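One simple way such 3D inputs are prepared — sketched here as an assumption, not any specific model's pipeline — is voxelizing a depth-derived point cloud into an occupancy grid, which PerAct-style policies then feed to a 3D transformer:

```python
import numpy as np

# Sketch: discretize an (N, 3) point cloud into a binary occupancy voxel grid.
# Workspace bounds and resolution are illustrative assumptions.
BOUNDS = np.array([[-0.5, 0.5], [-0.5, 0.5], [0.0, 1.0]])  # x/y/z in meters
RES = 32  # voxels per axis

def voxelize(points: np.ndarray) -> np.ndarray:
    """Map an (N, 3) point cloud to a binary (RES, RES, RES) occupancy grid."""
    grid = np.zeros((RES, RES, RES), dtype=bool)
    span = BOUNDS[:, 1] - BOUNDS[:, 0]
    idx = ((points - BOUNDS[:, 0]) / span * RES).astype(int)
    keep = np.all((idx >= 0) & (idx < RES), axis=1)  # drop out-of-bounds points
    idx = idx[keep]
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return grid
```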
-
Integrating Proprioception / State: Incorporating the robot's own state (joint angles, velocities, end-effector pose) as input. Common in many policies, explicitly mentioned in VLAS, PaLM-E, π0 (evaluation requires Simpler fork with proprioception support). OpenVLA initially lacked this, noted as a limitation/future work.
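A common fusion pattern — sketched here with invented dimensions and a simple linear projection — is to embed the state vector as one extra token alongside the visual tokens:

```python
import numpy as np

# Sketch of proprioceptive conditioning: project the robot state (e.g. 7 joint
# angles + gripper) into the model's embedding space and append it to the
# visual token sequence. All dimensions are illustrative assumptions.
EMBED = 256
rng = np.random.default_rng(0)
W_state = rng.standard_normal((EMBED, 8)) * 0.02  # assumed 8-dim state

def add_proprio_token(visual_tokens: np.ndarray, state: np.ndarray) -> np.ndarray:
    """Append one projected state token to an (N, EMBED) visual token sequence."""
    state_token = W_state @ state            # linear projection (assumption)
    return np.vstack([visual_tokens, state_token[None, :]])
```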
-
Multimodal Prompts: Handling instructions that include images or video in addition to text. Applications:
- VIMA.
-
Grounding: The process of linking language descriptions or visual perceptions to specific entities, locations, or actions in the physical world or robot representation. Addressed via various techniques like similarity matching, leveraging common-sense knowledge, multimodal alignment, or interaction. LLM-Grounder focuses on open-vocabulary 3D visual grounding.
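The similarity-matching variant of grounding can be sketched directly, assuming a CLIP-style shared embedding space (the encoders themselves are omitted; the embeddings in the usage below are random stand-ins):

```python
import numpy as np

# Grounding-by-similarity sketch: embed the language query and candidate
# object crops in a shared space and pick the best cosine match.
def ground(query_emb: np.ndarray, object_embs: np.ndarray) -> int:
    """Return the index of the object whose embedding best matches the query."""
    q = query_emb / np.linalg.norm(query_emb)
    o = object_embs / np.linalg.norm(object_embs, axis=1, keepdims=True)
    return int(np.argmax(o @ q))   # cosine similarity via normalized dot product
```

Given an instruction like "pick up the red mug", the query embedding is matched against per-object embeddings to select which detected object the action should target.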
This section lists key resources for training and evaluating VLA models. Large-scale, diverse datasets and standardized benchmarks are crucial for progress in the field.
| Name | Type | Focus Area | Key Features / Environment | Link | Key Publication |
|---|---|---|---|---|---|
| Open X-Embodiment (OpenX) | Dataset | General Manipulation | Aggregates 20+ datasets, cross-embodiment/task/environment, >1M trajectories | Project | arXiv |
| DROID | Dataset | Real-world Manipulation | Large-scale human-collected data (500+ tasks, 26k hours) | Project | arXiv |
| CALVIN | Dataset / Benchmark | Long-Horizon Manipulation | Long-horizon tasks with language conditioning, Franka arm, PyBullet simulation | Project | arXiv |
| QUARD | Dataset | Quadruped Robot Tasks | Large-scale multi-task dataset (sim + real) for navigation and manipulation | Project | ECCV 2024 |
| BEHAVIOR-1K | Dataset / Benchmark | Household Activities | 1000 simulated human household activities | Project | arXiv |
| Isaac Sim / Orbit / OmniGibson | Simulator | High-fidelity Robot Simulation | NVIDIA Omniverse-based, physically realistic | Isaac-sim, Orbit, OmniGibson | - |
| Habitat Sim | Simulator | Embodied AI Navigation | Flexible, high-performance 3D simulator | Project | arXiv |
| MuJoCo | Simulator | Physics Engine | Popular physics engine for robotics and RL | Website | - |
| PyBullet | Simulator | Physics Engine | Open-source physics engine, used for CALVIN, etc. | Website | - |
| ManiSkill (1, 2, 3) | Benchmark | Generalizable Manipulation Skills | Large-scale manipulation benchmark based on SAPIEN | Project | arXiv |
| Meta-World | Benchmark | Multi-task / Meta RL Manipulation | 50 Sawyer arm manipulation tasks, MuJoCo | Project | arXiv |
| RLBench | Benchmark | Robot Learning Manipulation | 100+ manipulation tasks, CoppeliaSim (V-REP) | Project | arXiv |
| VLN-CE / R2R / RxR | Benchmark | Vision-Language Nav | Standard VLN benchmarks, often run in Habitat | VLN-CE, R2R-EnvDrop, RxR | - |
Large-scale datasets of robot interaction trajectories, often with accompanying language instructions and visual observations. Crucial for training general-purpose policies via imitation learning.
-
[2026] HortiMulti: A Multi-Sensor Dataset for Localisation and Mapping in Horticultural Polytunnels [paper]
-
Open X-Embodiment (OpenX) [Project] - Open X-Embodiment Collaboration.
Details
A massive, standardized dataset aggregating data from 20+ existing robot datasets, spanning diverse embodiments, tasks, and environments. Used to train major VLAs like RT-X, Octo, OpenVLA, π0. Contains over 1 million trajectories. -
BridgeData V2 [Project] - Walke, H., et al.
Details
Large dataset collected on a WidowX robot, used for OpenVLA evaluation. -
DROID [Project] - Manuelli, L., et al.
Details
Large-scale, diverse, human-collected manipulation dataset (500+ tasks, 26k hours). Used to fine-tune/evaluate OpenVLA, π0. -
RH20T [Project] - Shao, L., et al.
Details
Comprehensive dataset with 110k robot clips, 110k human demonstrations, and 140+ tasks. -
CALVIN (Composing Actions from Language and Vision) [Project] - Mees, O., et al.
Details
Benchmark and dataset for long-horizon language-conditioned manipulation with a simulated Franka arm in PyBullet. -
QUARD (QUAdruped Robot Dataset) [Project] - Tang, J., et al.
Details
Large-scale multi-task dataset (sim + real) for quadruped navigation and manipulation, released with QUAR-VLA. Reported sizes vary between 246k and 348k simulated clips, plus 3k real clips. -
RoboNet [Project] - Dasari, S., et al.
Details
Early large-scale dataset aggregating data from multiple robot platforms. -
BEHAVIOR-1K [Project] - Srivastava, S., et al.
Details
Dataset of 1000 simulated human household activities, useful for high-level task understanding. -
SQA & CSI Datasets [arXiv] - Zhao, W., et al.
Details
Curated datasets with speech instructions, released with the VLAS model, for speech-vision-action alignment and fine-tuning. -
Libero [Project] - Li, Z., et al.
Details
Benchmark suite for robot lifelong learning with procedurally generated tasks. Used in π0 fine-tuning examples. -
D4RL (Datasets for Deep Data-Driven Reinforcement Learning) [Code] - Fu, J., et al.
Details
Standardized datasets for offline RL research, potentially useful for RL-based VLA methods.
Physics-based simulators used to train agents, generate synthetic data, and evaluate policies in controlled settings before real-world deployment.
-
NVIDIA Isaac Sim / Orbit / OmniGibson [Isaac-sim, Orbit, OmniGibson].
Details
High-fidelity, physically realistic simulators based on NVIDIA Omniverse. Used for QUAR-VLA, ReKep, ARNOLD, etc. -
Habitat Sim [Project] - Facebook AI Research (Meta AI).
Details
Flexible, high-performance 3D simulator for Embodied AI research, especially navigation. -
MuJoCo (Multi-Joint dynamics with Contact) [Project].
Details
Popular physics engine widely used for robot simulation and RL benchmarks (dm\_control, robosuite, Meta-World, RoboHive). -
PyBullet [Project].
Details
Open-source physics engine, used for CALVIN and other benchmarks (panda-gym). -
SAPIEN [Project].
Details
Physics simulator focused on articulated objects and interaction. Used for the ManiSkill benchmark. -
Gazebo [Project].
Details
Widely used open-source robot simulator, especially in the ROS ecosystem. -
Webots [Project].
Details
Open-source desktop robot simulator. -
Genesis [GitHub].
Details
A newer platform aimed at general robot/Embodied AI simulation. -
UniSim [arXiv] - Yang, G., et al.
Details
Learns interactive simulators from real-world videos.
Standardized suites of environments and tasks used to evaluate and compare the performance of VLA models and other robot learning algorithms.
-
[2026] NavTrust: Benchmarking Trustworthiness for Embodied Navigation [paper] [project] [code]
-
[2026] HUGE-Bench: A Benchmark for High-Level UAV Vision-Language-Action Tasks [paper]
-
[2026] Can LLMs Prove Robotic Path Planning Optimality? A Benchmark for Research-Level Algorithm Verification [paper]
-
CALVIN [Project].
Details
Benchmark for long-horizon language-conditioned manipulation. -
ManiSkill (1, 2, 3) [Project]
Details
Large-scale benchmark for generalizable manipulation skills, based on SAPIEN. -
Meta-World [Project].
Details
Multi-task and meta-RL benchmark with 50 different manipulation tasks using a Sawyer arm in MuJoCo. -
RLBench [Project].
Details
Large-scale benchmark with 100+ manipulation tasks in CoppeliaSim (V-REP). -
Franka Kitchen [GitHub].
Details
dm\_control-based benchmark involving kitchen tasks with a Franka arm. Used in iRe-VLA. -
LIBERO [Project].
Details
Benchmark for lifelong/continual learning in robot manipulation. -
VIMA-Bench [Project].
Details
Multimodal few-shot prompting benchmark for robot manipulation. -
BEHAVIOR-1K [Project].
Details
Benchmark focused on long-horizon household activities. -
VLN-CE / R2R / RxR [VLN-CE, R2R-EnvDrop, RxR].
Details
Standard benchmarks for Vision-Language Navigation, often run in Habitat. NaVILA is evaluated on these. -
Safety-CHORES [paper].
Details
A new simulated benchmark with safety constraints, proposed for evaluating safe VLA learning. -
OK-VQA [Project].
Details
Visual question answering benchmark requiring external knowledge, used to evaluate the general VLM abilities of [PaLM-E](https://arxiv.org/abs/2303.03378).
-
Data Efficiency & Scalability: Reducing reliance on massive, expensive, expert-driven datasets. Improving the ability to learn from limited, mixed-quality, or internet-sourced data. Efficiently scaling models and training processes.
- Future directions: Improved sample efficiency (RL, self-supervision), sim-to-real transfer, automated data generation, efficient architectures (SSMs, MoEs), data filtering/weighting.
-
Inference Speed & Real-Time Control: Current large VLAs may be too slow for the high-frequency control loops needed for dynamic tasks or dexterous manipulation.
-
Robustness & Reliability: Ensuring consistent performance across variations in environment, lighting, object appearance, disturbances, and unexpected events. Current models can be brittle.
- Future directions: Adversarial training, improved grounding, better 3D understanding, closed-loop feedback, anomaly detection, incorporating physical priors, testing frameworks (VLATest).
-
Generalization: Improving the ability to generalize to new tasks, objects, instructions, environments, and embodiments beyond the training distribution. This is a core promise of VLAs, but remains a challenge.
- Future directions: Training on more diverse data (OpenX), effective utilization of VLM pre-training knowledge, compositional reasoning, continual/lifelong learning, better action representations.
-
Safety & Alignment: Explicitly incorporating safety constraints to prevent harm to the robot, the environment, or humans. Ensuring alignment with user intent. Crucial for real-world deployment.
- Future directions: Constrained reinforcement learning (SafeVLA), formal verification, human oversight mechanisms, robust failure detection/recovery, ethical considerations.
-
Dexterity & Contact-Rich Tasks: Improving performance on tasks requiring fine motor skills, precise force control, and handling complex object interactions. Current VLAs often lag behind specialized methods in this area.
- Future directions: Better action representations (FAST, Diffusion), integration of tactile sensing, improved physical understanding/simulation, hybrid control approaches.
-
Reasoning & Long-Horizon Planning: Enhancing the ability for multi-step reasoning, long-horizon planning, and handling complex instructions.
- Future directions: Hierarchical architectures, explicit planning modules, chain-of-thought reasoning (visual/textual), memory mechanisms, world models.
-
Multimodality Expansion: Integrating richer sensory inputs beyond vision + language, such as audio/speech, touch, force, 3D.
- Future directions: Developing architectures and alignment techniques for diverse modalities.
- Awesome-VLA:
- https://github.com/yueen-ma/Awesome-VLA
- https://github.com/OpenHelix-robot/awesome-dual-system-vla
- https://github.com/Orlando-CS/Awesome-VLA
- https://github.com/AoqunJin/Awesome-VLA-Post-Training
- https://github.com/OpenHelix-Team/Awesome-VLA-RL
- https://github.com/jonyzhang2023/awesome-embodied-vla-va-vln
- https://github.com/keon/awesome-physical-ai
- Awesome-Embodied-AI:
- Awesome-Robot-Learning:
- Awesome-Vision-Language-Models:
If you find this repository useful, please consider citing this list:
@misc{liu2025vlaroboticspaperslist,
  title        = {Awesome-VLA-Robotics},
  author       = {Jiaqi Liu},
  howpublished = {GitHub repository},
  url          = {https://github.com/Jiaaqiliu/Awesome-VLA-Robotics},
  year         = {2025},
}