Jiaaqiliu/Awesome-VLA-Robotics


Awesome VLA for Robotics

A curated list of research papers, models, datasets, and other resources on Vision-Language-Action (VLA) models in robotics. The accompanying survey paper will be released soon. Contributions are welcome!


1. What are VLA Models in Robotics?

  • Definition: Vision-Language-Action (VLA) models are a class of multimodal AI systems specifically designed for robotics and Embodied AI. They integrate visual perception (from cameras/sensors), natural language understanding (from text or voice commands), and action generation (physical movements or digital tasks) into a unified framework. Unlike traditional robotic systems that often treat perception, planning, and control as separate modules, VLAs aim for end-to-end or tightly integrated processing, similar to how the human brain processes these modalities simultaneously. The term "VLA" gained prominence with the introduction of the RT-2 model. Generally, a VLA is defined as any model capable of processing multimodal inputs (vision, language) to generate robotic actions for completing embodied tasks.

  • Core Concepts: The basic idea is to leverage the powerful capabilities of large models (LLMs and VLMs) pre-trained on internet-scale data and apply them to robot control. This involves "grounding" language instructions and visual perception information in the physical world to generate appropriate robot actions. The goal is to achieve greater versatility, dexterity, generalization ability, and robustness compared to traditional methods or early reinforcement learning approaches, enabling robots to work effectively in complex, unstructured environments like homes.

  • Key Components:

    • Vision Encoder: Processes raw visual input (images, videos, sometimes 3D data), using architectures like ViT, CLIP encoders, DINOv2, SigLIP to extract meaningful features (object recognition, spatial reasoning).
    • Language Understanding: Employs LLM components (such as Llama, PaLM, GPT variants) to process natural language instructions, map commands to context, and perform reasoning.
    • Action Decoder/Policy: Generates robot actions based on integrated visual and language understanding (e.g., end-effector pose, joint velocities, gripper commands, base movements). This is a significant differentiator between VLAs and VLMs, involving techniques like action tokenization, diffusion models, or direct regression.
    • Alignment Mechanisms: Uses strategies like projection layers and cross-attention to bridge the gap between different modalities, aligning visual, language, and action representations.
  • Relationship to VLMs and Embodied AI: VLAs are a specialized category within the field of Embodied AI. They extend Vision-Language Models (VLMs) by explicitly incorporating action generation capabilities. VLMs primarily focus on understanding and generating text based on visual input, while VLAs leverage this understanding to interact with the physical world.

  • Evolution from VLM Adaptation to Integrated Systems: Early VLA research focused mainly on adapting existing VLMs by simply fine-tuning them to output action tokens (e.g., the initial concept of RT-2). However, the field is moving towards more integrated architectures where the action generation components are more sophisticated and co-designed (e.g., diffusion policies, specialized action modules, hierarchical systems like Helix or NaVILA). This evolution indicates that the definition of VLA is shifting from merely fine-tuning VLMs to designing specific VLA architectures that better address the unique requirements of robot action generation, while still leveraging the capabilities of VLMs.
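The component breakdown above can be made concrete with a minimal, purely illustrative sketch of how the pieces compose. Every class and function here is a hypothetical stand-in (not taken from any published model): a real VLA would use a ViT/SigLIP vision tower, an LLM backbone, a learned projection or cross-attention for alignment, and a trained action head.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class VLAOutput:
    action: List[float]  # e.g. [dx, dy, dz, droll, dpitch, dyaw, gripper]

class VisionEncoder:
    def encode(self, image: List[List[float]]) -> List[float]:
        # Stand-in: global average pooling instead of a ViT forward pass.
        return [sum(row) / len(row) for row in image]

class LanguageEncoder:
    def encode(self, instruction: str) -> List[float]:
        # Stand-in: bag-of-characters features instead of LLM hidden states.
        return [instruction.count(c) / max(len(instruction), 1) for c in "aeiou"]

class ActionDecoder:
    """Maps fused vision-language features to a 7-DoF action vector."""
    def decode(self, features: List[float]) -> List[float]:
        # Stand-in: fixed linear map; a real head would be a learned token
        # decoder, regressor, or diffusion/flow model.
        return [sum(features) * w for w in (0.1, 0.2, 0.3, 0.0, 0.0, 0.0, 1.0)]

class VLAPolicy:
    def __init__(self):
        self.vision = VisionEncoder()
        self.language = LanguageEncoder()
        self.decoder = ActionDecoder()

    def act(self, image, instruction: str) -> VLAOutput:
        # "Alignment" here is plain concatenation; real models use
        # projection layers or cross-attention to bridge modalities.
        fused = self.vision.encode(image) + self.language.encode(instruction)
        return VLAOutput(action=self.decoder.decode(fused))

policy = VLAPolicy()
out = policy.act([[0.1, 0.2], [0.3, 0.4]], "pick up the red block")
assert len(out.action) == 7
```

The point of the sketch is only the data flow — perception and language features are fused, then a dedicated action module (the main differentiator from a VLM) maps them to robot commands.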

2. Survey papers

  • [2025] A Survey on Efficient Vision-Language-Action Models [paper]

  • [2025] Vision-Language-Action Models for Robotics: A Review Towards Real-World Applications [paper]

  • [2025] Pure Vision Language Action (VLA) Models: A Comprehensive Survey [paper]

  • [2025] Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey [paper] [project]

  • [2025] A Survey on Vision-Language-Action Models: An Action Tokenization Perspective [paper]

  • [2025] Foundation Model Driven Robotics: A Comprehensive Review [paper]

  • [2025] A Survey on Vision-Language-Action Models for Autonomous Driving [paper] [project]

  • [2025] Parallels Between VLA Model Post-Training and Human Motor Learning: Progress, Challenges, and Trends [paper] [project]

  • [2025] A Survey on Vision-Language-Action Models for Embodied AI [paper]

  • [2025] Foundation Models in Robotics: Applications, Challenges, and the Future [paper] [project]

  • [2025] Vision Language Action Models in Robotic Manipulation: A Systematic Review [paper]

  • [2025] Vision-Language-Action Models: Concepts, Progress, Applications and Challenges [paper]

  • [2025] OpenHelix: A Short Survey, Empirical Analysis, and Open-Source Dual-System VLA Model for Robotic Manipulation [paper] [project]

  • [2025] Exploring Embodied Multimodal Large Models: Development, Datasets, and Future Directions [paper]

  • [2025] Multimodal Fusion and Vision-Language Models: A Survey for Robot Vision [paper] [project]

  • [2025] Generative Artificial Intelligence in Robotic Manipulation: A Survey [paper] [project]

  • [2025] Neural Brain: A Neuroscience-inspired Framework for Embodied Agents [paper] [project]

  • [2024] Aligning Cyber Space with Physical World: A Comprehensive Survey on Embodied AI [paper]

  • [2024] A Survey on Robotics with Foundation Models: toward Embodied AI [paper]

  • [2024] What Foundation Models can Bring for Robot Learning in Manipulation: A Survey [paper]

  • [2024] Towards Generalist Robot Learning from Internet Video: A Survey [paper]

  • [2024] Large Multimodal Agents: A Survey [paper]

  • [2024] A Survey on Integration of Large Language Models with Intelligent Robots [paper]

  • [2024] Vision-Language Models for Vision Tasks: A Survey [paper]

  • [2024] A Survey of Embodied Learning for Object-Centric Robotic Manipulation [paper]

  • [2024] Vision-language navigation: a survey and taxonomy [paper]

  • [2023] Toward general-purpose robots via foundation models: A survey and meta-analysis [paper]

  • [2023] Robot learning in the era of foundation models: A survey [paper]

3. Key VLA Models and Research Papers

This section is the heart of the resource, listing specific VLA models and influential research papers. Papers are first categorized by major application area, then by key technical contributions. A paper/model may appear in multiple subsections if it is relevant to several categories.

3.1 Quick Glance at Key VLA Models

Each entry lists the model's key contribution / features, its base VLM / architecture, its action generation method, and available resources (paper / project / code).

  • RT-1 — First large-scale Transformer robot model; demonstrates scalability on multi-task real-world data; action discretization. Base: Transformer (EfficientNet-B3 vision). Actions: action binning + token output. Links: arXiv / Project / Code

  • RT-2 — Transfers web knowledge of VLMs to robot control; joint fine-tuning of a VLM to output action tokens; shows emergent generalization. Base: PaLI-X / PaLM-E (Transformer). Actions: action binning + token output. Links: arXiv / Project

  • PaLM-E — Embodied multimodal language model; injects continuous sensor data (images, state) into a pre-trained LLM; usable for sequential manipulation planning, VQA, etc. Base: PaLM (Transformer). Actions: outputs subgoals or action descriptions. Links: ICML / Project

  • OpenVLA — Open-source 7B-parameter VLA; based on Llama 2; trained on the OpenX dataset; outperforms RT-2-X; shows good generalization and PEFT ability. Base: Llama 2 (DINOv2 + SigLIP vision). Actions: action binning + token output. Links: arXiv / Project / Code / HF

  • Helix — General-purpose VLA for humanoid robots; hierarchical System 1 / System 2 architecture; full-body control; multi-robot collaboration; onboard deployment. Base: custom VLM (System 2) + visuomotor policy (System 1). Actions: continuous action output (System 1). Links: Paper / Project

  • π0 (Pi-Zero) — General-purpose VLA; uses flow matching to generate continuous action trajectories at 50 Hz; cross-platform training (7 platforms, 68 tasks). Base: PaliGemma (Transformer) + action expert. Actions: flow matching. Links: arXiv / Project / Code / HF

  • Octo — General-purpose robot model; trained on the OpenX dataset; flexible input/output conditioning; often used as a baseline. Base: Transformer (ViT). Actions: action binning + token output / diffusion head. Links: arXiv / Project / Code

  • SayCan — Grounds LLM planning in robot affordances; an LLM scores skill relevance while a value function scores executability. Base: PaLM (Transformer) + value function. Actions: selects pre-defined skills (high-level planner). Links: arXiv / Project / Code

  • NaVILA — Two-stage framework for legged-robot VLN; a high-level VLA outputs mid-level language actions that a low-level vision-motor policy executes. Base: InternVL-Chat-V1.5 (VLM) + locomotion policy (RL). Actions: mid-level language action output (VLA). Links: arXiv / Project

  • VLAS — First end-to-end VLA with direct integration of speech commands; based on LLaVA; three-stage fine-tuning for voice commands; supports personalized tasks (Voice RAG). Base: LLaVA (Transformer) + speech encoder. Actions: action binning + token output. Links: arXiv

  • CoT-VLA — Incorporates explicit visual chain-of-thought (Visual CoT) reasoning; predicts future goal images before generating actions; hybrid attention mechanism. Base: Llama 2 (ViT vision). Actions: action binning + token output (after predicting visual goals). Links: arXiv / Project

  • TinyVLA — Compact, fast, data-efficient VLA; requires no pre-training; pairs a small VLM with a diffusion-policy decoder. Base: MobileVLM V2 / Moondream2 + diffusion policy decoder. Actions: diffusion policy. Links: arXiv / Project

  • CogACT — Componentized VLA architecture; specialized action module (Diffusion Action Transformer) conditioned on VLM output; significantly outperforms OpenVLA / RT-2-X. Base: InternVL-Chat-V1.5 (VLM) + Diffusion Action Transformer. Actions: diffusion policy. Links: arXiv / Project

  • TLA — Tactile-Language-Action model; grounds sequential tactile feedback via cross-modal language to enable robust policy generation in contact-intensive scenarios. Base: Qwen2 7B + LoRA + Qwen2-VL. Actions: Qwen2 token output. Links: arXiv / Project

  • OpenVLA-OFT — OpenVLA with an Optimized Fine-Tuning (OFT) recipe for improved speed and success rates. Base: Llama 2 (DINOv2 + SigLIP vision). Actions: L1 regression. Links: arXiv

  • RDT — Robotics Diffusion Transformer; diffusion foundation model for bimanual manipulation (RDT-1B). Base: InternVL-Chat-V1.5 (VLM) + Diffusion Action Transformer. Actions: diffusion policy. Links: arXiv

  • π0.5 — Open-world generalist VLA; combines high-level VLM reasoning with π0's low-level dexterity; web-scale + robot-data co-training; excels at long-horizon tasks in unseen environments. Base: PaliGemma-based VLM + π0 action expert. Actions: flow matching (hierarchical). Links: arXiv / Project

  • GR00T N1 — NVIDIA's open foundation model for humanoid robots; dual-system design (fast System 1 / slow System 2); sim-to-real with Isaac; whole-body control. Base: custom VLM + diffusion policy. Actions: diffusion policy. Links: arXiv / Code

  • Gemini Robotics — Google DeepMind's multimodal robot foundation model; built on Gemini 2.0; safety-aware; multi-embodiment generalization; world understanding + dexterous control. Base: Gemini 2.0 (multimodal Transformer). Actions: continuous action output. Links: Report

  • DexVLA — Scales VLA to dexterous manipulation across embodiments; embodiment curriculum learning; diffusion action module; strong on bimanual and multi-finger tasks. Base: VLM + diffusion action module. Actions: diffusion policy. Links: arXiv / Project

  • ABot-M0 — VLA foundation model based on action manifold learning. Base: Qwen3-VL (VLM) + action-manifold-learning DiT. Actions: action manifold learning + flow matching. Links: arXiv / Project

3.2 By Application Area

3.2.1 Manipulation

Focuses on tasks involving interaction with objects, ranging from simple pick-and-place to complex, dexterous, long-horizon activities. This is a major application area for VLA research.

2026
  • [2026] Observing and Controlling Features in Vision-Language-Action Models [paper] (Stanford, Marco Pavone)

  • [2026] Pretrained Vision-Language-Action Models are Surprisingly Resistant to Forgetting in Continual Learning [paper] (UT Austin, Yuke Zhu)

  • [2026] EgoScale: Scaling Dexterous Manipulation with Diverse Egocentric Human Data [paper] (UC Berkeley, UT Austin — Trevor Darrell, Yuke Zhu, Linxi Fan)

  • [2026] FAVLA: A Force-Adaptive Fast-Slow VLA model for Contact-Rich Robotic Manipulation [paper]

  • [2026] Self-Correcting VLA: Online Action Refinement via Sparse World Imagination [paper]

  • [2026] LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies [paper]

  • [2026] DySL-VLA: Efficient Vision-Language-Action Model Inference via Dynamic-Static Layer-Skipping for Robot Manipulation [paper]

2025
  • [2025] GR-RL: Going Dexterous and Precise for Long-Horizon Robotic Manipulation [paper] [project]

  • [2025] EvoVLA: Self-Evolving Vision-Language-Action Model [paper] [project] [code]

  • [2025] OmniVLA: Physically-Grounded Multimodal VLA with Unified Multi-Sensor Perception for Robotic Manipulation [paper]

  • [2025] π0.6: a VLA that Learns from Experience [paper]

  • [2025] HiMoE-VLA: Hierarchical Mixture-of-Experts for Generalist Vision-Language-Action Policies [paper] [code]

  • [2025] Mixture of Horizons in Action Chunking [paper] [project] [code]

  • [2025] VLA-0: Building State-of-the-Art VLAs with Zero Modification [paper] [project] [code]

  • [2025] Wall-OSS: Igniting VLMs toward the Embodied Space [project] [paper] [code]

  • [2025] Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation [paper] [project] [code]

  • [2025] GR-3 Technical Report [paper] [project]

  • [2025] Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies [paper]

  • [2025] UniVLA: Unified Vision-Language-Action Model [paper] [code]

  • [2025] Gemini Robotics On-Device brings AI to local robotic devices [report]

  • [2025] EgoVLA: Learning Vision-Language-Action Models from Egocentric Human Videos [paper][Project]

  • [2025] GeoVLA: Empowering 3D Representations in Vision-Language-Action Models [Project]

  • [2025] DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge [paper][Code][Project]

  • [2025] TriVLA: A Triple-System-Based Unified Vision-Language-Action Model for General Robot Control [paper][Project]

  • [2025] VOTE: Vision-Language-Action Optimization with Trajectory Ensemble Voting [paper]

  • [2025] V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning [paper] [project] [code]

  • [2025] VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning [paper][Code]

  • [2025] CEED-VLA: Consistency Vision-Language-Action Model with Early-Exit Decoding [paper][Project]

  • [2025] Knowledge Insulating Vision-Language-Action Models: Train Fast, Run Fast, Generalize Better [paper] [project]

  • [2025] SwitchVLA: Execution-Aware Task Switching for Vision-Language-Action Models [paper][Project]

  • [2025] Helix: A Vision-Language-Action Model for Generalist Humanoid Control [project]

  • [2025] CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models [paper] [project]

  • [2025] Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models [paper] [project]

  • [2025] Interactive Post-Training for Vision-Language-Action Models (RIPT-VLA) [paper] [project]

  • [2025] UP-VLA: A Unified Understanding and Prediction Model for Embodied Agent [paper]

  • [2025] NORA: A Small Open-Sourced Generalist Vision Language Action Model for Embodied Tasks [paper] [project]

  • [2025] DexVLA: Scaling Vision-Language-Action Models for Dexterous Manipulation Across Embodiments [paper] [project]

  • [2025] Shake-VLA: Shake, Stir, and Pour with a Dual-Arm Robot: A Vision-Language-Action Model for Automated Cocktail Making [paper]

  • [2025] VLA Model-Expert Collaboration: Enhancing Vision-Language-Action Models with Human Corrections via Shared Autonomy [paper] [project]

  • [2025] FAST: Efficient Action Tokenization for Vision-Language-Action Models [paper] [project]

  • [2025] HybridVLA: Integrating Diffusion and Autoregressive Action Prediction for Generalist Robot Control [paper] [project] [code]

  • [2025] Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success [paper] [project(OpenVLA-OFT)]

  • [2025] RoboBrain: A Unified Brain Model for Robotic Manipulation from Abstract to Concrete [paper] [project] [code]

  • [2025] GRAPE: Generalizing Robot Policy via Preference Alignment [paper] [Project][Code]

  • [2025] OTTER: A Vision-Language-Action Model with Text-Aware Visual Feature Extraction [paper] [project]

  • [2025] PointVLA: Injecting the 3D World into Vision-Language-Action Models [paper] [project]

  • [2025] AgiBot World Colosseo: Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems [paper] [project]

  • [2025] DexGraspVLA: A Vision-Language-Action Framework Towards General Dexterous Grasping [paper] [project]

  • [2025] Interleave-VLA: Enhancing Robot Manipulation with Interleaved Image-Text Instructions [paper] [project]

  • [2025] EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation [paper]

  • [2025] VLA-Cache: Towards Efficient Vision-Language-Action Model via Adaptive Token Caching in Robotic Manipulation [paper]

  • [2025] SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Safe Reinforcement Learning [paper] [project]

  • [2025] Accelerating Vision-Language-Action Model Integrated with Action Chunking via Parallel Decoding [paper(PD-VLA)]

  • [2025] Refined Policy Distillation: From VLA Generalists to RL Experts [paper(RPD)]

  • [2025] MoLe-VLA: Dynamic Layer-skipping Vision Language Action Model via Mixture-of-Layers for Efficient Robot Manipulation [paper][project] [Code]

  • [2025] MoManipVLA: Transferring Vision-language-action Models for General Mobile Manipulation [paper] [project]

  • [2025] Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy [paper] [project]

  • [2025] Gemini Robotics: Bringing AI into the Physical World [report]

  • [2025] ConRFT: A Reinforced Fine-tuning Method for VLA Models via Consistency Policy, RSS 2025 [Paper][Project]

  • [2025] RoboGround: Robotic Manipulation with Grounded Vision-Language Priors [paper] [project]

  • [2025] ViSA-Flow: Accelerating Robot Skill Learning via Large-Scale Video Semantic Action Flow [paper] [project]

  • [2025] OpenHelix: A Short Survey, Empirical Analysis, and Open-Source Dual-System VLA Model for Robotic Manipulation [paper] [project]

  • [2025] ReinboT: Amplifying Robot Visual-Language Manipulation with Reinforcement Learning

  • [2025] CrayonRobo: Object-Centric Prompt-Driven Vision-Language-Action Model for Robotic Manipulation [paper]

  • [2025] Learning to Act Anywhere with Task-centric Latent Actions [paper(UniVLA)] [project]

  • [2025] Pixel Motion as Universal Representation for Robot Control [paper] [project]

2024
  • [2024] OpenVLA: An Open-Source Vision-Language-Action Model [paper] [code]

  • [2024] π₀ (Pi-Zero): Our First Generalist Policy [project] [code]

  • [2024] Octo: An Open-Source Generalist Robot Policy [paper] [project] [Code]

  • [2024] RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation [paper]

  • [2024] ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation [paper] [project] [code]

  • [2024] OK-Robot: What Really Matters in Integrating Open-Knowledge Models for Robotics [paper] [project] [code]

  • [2024] 3D-VLA: A 3D Vision-Language-Action Generative World Model [paper] [code]

  • [2024] TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation [paper] [project]

  • [2024] CogACT: Componentized Vision-Language-Action Models for Robotic Control [paper] [project]

  • [2024] RoboMM: All-in-One Multimodal Large Model for Robotic Manipulation [paper][Code]

  • [2024] Diffusion-VLA: Scaling Robot Foundation Models via Unified Diffusion and Autoregression [paper] [project]

  • [2024] HiRT: Enhancing Robotic Control with Hierarchical Robot Transformers [paper]

  • [2024] GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation [paper]

  • [2024] DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution [paper] [code]

  • [2024] RT-Affordance: Affordances are Versatile Intermediate Representations for Robot Manipulation [paper]

  • [2024] Moto: Latent Motion Token as the Bridging Language for Robot Manipulation [paper] [project]

  • [2024] Bi-VLA: Vision-Language-Action Model-Based System for Bimanual Robotic Dexterous Manipulations [paper]

  • [2024] An Embodied Generalist Agent in 3D World [paper]

  • [2024] Towards Synergistic, Generalized, and Efficient Dual-System for Robotic Manipulation [paper][project] [code]

2023
  • [2023] RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control [paper] [project]

  • [2023] PaLM-E: An Embodied Multimodal Language Model [paper] [project]

  • [2023] VIMA: General Robot Manipulation with Multimodal Prompts [paper] [project]

  • [2023] VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models [paper] [project] [code]

2022
  • [2022] RT-1: Robotics Transformer for Real-World Control at Scale [paper] [code]

  • [2022] Do As I Can, Not As I Say: Grounding Language in Robotic Affordances (SayCan) [paper] [project] [code]

  • [2022] Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation [paper] [project] [code]

3.2.2 Navigation and Mobile Manipulation

Focuses on tasks where a robot moves through an environment based on visual input and language instructions. Includes Vision-Language Navigation (VLN) and applications for legged robots.

  • [2026] OpenFrontier: General Navigation with Visual-Language Grounded Frontiers [paper]

  • [2026] PROSPECT: Unified Streaming Vision-Language Navigation via Semantic--Spatial Fusion and Latent Predictive Representation [paper]

  • [2026] History-Conditioned Spatio-Temporal Visual Token Pruning for Efficient Vision-Language Navigation [paper]

  • [2026] AnyCamVLA: Zero-Shot Camera Adaptation for Viewpoint Robust Vision-Language-Action Models [paper] [project]

  • [2025] OctoNav: Towards Generalist Embodied Navigation [paper] [project]

  • [2025] Do Visual Imaginations Improve Vision-and-Language Navigation Agents? [paper] [project]

  • [2025] SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Models [paper] [project]

  • [2025] FlexVLN: Flexible Adaptation for Diverse Vision-and-Language Navigation Tasks [paper]

  • [2025] MoManipVLA: Transferring Vision-language-action Models for General Mobile Manipulation [paper] [project]

  • [2024] NaVILA: Legged Robot Vision-Language-Action Model for Navigation [paper] [project]

  • [2024] QUAR-VLA: Vision-Language-Action Model for Quadruped Robots [paper] [project]

  • [2024] NaviLLM: Towards Learning a Generalist Model for Embodied Navigation [paper] [code]

  • [2024] NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation [paper] [project]

  • [2023] VLN-SIG: Improving Vision-and-Language Navigation by Generating Future-View Image Semantics [paper] [project]

  • [2023] PanoGen: Text-Conditioned Panoramic Environment Generation for Vision-and-Language Navigation [paper] [project]

3.2.3 Human-Robot Interaction (HRI)

Focuses on enabling more natural and effective interactions between humans and robots, often using language (text or speech) as the primary interface.

  • [2026] The Robot's Inner Critic: Self-Refinement of Social Behaviors through VLM-based Replanning [paper] [project]

  • [2026] Morphology-Consistent Humanoid Interaction through Robot-Centric Video Synthesis [paper]

  • [2026] Not an Obstacle for Dog, but a Hazard for Human: A Co-Ego Navigation System for Guide Dog Robots [paper]

  • [2025] Unveiling the Potential of Vision-Language-Action Models with Open-Ended Multimodal Instructions [paper] (OE-VLA)

  • [2025] VLAS: Vision-Language-Action Model With Speech Instructions For Customized Robot Manipulation [paper]

  • [2025] Shake-VLA: Vision-Language-Action Model-Based System for Bimanual Robotic Manipulations and Liquid Mixing [paper]

  • [2025] VLA Model-Expert Collaboration for Bi-directional Manipulation Learning [paper]

  • [2025] Helix: A Vision-Language-Action Model for Generalist Humanoid Control [project]

  • [2025] CognitiveDrone: A VLA Model and Evaluation Benchmark for Real-Time Cognitive Task Solving and Reasoning in UAVs [paper][project]

  • [2024] TalkWithMachines: Enhancing Human-Robot Interaction for Interpretable Industrial Robotics Through Large/Vision Language Models [paper][project]

3.2.4 Task Planning / Reasoning

Focuses on using VLA/LLM components for high-level task decomposition, planning, and reasoning, often bridging the gap between complex instructions and low-level actions.

  • [2026] HALO: A Unified Vision-Language-Action Model for Embodied Multimodal Chain-of-Thought Reasoning [paper]

  • [2026] Self-Correcting VLA: Online Action Refinement via Sparse World Imagination [paper]

  • [2026] LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies [paper]

  • [2025] MemER: Scaling Up Memory for Robot Control via Experience Retrieval [paper] [project]

  • [2025] Long-VLA: Unleashing Long-Horizon Capability of Vision Language Action Model for Robot Manipulation [paper] [project]

  • [2025] ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning [paper] [project]

  • [2025] Agentic Robot: A Brain-Inspired Framework for Vision-Language-Action Models in Embodied Agents [paper] [project]

  • [2025] Training Strategies for Efficient Embodied Reasoning (ECoT-Lite) [paper]

  • [2025] OneTwoVLA: A Unified Vision-Language-Action Model with Adaptive Reasoning [paper] [project]

  • [2025] Vision-Language-Action Model with Open-World Embodied Reasoning from Pretrained Knowledge [paper] [project]

  • [2025] Hume: Introducing System-2 Thinking in Visual-Language-Action Model [paper] [project]

  • [2025] Robotic Control via Embodied Chain-of-Thought Reasoning [paper] [project][code]

  • [2025] GR00T N1: An Open Foundation Model for Generalist Humanoid Robots [paper] [Code]

  • [2025] Gemini Robotics: Bringing AI into the Physical World [report]

  • [2025] GRAPE: Generalizing Robot Policy via Preference Alignment [paper]

  • [2025] HAMSTER: Hierarchical Action Models For Open-World Robot Manipulation [paper]

  • [2025] π0.5: A Vision-Language-Action Model with Open-World Generalization [paper] [project]

  • [2025] Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models [paper] [project]

  • [2025] CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models [paper] [project]

  • [2025] Embodied-R: Collaborative Framework for Activating Embodied Spatial Reasoning in Foundation Models via Reinforcement Learning [paper] [project]

  • [2025] RoboBrain: A Unified Brain Model for Robotic Manipulation from Abstract to Concrete [paper] [project] [code]

  • [2025] Probing a Vision-Language-Action Model for Symbolic States and Integration into a Cognitive Architecture [paper]

  • [2024] RoboMamba: Efficient Vision-Language-Action Model for Robotic Reasoning and Manipulation [paper] [project]

  • [2024] Improving Vision-Language-Action Models via Chain-of-Affordance [paper] [project]

  • [2023] PaLM-E: An Embodied Multimodal Language Model [paper] [project]

  • [2023] EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought [paper] [code]

  • [2022] LLM-Planner: Few-Shot Grounded Planning with Large Language Models [paper] [project]

  • [2022] Code as Policies: Language Model Programs for Embodied Control [paper] [project]

  • [2022] Inner Monologue: Embodied Reasoning through Planning with Language Models [paper] [project]

  • [2022] Do As I Can, Not As I Say: Grounding Language in Robotic Affordances (SayCan) [paper] [project] [code]

3.2.5 Humanoid

  • [2026] PhysiFlow: Physics-Aware Humanoid Whole-Body VLA via Multi-Brain Latent Flow Matching and Robust Tracking [paper]

  • [2026] Cognition to Control - Multi-Agent Learning for Human-Humanoid Collaborative Transport [paper]

  • [2025] GR00T N1: An Open Foundation Model for Generalist Humanoid Robots [paper] [Code]

  • [2025] Helix: A Vision-Language-Action Model for Generalist Humanoid Control [project]

  • [2025] Humanoid-VLA: Towards Universal Humanoid Control with Visual Integration [paper]

3.2.6 Other

  • [2025] Adversarial Attacks on Robotic Vision Language Action Models [paper][Code]

  • [2025] Vision-Language-Action Model with Open-World Embodied Reasoning from Pretrained Knowledge [paper][Project](ChatVLA-2)

  • [2025] ChatVLA: Unified Multimodal Understanding and Robot Control with Vision-Language-Action Model [paper][Project]

  • [2025] OpenDriveVLA: Towards End-to-end Autonomous Driving with Large Vision Language Action Model [paper][Project][Code]

  • [2024] OccLLaMA: An Occupancy-Language-Action Generative World Model for Autonomous Driving [paper]

  • [2024] DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models [paper][Project]

  • [2024] EMMA: End-to-End Multimodal Model for Autonomous Driving [paper][Code]

3.3 By Technical Approach

3.3.1 Model Architectures

Focuses on the core neural network architectures used in VLA models.

3.3.2 Action Representation & Generation

Focuses on how robot actions are represented (e.g., discrete tokens vs. continuous vectors) and how models generate them. This is a key area differentiating VLAs from VLMs.

  • Action Tokenization / Discretization: Representing continuous actions (e.g., joint angles, end-effector pose) as discrete tokens, often via binning. Used in many early Transformer-based VLAs such as RT-1 and RT-2 to fit the language-modeling paradigm. May have limitations in precision and high-frequency control.

  • Continuous Action Regression: Directly predicting continuous action vectors. Sometimes used in conjunction with other methods or implemented via specific heads. L1 regression is used in OpenVLA-OFT.

  • Diffusion Policies for Actions: Modeling action generation as a denoising diffusion process. Good at capturing multi-modality and continuous spaces. Applications:

  • Flow Matching: An alternative generative method for continuous actions, used in π0 for efficient, high-frequency (50Hz) trajectory generation.

  • Action Chunking: Predicting multiple future actions in a single step, for efficiency and temporal consistency. Increases action dimensionality and inference time when using AR decoding. Applications:

  • Better Decoding Strategy: Techniques to speed up autoregressive decoding of action chunks.

  • Specialized Tokenizers: Developing better ways to tokenize continuous action sequences. Applications:

    • FAST (designed for dexterous, high-frequency tasks).
    • KineVLA [paper]
  • Point-based Actions: Using VLMs to predict keypoints or goal locations rather than full trajectories. Applications:

  • Mid-Level Language Actions: Generating actions as natural language commands to be consumed by a lower-level policy. Applications:
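The uniform binning mentioned above can be illustrated with a tokenize/detokenize round trip. This is a sketch, not any specific model's implementation: the 256-bin count follows RT-style discretization, while the action bounds and rounding are assumptions.

```python
# Uniform action binning: map each continuous action dimension into one
# of N discrete bins (tokens), and back. RT-1/RT-2-style models use a
# scheme like this (commonly 256 bins); exact ranges vary per robot.

N_BINS = 256
LOW, HIGH = -1.0, 1.0  # assumed per-dimension action bounds

def tokenize(action):
    """Continuous action vector -> list of bin indices in [0, N_BINS-1]."""
    tokens = []
    for a in action:
        a = min(max(a, LOW), HIGH)  # clip to bounds
        idx = int((a - LOW) / (HIGH - LOW) * (N_BINS - 1) + 0.5)  # round to nearest bin
        tokens.append(idx)
    return tokens

def detokenize(tokens):
    """Bin indices -> bin-center continuous values (lossy inverse)."""
    return [LOW + t / (N_BINS - 1) * (HIGH - LOW) for t in tokens]

a = [0.03, -0.51, 0.999, 0.0]
recon = detokenize(tokenize(a))
# Quantization error is bounded by half a bin width -- the precision
# limitation noted above for discretized action spaces.
bin_width = (HIGH - LOW) / (N_BINS - 1)
assert all(abs(x - y) <= bin_width / 2 + 1e-9 for x, y in zip(a, recon))
```

The round trip makes the trade-off explicit: discretization lets actions share the language-model vocabulary, at the cost of a quantization error that grows as the bin count shrinks.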

3.3.3 Learning Paradigms

Focuses on how VLA models are trained and adapted.

  • Imitation Learning (IL) / Behavior Cloning (BC): Dominant paradigm, training VLAs to mimic expert demonstrations (often from teleoperation). Heavily reliant on large-scale, diverse, high-quality datasets. Performance is often limited by the quality of the demonstrations. Applications:

  • Reinforcement Learning (RL): Used to fine-tune VLAs or train components, allowing models to learn from interaction and potentially exceed demonstrator performance. Challenges include stability and sample efficiency with large models. Applications:

    • iRe-VLA (iterative RL/SFT), MoRE (RL objective for MoE VLAs handling mixed data), RPD (RL-based policy distillation), ConRFT (RL fine-tuning with consistency policies), SafeVLA (constrained RL for safety), RIPT-VLA, VLA-RL, SimpleVLA-RL.
    • Robot-R1
    • WoVR (World Models as Reliable Simulators for Post-Training VLA Policies with RL)
  • Pre-training & Fine-tuning: Standard approach, involving pre-training on large datasets (web data for VLM backbones, large robot datasets like OpenX for VLAs) and then fine-tuning on specific tasks or robots.

    • Fine-tuning by RL
      • [2025] ConRFT: A Reinforced Fine-tuning Method for VLA Models via Consistency Policy, RSS 2025 [Paper][Project]
  • Parameter-Efficient Fine-Tuning (PEFT): Techniques like LoRA to efficiently adapt large VLAs without retraining the entire model, crucial for practical deployment and customization. MoRE uses LoRA modules as experts.
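
A minimal numpy sketch of a LoRA layer (shapes and rank are illustrative): the frozen pretrained weight W gets a trainable low-rank correction B @ A, scaled by alpha / r and zero-initialized so fine-tuning starts at the pretrained behavior.

```python
import numpy as np

d_out, d_in, r, alpha = 512, 512, 8, 16     # hypothetical layer sizes
rng = np.random.default_rng(0)

W = rng.normal(size=(d_out, d_in))          # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01       # trainable down-projection
B = np.zeros((d_out, r))                    # trainable up-projection (zero init)

def lora_forward(x):
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
# With B = 0 the adapter is a no-op: the adapted model starts out
# identical to the pretrained one.
assert np.allclose(lora_forward(x), W @ x)

# Trainable parameters vs. full fine-tuning of this layer:
full, lora = W.size, A.size + B.size
assert lora / full < 0.04                   # ~3% of the parameters here
```

Only A and B are updated during fine-tuning, which is what makes adapting multi-billion-parameter VLAs tractable, and why MoRE can afford one LoRA module per expert.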

  • Distillation: Training smaller, faster models (students) to mimic the behavior of larger, slower models (teachers). Applications:

    • RPD (distilling a VLA to an RL policy), OneDP (distilling a diffusion policy).
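
The distillation objective can be sketched with linear stand-ins for teacher and student (the methods above distill large networks; this only shows the regression target): the student is fit to reproduce the frozen teacher's actions over a dataset of observations.

```python
import numpy as np

rng = np.random.default_rng(1)
obs = rng.normal(size=(200, 16))            # observation features
W_teacher = rng.normal(size=(16, 7))        # frozen "large" teacher (linear stand-in)
teacher_actions = obs @ W_teacher           # teacher labels the dataset

# Least-squares fit of a linear student to the teacher's outputs;
# a neural student would minimize the same MSE by gradient descent.
W_student, *_ = np.linalg.lstsq(obs, teacher_actions, rcond=None)

mse = np.mean((obs @ W_student - teacher_actions) ** 2)
assert mse < 1e-10                          # student matches teacher here
```

The student can be made much smaller or faster than the teacher, since it only has to match the teacher's input-output behavior on the relevant data distribution.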
  • Curriculum Learning: Structuring the learning process, e.g., by embodiment complexity. Applications:

    • DexVLA uses embodied curriculum.
  • Learning from Mixed-Quality Data: Using techniques (e.g., RL in MoRE) to learn effectively even when demonstration data is suboptimal or contains failures.

  • Reasoning-Augmented Training / Inference: Enhancing policy quality via explicit intermediate reasoning. Applications:

3.3.4 Input Modalities & Grounding

Focuses on input data types beyond standard RGB images and text used by VLAs, and how they ground these inputs.

  • Integrating Speech: Control via spoken commands, potentially capturing nuances missed by text. Requires handling the speech modality directly or via ASR. Applications:

  • Integrating 3D Vision: Using point clouds, voxels, depth maps, or implicit representations (NeRFs, 3DGS) to provide richer spatial understanding. Applications:

  • Integrating Proprioception / State: Incorporating the robot's own state (joint angles, velocities, end-effector pose) as input. Common in many policies, explicitly mentioned in VLAS, PaLM-E, π0 (evaluation requires Simpler fork with proprioception support). OpenVLA initially lacked this, noted as a limitation/future work.

  • Multimodal Prompts: Handling instructions that include images or video in addition to text. Applications:

  • Grounding: The process of linking language descriptions or visual perceptions to specific entities, locations, or actions in the physical world or robot representation. Addressed via various techniques like similarity matching, leveraging common-sense knowledge, multimodal alignment, or interaction. LLM-Grounder focuses on open-vocabulary 3D visual grounding.
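
The similarity-matching approach to grounding can be sketched with a toy embedding table standing in for a real CLIP-style encoder (the vocabulary and vectors are invented for illustration): embed the instruction and each candidate object label in a shared space, then pick the candidate with the highest cosine similarity.

```python
import numpy as np

# Toy embedding table; a real system would use a learned text/image encoder.
VOCAB = {"mug": [1.0, 0.1, 0.0], "apple": [0.0, 1.0, 0.2], "drill": [0.1, 0.0, 1.0]}

def embed(text):
    """Toy embedding: average the vectors of known words, L2-normalized."""
    vecs = [VOCAB[w] for w in text.lower().split() if w in VOCAB]
    v = np.mean(vecs, axis=0)
    return v / np.linalg.norm(v)

def ground(instruction, candidates):
    """Return the candidate whose embedding is closest to the instruction's."""
    q = embed(instruction)
    sims = {c: float(embed(c) @ q) for c in candidates}
    return max(sims, key=sims.get)

assert ground("pick up the mug", ["apple", "mug", "drill"]) == "mug"
assert ground("hand me the drill", ["apple", "mug", "drill"]) == "drill"
```

With real encoders the same cosine-similarity scheme supports open-vocabulary grounding, since any phrase and any detected object can be embedded into the shared space.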

3.3.5 Fine-tuning

4. Datasets and Benchmarks

This section lists key resources for training and evaluating VLA models. Large-scale, diverse datasets and standardized benchmarks are crucial for progress in the field.

4.1 Quick Glance at Datasets and Benchmarks

| Name | Type | Focus Area | Key Features / Environment | Link | Key Publication |
| --- | --- | --- | --- | --- | --- |
| Open X-Embodiment (OpenX) | Dataset | General Manipulation | Aggregates 20+ datasets, cross-embodiment/task/environment, >1M trajectories | Project | arXiv |
| DROID | Dataset | Real-world Manipulation | Large-scale human-collected data (500+ tasks, 26k hours) | Project | arXiv |
| CALVIN | Dataset / Benchmark | Long-Horizon Manipulation | Long-horizon tasks with language conditioning, Franka arm, PyBullet simulation | Project | arXiv |
| QUARD | Dataset | Quadruped Robot Tasks | Large-scale multi-task dataset (sim + real) for navigation and manipulation | Project | ECCV 2024 |
| BEHAVIOR-1K | Dataset / Benchmark | Household Activities | 1000 simulated human household activities | Project | arXiv |
| Isaac Sim / Orbit / OmniGibson | Simulator | High-fidelity Robot Simulation | NVIDIA Omniverse-based, physically realistic | Isaac-sim, Orbit, OmniGibson | - |
| Habitat Sim | Simulator | Embodied AI Navigation | Flexible, high-performance 3D simulator | Project | arXiv |
| MuJoCo | Simulator | Physics Engine | Popular physics engine for robotics and RL | Website | - |
| PyBullet | Simulator | Physics Engine | Open-source physics engine, used for CALVIN, etc. | Website | - |
| ManiSkill (1, 2, 3) | Benchmark | Generalizable Manipulation Skills | Large-scale manipulation benchmark based on SAPIEN | Project | arXiv |
| Meta-World | Benchmark | Multi-task / Meta RL Manipulation | 50 Sawyer arm manipulation tasks, MuJoCo | Project | arXiv |
| RLBench | Benchmark | Robot Learning Manipulation | 100+ manipulation tasks, CoppeliaSim (V-REP) | Project | arXiv |
| VLN-CE / R2R / RxR | Benchmark | Vision-Language Nav | Standard VLN benchmarks, often run in Habitat | VLN-CE, R2R-EnvDrop, RxR | - |

4.2 Robot Learning Datasets

Large-scale datasets of robot interaction trajectories, often with accompanying language instructions and visual observations. Crucial for training general-purpose policies via imitation learning.

  • [2026] HortiMulti: A Multi-Sensor Dataset for Localisation and Mapping in Horticultural Polytunnels [paper]

  • Open X-Embodiment (OpenX) [Project] - Open X-Embodiment Collaboration.

    DetailsA massive, standardized dataset aggregating data from 20+ existing robot datasets, spanning diverse embodiments, tasks, and environments. Used to train major VLAs like RT-X, Octo, OpenVLA, π0. Contains over 1 million trajectories.
  • BridgeData V2 [Project] - Walke, H., et al.

    DetailsLarge dataset collected on a WidowX robot, used for OpenVLA evaluation.
  • DROID [Project] - Manuelli, L., et al.

    DetailsLarge-scale, diverse, human-collected manipulation dataset (500+ tasks, 26k hours). Used to fine-tune/evaluate OpenVLA, π0.
  • RH20T [Project] - Shao, L., et al.

    DetailsComprehensive dataset with 110k robot clips, 110k human demonstrations, and 140+ tasks.
  • CALVIN (Composing Actions from Language and Vision) [Project] - Mees, O., et al.

    DetailsBenchmark and dataset for long-horizon language-conditioned manipulation with a simulated Franka arm in PyBullet.
  • QUARD (QUAdruped Robot Dataset) [Project] - Tang, J., et al.

    DetailsLarge-scale multi-task dataset (sim + real) for quadruped navigation and manipulation, released with QUAR-VLA. Reported sizes vary between sources: 348k sim + 3k real clips, or 246k sim + 3k real clips.
  • RoboNet [Project] - Dasari, S., et al.

    DetailsEarly large-scale dataset aggregating data from multiple robot platforms.
  • BEHAVIOR-1K [Project] - Srivastava, S., et al.

    DetailsDataset of 1000 simulated human household activities, useful for high-level task understanding.
  • SQA & CSI Datasets [arXiv] - Zhao, W., et al.

    DetailsCurated datasets with speech instructions, released with the VLAS model, for speech-vision-action alignment and fine-tuning.
  • LIBERO [Project] - Li, Z., et al.

    DetailsBenchmark suite for robot lifelong learning with procedurally generated tasks. Used in π0 fine-tuning examples.
  • D4RL (Datasets for Deep Data-Driven Reinforcement Learning) [Code] - Fu, J., et al.

    DetailsStandardized datasets for offline RL research, potentially useful for RL-based VLA methods.

4.3 Simulation Environments

Physics-based simulators used to train agents, generate synthetic data, and evaluate policies in controlled settings before real-world deployment.

  • NVIDIA Isaac Sim / Orbit / OmniGibson [Isaac-sim, Orbit, OmniGibson].

    DetailsHigh-fidelity, physically realistic simulators based on NVIDIA Omniverse. Used for QUAR-VLA, ReKep, ARNOLD, etc.
  • Habitat Sim [Project] - Facebook AI Research (Meta AI).

    DetailsFlexible, high-performance 3D simulator for Embodied AI research, especially navigation.
  • MuJoCo (Multi-Joint dynamics with Contact) [Project].

    DetailsPopular physics engine widely used for robot simulation and RL benchmarks (dm_control, robosuite, Meta-World, RoboHive).
  • PyBullet [Project].

    DetailsOpen-source physics engine, used for CALVIN and other benchmarks (panda-gym).
  • SAPIEN [Project].

    DetailsPhysics simulator focused on articulated objects and interaction. Used for the ManiSkill benchmark.
  • Gazebo [Project].

    DetailsWidely used open-source robot simulator, especially in the ROS ecosystem.
  • Webots [Project].

    DetailsOpen-source desktop robot simulator.
  • Genesis [GitHub].

    DetailsA newer platform aimed at general robot/Embodied AI simulation.
  • UniSim [arXiv] - Yang, G., et al.

    DetailsLearns interactive simulators from real-world videos.

4.4 Evaluation Benchmarks

Standardized suites of environments and tasks used to evaluate and compare the performance of VLA models and other robot learning algorithms.

  • [2026] NavTrust: Benchmarking Trustworthiness for Embodied Navigation [paper] [project] [code]

  • [2026] HUGE-Bench: A Benchmark for High-Level UAV Vision-Language-Action Tasks [paper]

  • [2026] Can LLMs Prove Robotic Path Planning Optimality? A Benchmark for Research-Level Algorithm Verification [paper]

  • CALVIN [Project].

    DetailsBenchmark for long-horizon language-conditioned manipulation.
  • ManiSkill (1, 2, 3) [Project]

    DetailsLarge-scale benchmark for generalizable manipulation skills, based on SAPIEN.
  • Meta-World [Project].

    DetailsMulti-task and meta-RL benchmark with 50 different manipulation tasks using a Sawyer arm in MuJoCo.
  • RLBench [Project].

    DetailsLarge-scale benchmark with 100+ manipulation tasks in CoppeliaSim (V-REP).
  • Franka Kitchen [GitHub].

    Detailsdm_control-based benchmark involving kitchen tasks with a Franka arm. Used in iRe-VLA.
  • LIBERO [Project].

    DetailsBenchmark for lifelong/continual learning in robot manipulation.
  • VIMA-Bench [Project].

    DetailsMultimodal few-shot prompting benchmark for robot manipulation.
  • BEHAVIOR-1K [Project].

    DetailsBenchmark focused on long-horizon household activities.
  • VLN-CE / R2R / RxR [VLN-CE, R2R-EnvDrop, RxR].

    DetailsStandard benchmarks for Vision-Language Navigation, often run in Habitat. NaVILA is evaluated on these.
  • Safety-CHORES [paper].

    DetailsA new simulated benchmark with safety constraints, proposed for evaluating safe VLA learning.
  • OK-VQA [Project].

    DetailsVisual question answering benchmark requiring external knowledge, used to evaluate the general VLM abilities of [PaLM-E](https://arxiv.org/abs/2303.03378).

5. Challenges and Future Directions

  • Data Efficiency & Scalability: Reducing reliance on massive, expensive, expert-driven datasets. Improving the ability to learn from limited, mixed-quality, or internet-sourced data. Efficiently scaling models and training processes.

    • Future directions: Improved sample efficiency (RL, self-supervision), sim-to-real transfer, automated data generation, efficient architectures (SSMs, MoEs), data filtering/weighting.
  • Inference Speed & Real-Time Control: Current large VLAs may be too slow for the high-frequency control loops needed for dynamic tasks or dexterous manipulation.

    • Future directions: Smaller/compact models (TinyVLA), efficient architectures (RoboMamba), parallel decoding (PD-VLA), action chunking optimization (FAST), model distillation (OneDP, RPD), hardware acceleration.
  • Robustness & Reliability: Ensuring consistent performance across variations in environment, lighting, object appearance, disturbances, and unexpected events. Current models can be brittle.

    • Future directions: Adversarial training, improved grounding, better 3D understanding, closed-loop feedback, anomaly detection, incorporating physical priors, testing frameworks (VLATest).
  • Generalization: Improving the ability to generalize to new tasks, objects, instructions, environments, and embodiments beyond the training distribution. This is a core promise of VLAs, but remains a challenge.

    • Future directions: Training on more diverse data (OpenX), effective utilization of VLM pre-training knowledge, compositional reasoning, continual/lifelong learning, better action representations.
  • Safety & Alignment: Explicitly incorporating safety constraints to prevent harm to the robot, the environment, or humans. Ensuring alignment with user intent. Crucial for real-world deployment.

    • Future directions: Constrained reinforcement learning (SafeVLA), formal verification, human oversight mechanisms, robust failure detection/recovery, ethical considerations.
  • Dexterity & Contact-Rich Tasks: Improving performance on tasks requiring fine motor skills, precise force control, and handling complex object interactions. Current VLAs often lag behind specialized methods in this area.

    • Future directions: Better action representations (FAST, Diffusion), integration of tactile sensing, improved physical understanding/simulation, hybrid control approaches.
  • Reasoning & Long-Horizon Planning: Enhancing the ability for multi-step reasoning, long-horizon planning, and handling complex instructions.

    • Future directions: Hierarchical architectures, explicit planning modules, chain-of-thought reasoning (visual/textual), memory mechanisms, world models.
  • Multimodality Expansion: Integrating richer sensory inputs beyond vision + language, such as audio/speech, touch, force, 3D.

    • Future directions: Developing architectures and alignment techniques for diverse modalities.

6. Related Awesome Lists

Citation

If you find this repository useful, please consider citing this list:

@misc{liu2025vlaroboticspaperslist,
    title = {Awesome-VLA-Robotics},
    author = {Jiaqi Liu},
    howpublished = {GitHub repository},
    url = {https://github.com/Jiaaqiliu/Awesome-VLA-Robotics},
    year = {2025},
}

Star History

Star History Chart
