Jiaaqiliu/Awesome-VLA-Robotics


Awesome VLA for Robotics

A curated list of research papers, models, datasets, and other resources on Vision-Language-Action (VLA) models in robotics. The accompanying survey paper will be released soon. Contributions are welcome!


1. What are VLA Models in Robotics?

  • Definition: Vision-Language-Action (VLA) models are a class of multimodal AI systems specifically designed for robotics and Embodied AI. They integrate visual perception (from cameras/sensors), natural language understanding (from text or voice commands), and action generation (physical movements or digital tasks) into a unified framework. Unlike traditional robotic systems that often treat perception, planning, and control as separate modules, VLAs aim for end-to-end or tightly integrated processing, similar to how the human brain processes these modalities simultaneously. The term "VLA" gained prominence with the introduction of the RT-2 model. Generally, a VLA is defined as any model capable of processing multimodal inputs (vision, language) to generate robotic actions for completing embodied tasks.

  • Core Concepts: The basic idea is to leverage the powerful capabilities of large models (LLMs and VLMs) pre-trained on internet-scale data and apply them to robot control. This involves "grounding" language instructions and visual perception information in the physical world to generate appropriate robot actions. The goal is to achieve greater versatility, dexterity, generalization ability, and robustness compared to traditional methods or early reinforcement learning approaches, enabling robots to work effectively in complex, unstructured environments like homes.

  • Key Components:

    • Vision Encoder: Processes raw visual input (images, videos, sometimes 3D data), using architectures like ViT, CLIP encoders, DINOv2, SigLIP to extract meaningful features (object recognition, spatial reasoning).
    • Language Understanding: Employs LLM components (such as Llama, PaLM, GPT variants) to process natural language instructions, map commands to context, and perform reasoning.
    • Action Decoder/Policy: Generates robot actions based on integrated visual and language understanding (e.g., end-effector pose, joint velocities, gripper commands, base movements). This is a significant differentiator between VLAs and VLMs, involving techniques like action tokenization, diffusion models, or direct regression.
    • Alignment Mechanisms: Uses strategies like projection layers and cross-attention to bridge the gap between different modalities, aligning visual, language, and action representations.
  • Relationship to VLMs and Embodied AI: VLAs are a specialized category within the field of Embodied AI. They extend Vision-Language Models (VLMs) by explicitly incorporating action generation capabilities. VLMs primarily focus on understanding and generating text based on visual input, while VLAs leverage this understanding to interact with the physical world.

  • Evolution from VLM Adaptation to Integrated Systems: Early VLA research focused mainly on adapting existing VLMs by simply fine-tuning them to output action tokens (e.g., the initial concept of RT-2). However, the field is moving towards more integrated architectures where the action generation components are more sophisticated and co-designed (e.g., diffusion policies, specialized action modules, hierarchical systems like Helix or NaVILA). This evolution indicates that the definition of VLA is shifting from merely fine-tuning VLMs to designing specific VLA architectures that better address the unique requirements of robot action generation, while still leveraging the capabilities of VLMs.
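The component breakdown above can be made concrete with a minimal, purely illustrative sketch of how the pieces compose. Every class and function here is a hypothetical stand-in (not taken from any published model): a real VLA would use a ViT/SigLIP vision tower, an LLM backbone, a learned projection or cross-attention for alignment, and a trained action head.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class VLAOutput:
    action: List[float]  # e.g. [dx, dy, dz, droll, dpitch, dyaw, gripper]

class VisionEncoder:
    def encode(self, image: List[List[float]]) -> List[float]:
        # Stand-in: global average pooling instead of a ViT forward pass.
        return [sum(row) / len(row) for row in image]

class LanguageEncoder:
    def encode(self, instruction: str) -> List[float]:
        # Stand-in: bag-of-characters features instead of LLM hidden states.
        return [instruction.count(c) / max(len(instruction), 1) for c in "aeiou"]

class ActionDecoder:
    """Maps fused vision-language features to a 7-DoF action vector."""
    def decode(self, features: List[float]) -> List[float]:
        # Stand-in: fixed linear map; a real head would be a learned token
        # decoder, regressor, or diffusion/flow model.
        return [sum(features) * w for w in (0.1, 0.2, 0.3, 0.0, 0.0, 0.0, 1.0)]

class VLAPolicy:
    def __init__(self):
        self.vision = VisionEncoder()
        self.language = LanguageEncoder()
        self.decoder = ActionDecoder()

    def act(self, image, instruction: str) -> VLAOutput:
        # "Alignment" here is plain concatenation; real models use
        # projection layers or cross-attention to bridge modalities.
        fused = self.vision.encode(image) + self.language.encode(instruction)
        return VLAOutput(action=self.decoder.decode(fused))

policy = VLAPolicy()
out = policy.act([[0.1, 0.2], [0.3, 0.4]], "pick up the red block")
assert len(out.action) == 7
```

The point of the sketch is only the data flow — perception and language features are fused, then a dedicated action module (the main differentiator from a VLM) maps them to robot commands.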

2. Survey papers

  • [2025] A Survey on Efficient Vision-Language-Action Models [paper]

  • [2025] Vision-Language-Action Models for Robotics: A Review Towards Real-World Applications [paper]

  • [2025] Pure Vision Language Action (VLA) Models: A Comprehensive Survey [paper]

  • [2025] Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey [paper] [project]

  • [2025] A Survey on Vision-Language-Action Models: An Action Tokenization Perspective [paper]

  • [2025] Foundation Model Driven Robotics: A Comprehensive Review [paper]

  • [2025] A Survey on Vision-Language-Action Models for Autonomous Driving [paper] [project]

  • [2025] Parallels Between VLA Model Post-Training and Human Motor Learning: Progress, Challenges, and Trends [paper] [project]

  • [2025] A Survey on Vision-Language-Action Models for Embodied AI [paper]

  • [2025] Foundation Models in Robotics: Applications, Challenges, and the Future [paper] [project]

  • [2025] Vision Language Action Models in Robotic Manipulation: A Systematic Review [paper]

  • [2025] Vision-Language-Action Models: Concepts, Progress, Applications and Challenges [paper]

  • [2025] OpenHelix: A Short Survey, Empirical Analysis, and Open-Source Dual-System VLA Model for Robotic Manipulation [paper] [project]

  • [2025] Exploring Embodied Multimodal Large Models: Development, Datasets, and Future Directions [paper]

  • [2025] Multimodal Fusion and Vision-Language Models: A Survey for Robot Vision [paper] [project]

  • [2025] Generative Artificial Intelligence in Robotic Manipulation: A Survey [paper] [project]

  • [2025] Neural Brain: A Neuroscience-inspired Framework for Embodied Agents [paper] [project]

  • [2024] Aligning Cyber Space with Physical World: A Comprehensive Survey on Embodied AI [paper]

  • [2024] A Survey on Robotics with Foundation Models: toward Embodied AI [paper]

  • [2024] What Foundation Models can Bring for Robot Learning in Manipulation: A Survey [paper]

  • [2024] Towards Generalist Robot Learning from Internet Video: A Survey [paper]

  • [2024] Large Multimodal Agents: A Survey [paper]

  • [2024] A Survey on Integration of Large Language Models with Intelligent Robots [paper]

  • [2024] Vision-Language Models for Vision Tasks: A Survey [paper]

  • [2024] A Survey of Embodied Learning for Object-Centric Robotic Manipulation [paper]

  • [2024] Vision-language navigation: a survey and taxonomy [paper]

  • [2023] Toward general-purpose robots via foundation models: A survey and meta-analysis [paper]

  • [2023] Robot learning in the era of foundation models: A survey [paper]

3. Key VLA Models and Research Papers

This section is the heart of the resource, listing specific VLA models and influential research papers. Papers are first categorized by major application area, then by key technical contributions. A paper/model may appear in multiple subsections if it is relevant to several categories.

3.1 Quick Glance at Key VLA Models

Each entry lists the model's key contribution / features, its base VLM / architecture, its action generation method, and available resources (paper / project / code).

  • RT-1 — First large-scale Transformer robot model; demonstrates scalability on multi-task real-world data; action discretization. Base: Transformer (EfficientNet-B3 vision). Actions: action binning + token output. Links: arXiv / Project / Code

  • RT-2 — Transfers web knowledge of VLMs to robot control; joint fine-tuning of a VLM to output action tokens; shows emergent generalization. Base: PaLI-X / PaLM-E (Transformer). Actions: action binning + token output. Links: arXiv / Project

  • PaLM-E — Embodied multimodal language model; injects continuous sensor data (images, state) into a pre-trained LLM; usable for sequential manipulation planning, VQA, etc. Base: PaLM (Transformer). Actions: outputs subgoals or action descriptions. Links: ICML / Project

  • OpenVLA — Open-source 7B-parameter VLA; based on Llama 2; trained on the OpenX dataset; outperforms RT-2-X; shows good generalization and PEFT ability. Base: Llama 2 (DINOv2 + SigLIP vision). Actions: action binning + token output. Links: arXiv / Project / Code / HF

  • Helix — General-purpose VLA for humanoid robots; hierarchical System 1 / System 2 architecture; full-body control; multi-robot collaboration; onboard deployment. Base: custom VLM (System 2) + visuomotor policy (System 1). Actions: continuous action output (System 1). Links: Paper / Project

  • π0 (Pi-Zero) — General-purpose VLA; uses flow matching to generate continuous action trajectories at 50 Hz; cross-platform training (7 platforms, 68 tasks). Base: PaliGemma (Transformer) + action expert. Actions: flow matching. Links: arXiv / Project / Code / HF

  • Octo — General-purpose robot model; trained on the OpenX dataset; flexible input/output conditioning; often used as a baseline. Base: Transformer (ViT). Actions: action binning + token output / diffusion head. Links: arXiv / Project / Code

  • SayCan — Grounds LLM planning in robot affordances; an LLM scores skill relevance while a value function scores executability. Base: PaLM (Transformer) + value function. Actions: selects pre-defined skills (high-level planner). Links: arXiv / Project / Code

  • NaVILA — Two-stage framework for legged-robot VLN; a high-level VLA outputs mid-level language actions that a low-level vision-motor policy executes. Base: InternVL-Chat-V1.5 (VLM) + locomotion policy (RL). Actions: mid-level language action output (VLA). Links: arXiv / Project

  • VLAS — First end-to-end VLA with direct integration of speech commands; based on LLaVA; three-stage fine-tuning for voice commands; supports personalized tasks (Voice RAG). Base: LLaVA (Transformer) + speech encoder. Actions: action binning + token output. Links: arXiv

  • CoT-VLA — Incorporates explicit visual chain-of-thought (Visual CoT) reasoning; predicts future goal images before generating actions; hybrid attention mechanism. Base: Llama 2 (ViT vision). Actions: action binning + token output (after predicting visual goals). Links: arXiv / Project

  • TinyVLA — Compact, fast, data-efficient VLA; requires no pre-training; pairs a small VLM with a diffusion-policy decoder. Base: MobileVLM V2 / Moondream2 + diffusion policy decoder. Actions: diffusion policy. Links: arXiv / Project

  • CogACT — Componentized VLA architecture; specialized action module (Diffusion Action Transformer) conditioned on VLM output; significantly outperforms OpenVLA / RT-2-X. Base: InternVL-Chat-V1.5 (VLM) + Diffusion Action Transformer. Actions: diffusion policy. Links: arXiv / Project

  • TLA — Tactile-Language-Action model; grounds sequential tactile feedback via cross-modal language to enable robust policy generation in contact-intensive scenarios. Base: Qwen2 7B + LoRA + Qwen2-VL. Actions: Qwen2 token output. Links: arXiv / Project

  • OpenVLA-OFT — OpenVLA with an Optimized Fine-Tuning (OFT) recipe for improved speed and success rates. Base: Llama 2 (DINOv2 + SigLIP vision). Actions: L1 regression. Links: arXiv

  • RDT — Robotics Diffusion Transformer; diffusion foundation model for bimanual manipulation (RDT-1B). Base: InternVL-Chat-V1.5 (VLM) + Diffusion Action Transformer. Actions: diffusion policy. Links: arXiv

  • π0.5 — Open-world generalist VLA; combines high-level VLM reasoning with π0's low-level dexterity; web-scale + robot-data co-training; excels at long-horizon tasks in unseen environments. Base: PaliGemma-based VLM + π0 action expert. Actions: flow matching (hierarchical). Links: arXiv / Project

  • GR00T N1 — NVIDIA's open foundation model for humanoid robots; dual-system design (fast System 1 / slow System 2); sim-to-real with Isaac; whole-body control. Base: custom VLM + diffusion policy. Actions: diffusion policy. Links: arXiv / Code

  • Gemini Robotics — Google DeepMind's multimodal robot foundation model; built on Gemini 2.0; safety-aware; multi-embodiment generalization; world understanding + dexterous control. Base: Gemini 2.0 (multimodal Transformer). Actions: continuous action output. Links: Report

  • DexVLA — Scales VLA to dexterous manipulation across embodiments; embodiment curriculum learning; diffusion action module; strong on bimanual and multi-finger tasks. Base: VLM + diffusion action module. Actions: diffusion policy. Links: arXiv / Project

  • ABot-M0 — VLA foundation model based on action manifold learning. Base: Qwen3-VL (VLM) + action-manifold-learning DiT. Actions: action manifold learning + flow matching. Links: arXiv / Project

3.2 By Application Area

3.2.1 Manipulation

Focuses on tasks involving interaction with objects, ranging from simple pick-and-place to complex, dexterous, long-horizon activities. This is a major application area for VLA research.

2026
  • [2026] Observing and Controlling Features in Vision-Language-Action Models [paper] (Stanford, Marco Pavone)

  • [2026] Pretrained Vision-Language-Action Models are Surprisingly Resistant to Forgetting in Continual Learning [paper] (UT Austin, Yuke Zhu)

  • [2026] EgoScale: Scaling Dexterous Manipulation with Diverse Egocentric Human Data [paper] (UC Berkeley, UT Austin — Trevor Darrell, Yuke Zhu, Linxi Fan)

  • [2026] FAVLA: A Force-Adaptive Fast-Slow VLA model for Contact-Rich Robotic Manipulation [paper]

  • [2026] Self-Correcting VLA: Online Action Refinement via Sparse World Imagination [paper]

  • [2026] LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies [paper]

  • [2026] DySL-VLA: Efficient Vision-Language-Action Model Inference via Dynamic-Static Layer-Skipping for Robot Manipulation [paper]

2025
  • [2025] GR-RL: Going Dexterous and Precise for Long-Horizon Robotic Manipulation [paper] [project]

  • [2025] EvoVLA: Self-Evolving Vision-Language-Action Model [paper] [project] [code]

  • [2025] OmniVLA: Physically-Grounded Multimodal VLA with Unified Multi-Sensor Perception for Robotic Manipulation [paper]

  • [2025] π0.6: a VLA that Learns from Experience [paper]

  • [2025] HiMoE-VLA: Hierarchical Mixture-of-Experts for Generalist Vision-Language-Action Policies [paper] [code]

  • [2025] Mixture of Horizons in Action Chunking [paper] [project] [code]

  • [2025] VLA-0: Building State-of-the-Art VLAs with Zero Modification [paper] [project] [code]

  • [2025] Wall-OSS: Igniting VLMs toward the Embodied Space [project] [paper] [code]

  • [2025] Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation [paper] [project] [code]

  • [2025] GR-3 Technical Report [paper] [project]

  • [2025] Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies [paper]

  • [2025] UniVLA: Unified Vision-Language-Action Model [paper] [code]

  • [2025] Gemini Robotics On-Device brings AI to local robotic devices [report]

  • [2025] EgoVLA: Learning Vision-Language-Action Models from Egocentric Human Videos [paper][Project]

  • [2025] GeoVLA: Empowering 3D Representations in Vision-Language-Action Models [Project]

  • [2025] DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge [paper][Code][Project]

  • [2025] TriVLA: A Triple-System-Based Unified Vision-Language-Action Model for General Robot Control [paper][Project]

  • [2025] VOTE: Vision-Language-Action Optimization with Trajectory Ensemble Voting [paper]

  • [2025] V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning [paper] [project] [code]

  • [2025] VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning [paper][Code]

  • [2025] CEED-VLA: Consistency Vision-Language-Action Model with Early-Exit Decoding [paper][Project]

  • [2025] Knowledge Insulating Vision-Language-Action Models: Train Fast, Run Fast, Generalize Better [paper] [project]

  • [2025] SwitchVLA: Execution-Aware Task Switching for Vision-Language-Action Models [paper][Project]

  • [2025] Helix: A Vision-Language-Action Model for Generalist Humanoid Control [project]

  • [2025] CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models [paper] [project]

  • [2025] Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models [paper] [project]

  • [2025] Interactive Post-Training for Vision-Language-Action Models (RIPT-VLA) [paper] [project]

  • [2025] UP-VLA: A Unified Understanding and Prediction Model for Embodied Agent [paper]

  • [2025] NORA: A Small Open-Sourced Generalist Vision Language Action Model for Embodied Tasks [paper] [project]

  • [2025] DexVLA: Scaling Vision-Language-Action Models for Dexterous Manipulation Across Embodiments [paper] [project]

  • [2025] Shake-VLA: Shake, Stir, and Pour with a Dual-Arm Robot: A Vision-Language-Action Model for Automated Cocktail Making [paper]

  • [2025] VLA Model-Expert Collaboration: Enhancing Vision-Language-Action Models with Human Corrections via Shared Autonomy [paper] [project]

  • [2025] FAST: Efficient Action Tokenization for Vision-Language-Action Models [paper] [project]

  • [2025] HybridVLA: Integrating Diffusion and Autoregressive Action Prediction for Generalist Robot Control [paper] [project] [code]

  • [2025] Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success [paper] [project(OpenVLA-OFT)]

  • [2025] RoboBrain: A Unified Brain Model for Robotic Manipulation from Abstract to Concrete [paper] [project] [code]

  • [2025] GRAPE: Generalizing Robot Policy via Preference Alignment [paper] [Project][Code]

  • [2025] OTTER: A Vision-Language-Action Model with Text-Aware Visual Feature Extraction [paper] [project]

  • [2025] PointVLA: Injecting the 3D World into Vision-Language-Action Models [paper] [project]

  • [2025] AgiBot World Colosseo: Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems [paper] [project]

  • [2025] DexGraspVLA: A Vision-Language-Action Framework Towards General Dexterous Grasping [paper] [project]

  • [2025] Interleave-VLA: Enhancing Robot Manipulation with Interleaved Image-Text Instructions [paper] [project]

  • [2025] EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation [paper]

  • [2025] VLA-Cache: Towards Efficient Vision-Language-Action Model via Adaptive Token Caching in Robotic Manipulation [paper]

  • [2025] SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Safe Reinforcement Learning [paper] [project]

  • [2025] Accelerating Vision-Language-Action Model Integrated with Action Chunking via Parallel Decoding [paper(PD-VLA)]

  • [2025] Refined Policy Distillation: From VLA Generalists to RL Experts [paper(RPD)]

  • [2025] MoLe-VLA: Dynamic Layer-skipping Vision Language Action Model via Mixture-of-Layers for Efficient Robot Manipulation [paper][project] [Code]

  • [2025] MoManipVLA: Transferring Vision-language-action Models for General Mobile Manipulation [paper] [project]

  • [2025] Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy [paper] [project]

  • [2025] Gemini Robotics: Bringing AI into the Physical World [report]

  • [2025] ConRFT: A Reinforced Fine-tuning Method for VLA Models via Consistency Policy, RSS 2025 [Paper][Project]

  • [2025] RoboGround: Robotic Manipulation with Grounded Vision-Language Priors [paper] [project]

  • [2025] ViSA-Flow: Accelerating Robot Skill Learning via Large-Scale Video Semantic Action Flow [paper] [project]

  • [2025] OpenHelix: A Short Survey, Empirical Analysis, and Open-Source Dual-System VLA Model for Robotic Manipulation [paper] [project]

  • [2025] ReinboT: Amplifying Robot Visual-Language Manipulation with Reinforcement Learning

  • [2025] CrayonRobo: Object-Centric Prompt-Driven Vision-Language-Action Model for Robotic Manipulation [paper]

  • [2025] Learning to Act Anywhere with Task-centric Latent Actions [paper(UniVLA)] [project]

  • [2025] Pixel Motion as Universal Representation for Robot Control [paper] [project]

2024
  • [2024] OpenVLA: An Open-Source Vision-Language-Action Model [paper] [code]

  • [2024] π₀ (Pi-Zero): Our First Generalist Policy [project] [code]

  • [2024] Octo: An Open-Source Generalist Robot Policy [paper] [project] [Code]

  • [2024] RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation [paper]

  • [2024] ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation [paper] [project] [code]

  • [2024] OK-Robot: What Really Matters in Integrating Open-Knowledge Models for Robotics [paper] [project] [code]

  • [2024] 3D-VLA: A 3D Vision-Language-Action Generative World Model [paper] [code]

  • [2024] TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation [paper] [project]

  • [2024] CogACT: Componentized Vision-Language-Action Models for Robotic Control [paper] [project]

  • [2024] RoboMM: All-in-One Multimodal Large Model for Robotic Manipulation [paper][Code]

  • [2024] Diffusion-VLA: Scaling Robot Foundation Models via Unified Diffusion and Autoregression [paper] [project]

  • [2024] HiRT: Enhancing Robotic Control with Hierarchical Robot Transformers [paper]

  • [2024] GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation [paper]

  • [2024] DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution [paper] [code]

  • [2024] RT-Affordance: Affordances are Versatile Intermediate Representations for Robot Manipulation [paper]

  • [2024] Moto: Latent Motion Token as the Bridging Language for Robot Manipulation [paper] [project]

  • [2024] Bi-VLA: Vision-Language-Action Model-Based System for Bimanual Robotic Dexterous Manipulations [paper]

  • [2024] An Embodied Generalist Agent in 3D World [paper]

  • [2024] Towards Synergistic, Generalized, and Efficient Dual-System for Robotic Manipulation [paper][project] [code]

2023
  • [2023] RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control [paper] [project]

  • [2023] PaLM-E: An Embodied Multimodal Language Model [paper] [project]

  • [2023] VIMA: General Robot Manipulation with Multimodal Prompts [paper] [project]

  • [2023] VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models [paper] [project] [code]

2022
  • [2022] RT-1: Robotics Transformer for Real-World Control at Scale [paper] [code]

  • [2022] Do As I Can, Not As I Say: Grounding Language in Robotic Affordances (SayCan) [paper] [project] [code]

  • [2022] Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation [paper] [project] [code]

3.2.2 Navigation and Mobile Manipulation

Focuses on tasks where a robot moves through an environment based on visual input and language instructions. Includes Vision-Language Navigation (VLN) and applications for legged robots.

  • [2026] OpenFrontier: General Navigation with Visual-Language Grounded Frontiers [paper]

  • [2026] PROSPECT: Unified Streaming Vision-Language Navigation via Semantic--Spatial Fusion and Latent Predictive Representation [paper]

  • [2026] History-Conditioned Spatio-Temporal Visual Token Pruning for Efficient Vision-Language Navigation [paper]

  • [2026] AnyCamVLA: Zero-Shot Camera Adaptation for Viewpoint Robust Vision-Language-Action Models [paper] [project]

  • [2025] OctoNav: Towards Generalist Embodied Navigation [paper] [project]

  • [2025] Do Visual Imaginations Improve Vision-and-Language Navigation Agents? [paper] [project]

  • [2025] SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Models [paper] [project]

  • [2025] FlexVLN: Flexible Adaptation for Diverse Vision-and-Language Navigation Tasks [paper]

  • [2025] MoManipVLA: Transferring Vision-language-action Models for General Mobile Manipulation [paper] [project]

  • [2024] NaVILA: Legged Robot Vision-Language-Action Model for Navigation [paper] [project]

  • [2024] QUAR-VLA: Vision-Language-Action Model for Quadruped Robots [paper] [project]

  • [2024] NaviLLM: Towards Learning a Generalist Model for Embodied Navigation [paper] [code]

  • [2024] NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation [paper] [project]

  • [2023] VLN-SIG: Improving Vision-and-Language Navigation by Generating Future-View Image Semantics [paper] [project]

  • [2023] PanoGen: Text-Conditioned Panoramic Environment Generation for Vision-and-Language Navigation [paper] [project]

3.2.3 Human-Robot Interaction (HRI)

Focuses on enabling more natural and effective interactions between humans and robots, often using language (text or speech) as the primary interface.

  • [2026] The Robot's Inner Critic: Self-Refinement of Social Behaviors through VLM-based Replanning [paper] [project]

  • [2026] Morphology-Consistent Humanoid Interaction through Robot-Centric Video Synthesis [paper]

  • [2026] Not an Obstacle for Dog, but a Hazard for Human: A Co-Ego Navigation System for Guide Dog Robots [paper]

  • [2025] Unveiling the Potential of Vision-Language-Action Models with Open-Ended Multimodal Instructions [paper] (OE-VLA)

  • [2025] VLAS: Vision-Language-Action Model With Speech Instructions For Customized Robot Manipulation [paper]

  • [2025] Shake-VLA: Vision-Language-Action Model-Based System for Bimanual Robotic Manipulations and Liquid Mixing [paper]

  • [2025] VLA Model-Expert Collaboration for Bi-directional Manipulation Learning [paper]

  • [2025] Helix: A Vision-Language-Action Model for Generalist Humanoid Control [project]

  • [2025] CognitiveDrone: A VLA Model and Evaluation Benchmark for Real-Time Cognitive Task Solving and Reasoning in UAVs [paper][project]

  • [2024] TalkWithMachines: Enhancing Human-Robot Interaction for Interpretable Industrial Robotics Through Large/Vision Language Models [paper][project]

3.2.4 Task Planning / Reasoning

Focuses on using VLA/LLM components for high-level task decomposition, planning, and reasoning, often bridging the gap between complex instructions and low-level actions.

  • [2026] HALO: A Unified Vision-Language-Action Model for Embodied Multimodal Chain-of-Thought Reasoning [paper]

  • [2026] Self-Correcting VLA: Online Action Refinement via Sparse World Imagination [paper]

  • [2026] LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies [paper]

  • [2025] MemER: Scaling Up Memory for Robot Control via Experience Retrieval [paper] [project]

  • [2025] Long-VLA: Unleashing Long-Horizon Capability of Vision Language Action Model for Robot Manipulation [paper] [project]

  • [2025] ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning [paper] [project]

  • [2025] Agentic Robot: A Brain-Inspired Framework for Vision-Language-Action Models in Embodied Agents [paper] [project]

  • [2025] Training Strategies for Efficient Embodied Reasoning (ECoT-Lite) [paper]

  • [2025] OneTwoVLA: A Unified Vision-Language-Action Model with Adaptive Reasoning [paper] [project]

  • [2025] Vision-Language-Action Model with Open-World Embodied Reasoning from Pretrained Knowledge [paper] [project]

  • [2025] Hume: Introducing System-2 Thinking in Visual-Language-Action Model [paper] [project]

  • [2025] Robotic Control via Embodied Chain-of-Thought Reasoning [paper] [project][code]

  • [2025] GR00T N1: An Open Foundation Model for Generalist Humanoid Robots [paper] [Code]

  • [2025] Gemini Robotics: Bringing AI into the Physical World [report]

  • [2025] GRAPE: Generalizing Robot Policy via Preference Alignment [paper]

  • [2025] HAMSTER: Hierarchical Action Models For Open-World Robot Manipulation [paper]

  • [2025] π0.5: A Vision-Language-Action Model with Open-World Generalization [paper] [project]

  • [2025] Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models [paper] [project]

  • [2025] CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models [paper] [project]

  • [2025] Embodied-R: Collaborative Framework for Activating Embodied Spatial Reasoning in Foundation Models via Reinforcement Learning [paper] [project]

  • [2025] RoboBrain: A Unified Brain Model for Robotic Manipulation from Abstract to Concrete [paper] [project] [code]

  • [2025] Probing a Vision-Language-Action Model for Symbolic States and Integration into a Cognitive Architecture [paper]

  • [2024] RoboMamba: Efficient Vision-Language-Action Model for Robotic Reasoning and Manipulation [paper] [project]

  • [2024] Improving Vision-Language-Action Models via Chain-of-Affordance [paper] [project]

  • [2023] PaLM-E: An Embodied Multimodal Language Model [paper] [project]

  • [2023] EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought [paper] [code]

  • [2022] LLM-Planner: Few-Shot Grounded Planning with Large Language Models [paper] [project]

  • [2022] Code as Policies: Language Model Programs for Embodied Control [paper] [project]

  • [2022] Inner Monologue: Embodied Reasoning through Planning with Language Models [paper] [project]

  • [2022] Do As I Can, Not As I Say: Grounding Language in Robotic Affordances (SayCan) [paper] [project] [code]

3.2.5 Humanoid

  • [2026] PhysiFlow: Physics-Aware Humanoid Whole-Body VLA via Multi-Brain Latent Flow Matching and Robust Tracking [paper]

  • [2026] Cognition to Control - Multi-Agent Learning for Human-Humanoid Collaborative Transport [paper]

  • [2025] GR00T N1: An Open Foundation Model for Generalist Humanoid Robots [paper] [Code]

  • [2025] Helix: A Vision-Language-Action Model for Generalist Humanoid Control [project]

  • [2025] Humanoid-VLA: Towards Universal Humanoid Control with Visual Integration [paper]

3.2.6 Other

  • [2025] Adversarial Attacks on Robotic Vision Language Action Models [paper][Code]

  • [2025] Vision-Language-Action Model with Open-World Embodied Reasoning from Pretrained Knowledge [paper][Project](ChatVLA-2)

  • [2025] ChatVLA: Unified Multimodal Understanding and Robot Control with Vision-Language-Action Model [paper][Project]

  • [2025] OpenDriveVLA: Towards End-to-end Autonomous Driving with Large Vision Language Action Model [paper][Project][Code]

  • [2024] OccLLaMA: An Occupancy-Language-Action Generative World Model for Autonomous Driving [paper]

  • [2024] DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models [paper][Project]

  • [2024] EMMA: End-to-End Multimodal Model for Autonomous Driving [paper][Code]

3.3 By Technical Approach

3.3.1 Model Architectures

Focuses on the core neural network architectures used in VLA models.

3.3.2 Action Representation & Generation

Focuses on how robot actions are represented (e.g., discrete tokens vs. continuous vectors) and how models generate them. This is a key area differentiating VLAs from VLMs.

  • Action Tokenization / Discretization: Representing continuous actions (e.g., joint angles, end-effector pose) as discrete tokens, often via binning. Used in many early Transformer-based VLAs such as RT-1 and RT-2 to fit the language-modeling paradigm. May have limitations in precision and high-frequency control.

  • Continuous Action Regression: Directly predicting continuous action vectors. Sometimes used in conjunction with other methods or implemented via specific heads. L1 regression is used in OpenVLA-OFT.

  • Diffusion Policies for Actions: Modeling action generation as a denoising diffusion process. Good at capturing multi-modality and continuous spaces. Applications:

  • Flow Matching: An alternative generative method for continuous actions, used in π0 for efficient, high-frequency (50Hz) trajectory generation.

  • Action Chunking: Predicting multiple future actions in a single step, for efficiency and temporal consistency. Increases action dimensionality and inference time when using AR decoding. Applications:

  • Better Decoding Strategy: Techniques to speed up autoregressive decoding of action chunks.

  • Specialized Tokenizers: Developing better ways to tokenize continuous action sequences. Applications:

    • FAST (designed for dexterous, high-frequency tasks).
    • KineVLA [paper]
  • Point-based Actions: Using VLMs to predict keypoints or goal locations rather than full trajectories. Applications:

  • Mid-Level Language Actions: Generating actions as natural language commands to be consumed by a lower-level policy. Applications:
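The uniform binning mentioned above can be illustrated with a tokenize/detokenize round trip. This is a sketch, not any specific model's implementation: the 256-bin count follows RT-style discretization, while the action bounds and rounding are assumptions.

```python
# Uniform action binning: map each continuous action dimension into one
# of N discrete bins (tokens), and back. RT-1/RT-2-style models use a
# scheme like this (commonly 256 bins); exact ranges vary per robot.

N_BINS = 256
LOW, HIGH = -1.0, 1.0  # assumed per-dimension action bounds

def tokenize(action):
    """Continuous action vector -> list of bin indices in [0, N_BINS-1]."""
    tokens = []
    for a in action:
        a = min(max(a, LOW), HIGH)  # clip to bounds
        idx = int((a - LOW) / (HIGH - LOW) * (N_BINS - 1) + 0.5)  # round to nearest bin
        tokens.append(idx)
    return tokens

def detokenize(tokens):
    """Bin indices -> bin-center continuous values (lossy inverse)."""
    return [LOW + t / (N_BINS - 1) * (HIGH - LOW) for t in tokens]

a = [0.03, -0.51, 0.999, 0.0]
recon = detokenize(tokenize(a))
# Quantization error is bounded by half a bin width -- the precision
# limitation noted above for discretized action spaces.
bin_width = (HIGH - LOW) / (N_BINS - 1)
assert all(abs(x - y) <= bin_width / 2 + 1e-9 for x, y in zip(a, recon))
```

The round trip makes the trade-off explicit: discretization lets actions share the language-model vocabulary, at the cost of a quantization error that grows as the bin count shrinks.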

3.3.3 Learning Paradigms

Focuses on how VLA models are trained and adapted.

  • Imitation Learning (IL) / Behavior Cloning (BC): Dominant paradigm, training VLAs to mimic expert demonstrations (often from teleoperation). Heavily reliant on large-scale, diverse, high-quality datasets. Performance is often limited by the quality of the demonstrations. Applications:

  • Reinforcement Learning (RL): Used to fine-tune VLAs or train components, allowing models to learn from interaction and potentially exceed demonstrator performance. Challenges include stability and sample efficiency with large models. Applications:

    • iRe-VLA (iterative RL/SFT), MoRE (RL objective for MoE VLAs handling mixed data), RPD (RL-based policy distillation), ConRFT (RL fine-tuning with consistency policies), SafeVLA (constrained RL for safety), RIPT-VLA, VLA-RL, SimpleVLA-RL.
    • Robot-R1
    • WoVR (World Models as Reliable Simulators for Post-Training VLA Policies with RL)
  • Pre-training & Fine-tuning: Standard approach, involving pre-training on large datasets (web data for VLM backbones, large robot datasets like OpenX for VLAs) and then fine-tuning on specific tasks or robots.

    • Fine-tuning by RL
      • [2025] ConRFT: A Reinforced Fine-tuning Method for VLA Models via Consistency Policy, RSS 2025 [Paper][Project]
  • Parameter-Efficient Fine-Tuning (PEFT): Techniques like LoRA to efficiently adapt large VLAs without retraining the entire model, crucial for practical deployment and customization. MoRE uses LoRA modules as experts.
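
A minimal numpy sketch of a LoRA layer (shapes and rank are illustrative): the frozen pretrained weight W gets a trainable low-rank correction B @ A, scaled by alpha / r and zero-initialized so fine-tuning starts at the pretrained behavior.

```python
import numpy as np

d_out, d_in, r, alpha = 512, 512, 8, 16     # hypothetical layer sizes
rng = np.random.default_rng(0)

W = rng.normal(size=(d_out, d_in))          # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01       # trainable down-projection
B = np.zeros((d_out, r))                    # trainable up-projection (zero init)

def lora_forward(x):
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
# With B = 0 the adapter is a no-op: the adapted model starts out
# identical to the pretrained one.
assert np.allclose(lora_forward(x), W @ x)

# Trainable parameters vs. full fine-tuning of this layer:
full, lora = W.size, A.size + B.size
assert lora / full < 0.04                   # ~3% of the parameters here
```

Only A and B are updated during fine-tuning, which is what makes adapting multi-billion-parameter VLAs tractable, and why MoRE can afford one LoRA module per expert.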

  • Distillation: Training smaller, faster models (students) to mimic the behavior of larger, slower models (teachers). Applications:

    • RPD (distilling a VLA to an RL policy), OneDP (distilling a diffusion policy).
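
The distillation objective can be sketched with linear stand-ins for teacher and student (the methods above distill large networks; this only shows the regression target): the student is fit to reproduce the frozen teacher's actions over a dataset of observations.

```python
import numpy as np

rng = np.random.default_rng(1)
obs = rng.normal(size=(200, 16))            # observation features
W_teacher = rng.normal(size=(16, 7))        # frozen "large" teacher (linear stand-in)
teacher_actions = obs @ W_teacher           # teacher labels the dataset

# Least-squares fit of a linear student to the teacher's outputs;
# a neural student would minimize the same MSE by gradient descent.
W_student, *_ = np.linalg.lstsq(obs, teacher_actions, rcond=None)

mse = np.mean((obs @ W_student - teacher_actions) ** 2)
assert mse < 1e-10                          # student matches teacher here
```

The student can be made much smaller or faster than the teacher, since it only has to match the teacher's input-output behavior on the relevant data distribution.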
  • Curriculum Learning: Structuring the learning process, e.g., by embodiment complexity. Applications:

    • DexVLA uses embodied curriculum.
  • Learning from Mixed-Quality Data: Using techniques (e.g., RL in MoRE) to learn effectively even when demonstration data is suboptimal or contains failures.

  • Reasoning-Augmented Training / Inference: Enhancing policy quality via explicit intermediate reasoning. Applications:

3.3.4 Input Modalities & Grounding

Focuses on input data types beyond standard RGB images and text used by VLAs, and how they ground these inputs.

  • Integrating Speech: Control via spoken commands, potentially capturing nuances missed by text. Requires handling the speech modality directly or via ASR. Applications:

  • Integrating 3D Vision: Using point clouds, voxels, depth maps, or implicit representations (NeRFs, 3DGS) to provide richer spatial understanding. Applications:

  • Integrating Proprioception / State: Incorporating the robot's own state (joint angles, velocities, end-effector pose) as input. Common in many policies, explicitly mentioned in VLAS, PaLM-E, π0 (evaluation requires Simpler fork with proprioception support). OpenVLA initially lacked this, noted as a limitation/future work.

  • Multimodal Prompts: Handling instructions that include images or video in addition to text. Applications:

  • Grounding: The process of linking language descriptions or visual perceptions to specific entities, locations, or actions in the physical world or robot representation. Addressed via various techniques like similarity matching, leveraging common-sense knowledge, multimodal alignment, or interaction. LLM-Grounder focuses on open-vocabulary 3D visual grounding.
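
The similarity-matching approach to grounding can be sketched with a toy embedding table standing in for a real CLIP-style encoder (the vocabulary and vectors are invented for illustration): embed the instruction and each candidate object label in a shared space, then pick the candidate with the highest cosine similarity.

```python
import numpy as np

# Toy embedding table; a real system would use a learned text/image encoder.
VOCAB = {"mug": [1.0, 0.1, 0.0], "apple": [0.0, 1.0, 0.2], "drill": [0.1, 0.0, 1.0]}

def embed(text):
    """Toy embedding: average the vectors of known words, L2-normalized."""
    vecs = [VOCAB[w] for w in text.lower().split() if w in VOCAB]
    v = np.mean(vecs, axis=0)
    return v / np.linalg.norm(v)

def ground(instruction, candidates):
    """Return the candidate whose embedding is closest to the instruction's."""
    q = embed(instruction)
    sims = {c: float(embed(c) @ q) for c in candidates}
    return max(sims, key=sims.get)

assert ground("pick up the mug", ["apple", "mug", "drill"]) == "mug"
assert ground("hand me the drill", ["apple", "mug", "drill"]) == "drill"
```

With real encoders the same cosine-similarity scheme supports open-vocabulary grounding, since any phrase and any detected object can be embedded into the shared space.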

3.3.5 Fine-tuning

4. Datasets and Benchmarks

This section lists key resources for training and evaluating VLA models. Large-scale, diverse datasets and standardized benchmarks are crucial for progress in the field.

4.1 Quick Glance at Datasets and Benchmarks

| Name | Type | Focus Area | Key Features / Environment | Link | Key Publication |
| --- | --- | --- | --- | --- | --- |
| Open X-Embodiment (OpenX) | Dataset | General Manipulation | Aggregates 20+ datasets, cross-embodiment/task/environment, >1M trajectories | Project | arXiv |
| DROID | Dataset | Real-world Manipulation | Large-scale human-collected data (500+ tasks, 26k hours) | Project | arXiv |
| CALVIN | Dataset / Benchmark | Long-Horizon Manipulation | Long-horizon tasks with language conditioning, Franka arm, PyBullet simulation | Project | arXiv |
| QUARD | Dataset | Quadruped Robot Tasks | Large-scale multi-task dataset (sim + real) for navigation and manipulation | Project | ECCV 2024 |
| BEHAVIOR-1K | Dataset / Benchmark | Household Activities | 1000 simulated human household activities | Project | arXiv |
| Isaac Sim / Orbit / OmniGibson | Simulator | High-fidelity Robot Simulation | NVIDIA Omniverse-based, physically realistic | Isaac-sim, Orbit, OmniGibson | - |
| Habitat Sim | Simulator | Embodied AI Navigation | Flexible, high-performance 3D simulator | Project | arXiv |
| MuJoCo | Simulator | Physics Engine | Popular physics engine for robotics and RL | Website | - |
| PyBullet | Simulator | Physics Engine | Open-source physics engine, used for CALVIN, etc. | Website | - |
| ManiSkill (1, 2, 3) | Benchmark | Generalizable Manipulation Skills | Large-scale manipulation benchmark based on SAPIEN | Project | arXiv |
| Meta-World | Benchmark | Multi-task / Meta RL Manipulation | 50 Sawyer arm manipulation tasks, MuJoCo | Project | arXiv |
| RLBench | Benchmark | Robot Learning Manipulation | 100+ manipulation tasks, CoppeliaSim (V-REP) | Project | arXiv |
| VLN-CE / R2R / RxR | Benchmark | Vision-Language Nav | Standard VLN benchmarks, often run in Habitat | VLN-CE, R2R-EnvDrop, RxR | - |

4.2 Robot Learning Datasets

Large-scale datasets of robot interaction trajectories, often with accompanying language instructions and visual observations. Crucial for training general-purpose policies via imitation learning.

  • [2026] HortiMulti: A Multi-Sensor Dataset for Localisation and Mapping in Horticultural Polytunnels [paper]

  • Open X-Embodiment (OpenX) [Project] - Open X-Embodiment Collaboration.

    DetailsA massive, standardized dataset aggregating data from 20+ existing robot datasets, spanning diverse embodiments, tasks, and environments. Used to train major VLAs like RT-X, Octo, OpenVLA, π0. Contains over 1 million trajectories.
  • BridgeData V2 [Project] - Walke, H., et al.

    DetailsLarge dataset collected on a WidowX robot, used for OpenVLA evaluation.
  • DROID [Project] - Manuelli, L., et al.

    DetailsLarge-scale, diverse, human-collected manipulation dataset (500+ tasks, 26k hours). Used to fine-tune/evaluate OpenVLA, π0.
  • RH20T [Project] - Shao, L., et al.

    DetailsComprehensive dataset with 110k robot clips, 110k human demonstrations, and 140+ tasks.
  • CALVIN (Composing Actions from Language and Vision) [Project] - Mees, O., et al.

    DetailsBenchmark and dataset for long-horizon language-conditioned manipulation with a simulated Franka arm in PyBullet.
  • QUARD (QUAdruped Robot Dataset) [Project] - Tang, J., et al.

    DetailsLarge-scale multi-task dataset (sim + real) for quadruped navigation and manipulation, released with QUAR-VLA. Reported sizes vary between sources: 348k sim + 3k real clips, or 246k sim + 3k real clips.
  • RoboNet [Project] - Dasari, S., et al.

    DetailsEarly large-scale dataset aggregating data from multiple robot platforms.
  • BEHAVIOR-1K [Project] - Srivastava, S., et al.

    DetailsDataset of 1000 simulated human household activities, useful for high-level task understanding.
  • SQA & CSI Datasets [arXiv] - Zhao, W., et al.

    DetailsCurated datasets with speech instructions, released with the VLAS model, for speech-vision-action alignment and fine-tuning.
  • LIBERO [Project] - Li, Z., et al.

    DetailsBenchmark suite for robot lifelong learning with procedurally generated tasks. Used in π0 fine-tuning examples.
  • D4RL (Datasets for Deep Data-Driven Reinforcement Learning) [Code] - Fu, J., et al.

    DetailsStandardized datasets for offline RL research, potentially useful for RL-based VLA methods.

4.3 Simulation Environments

Physics-based simulators used to train agents, generate synthetic data, and evaluate policies in controlled settings before real-world deployment.

  • NVIDIA Isaac Sim / Orbit / OmniGibson [Isaac-sim, Orbit, OmniGibson].

    DetailsHigh-fidelity, physically realistic simulators based on NVIDIA Omniverse. Used for QUAR-VLA, ReKep, ARNOLD, etc.
  • Habitat Sim [Project] - Facebook AI Research (Meta AI).

    DetailsFlexible, high-performance 3D simulator for Embodied AI research, especially navigation.
  • MuJoCo (Multi-Joint dynamics with Contact) [Project].

    DetailsPopular physics engine widely used for robot simulation and RL benchmarks (dm_control, robosuite, Meta-World, RoboHive).
  • PyBullet [Project].

    DetailsOpen-source physics engine, used for CALVIN and other benchmarks (panda-gym).
  • SAPIEN [Project].

    DetailsPhysics simulator focused on articulated objects and interaction. Used for the ManiSkill benchmark.
  • Gazebo [Project].

    DetailsWidely used open-source robot simulator, especially in the ROS ecosystem.
  • Webots [Project].

    DetailsOpen-source desktop robot simulator.
  • Genesis [GitHub].

    DetailsA newer platform aimed at general robot/Embodied AI simulation.
  • UniSim [arXiv] - Yang, G., et al.

    DetailsLearns interactive simulators from real-world videos.

4.4 Evaluation Benchmarks

Standardized suites of environments and tasks used to evaluate and compare the performance of VLA models and other robot learning algorithms.

  • [2026] NavTrust: Benchmarking Trustworthiness for Embodied Navigation [paper] [project] [code]

  • [2026] HUGE-Bench: A Benchmark for High-Level UAV Vision-Language-Action Tasks [paper]

  • [2026] Can LLMs Prove Robotic Path Planning Optimality? A Benchmark for Research-Level Algorithm Verification [paper]

  • CALVIN [Project].

    DetailsBenchmark for long-horizon language-conditioned manipulation.
  • ManiSkill (1, 2, 3) [Project]

    DetailsLarge-scale benchmark for generalizable manipulation skills, based on SAPIEN.
  • Meta-World [Project].

    DetailsMulti-task and meta-RL benchmark with 50 different manipulation tasks using a Sawyer arm in MuJoCo.
  • RLBench [Project].

    DetailsLarge-scale benchmark with 100+ manipulation tasks in CoppeliaSim (V-REP).
  • Franka Kitchen [GitHub].

    Detailsdm_control-based benchmark involving kitchen tasks with a Franka arm. Used in iRe-VLA.
  • LIBERO [Project].

    DetailsBenchmark for lifelong/continual learning in robot manipulation.
  • VIMA-Bench [Project].

    DetailsMultimodal few-shot prompting benchmark for robot manipulation.
  • BEHAVIOR-1K [Project].

    DetailsBenchmark focused on long-horizon household activities.
  • VLN-CE / R2R / RxR [VLN-CE, R2R-EnvDrop, RxR].

    DetailsStandard benchmarks for Vision-Language Navigation, often run in Habitat. NaVILA is evaluated on these.
  • Safety-CHORES [paper].

    DetailsA new simulated benchmark with safety constraints, proposed for evaluating safe VLA learning.
  • OK-VQA [Project].

    DetailsVisual question answering benchmark requiring external knowledge, used to evaluate the general VLM abilities of [PaLM-E](https://arxiv.org/abs/2303.03378).

5. Challenges and Future Directions

  • Data Efficiency & Scalability: Reducing reliance on massive, expensive, expert-driven datasets. Improving the ability to learn from limited, mixed-quality, or internet-sourced data. Efficiently scaling models and training processes.

    • Future directions: Improved sample efficiency (RL, self-supervision), sim-to-real transfer, automated data generation, efficient architectures (SSMs, MoEs), data filtering/weighting.
  • Inference Speed & Real-Time Control: Current large VLAs may be too slow for the high-frequency control loops needed for dynamic tasks or dexterous manipulation.

    • Future directions: Smaller/compact models (TinyVLA), efficient architectures (RoboMamba), parallel decoding (PD-VLA), action chunking optimization (FAST), model distillation (OneDP, RPD), hardware acceleration.
  • Robustness & Reliability: Ensuring consistent performance across variations in environment, lighting, object appearance, disturbances, and unexpected events. Current models can be brittle.

    • Future directions: Adversarial training, improved grounding, better 3D understanding, closed-loop feedback, anomaly detection, incorporating physical priors, testing frameworks (VLATest).
  • Generalization: Improving the ability to generalize to new tasks, objects, instructions, environments, and embodiments beyond the training distribution. This is a core promise of VLAs, but remains a challenge.

    • Future directions: Training on more diverse data (OpenX), effective utilization of VLM pre-training knowledge, compositional reasoning, continual/lifelong learning, better action representations.
  • Safety & Alignment: Explicitly incorporating safety constraints to prevent harm to the robot, the environment, or humans. Ensuring alignment with user intent. Crucial for real-world deployment.

    • Future directions: Constrained reinforcement learning (SafeVLA), formal verification, human oversight mechanisms, robust failure detection/recovery, ethical considerations.
  • Dexterity & Contact-Rich Tasks: Improving performance on tasks requiring fine motor skills, precise force control, and handling complex object interactions. Current VLAs often lag behind specialized methods in this area.

    • Future directions: Better action representations (FAST, Diffusion), integration of tactile sensing, improved physical understanding/simulation, hybrid control approaches.
  • Reasoning & Long-Horizon Planning: Enhancing the ability for multi-step reasoning, long-horizon planning, and handling complex instructions.

    • Future directions: Hierarchical architectures, explicit planning modules, chain-of-thought reasoning (visual/textual), memory mechanisms, world models.
  • Multimodality Expansion: Integrating richer sensory inputs beyond vision + language, such as audio/speech, touch, force, 3D.

    • Future directions: Developing architectures and alignment techniques for diverse modalities.

6. Related Awesome Lists

Citation

If you find this repository useful, please consider citing this list:

@misc{liu2025vlaroboticspaperslist,
    title = {Awesome-VLA-Robotics},
    author = {Jiaqi Liu},
    howpublished = {GitHub repository},
    url = {https://github.com/Jiaaqiliu/Awesome-VLA-Robotics},
    year = {2025},
}

Star History

Star History Chart
