📝 Paper: Dexora: Open-source VLA for High-DoF Bimanual Dexterity (ICRA 2026 submission; see ICRA26_0209_FI.pdf)

🌐 Project page: https://dexoravla.github.io

🤗 Dataset: Dexora/Dexora_Real-World_Dataset (12.2 K teleoperated episodes / 2.92 M frames / 40.5 h, LeRobot v2.1 standard)

🤖 Hardware: 2 × 6-DoF AIRBOT arms + 2 × 12-DoF XHAND (36 controlled DoF; +3 fixed head/spine dims for SDK compatibility)

Dexora is the first open-source Vision-Language-Action (VLA) system that natively targets dual-arm, dual-hand, high-DoF dexterous manipulation. This repository releases the full training, inference and data-processing code used in the paper. Large data and pretrained weights are released on the project page (and through the HuggingFace dataset above).
The system is built around three contributions:
- Hybrid teleoperation — gross arm kinematics from a custom exoskeleton backpack + fine finger motion from markerless Apple Vision Pro tracking, driving both the physical platform and a MuJoCo digital twin (§III-A).
- Embodiment-matched corpus — 100 K simulated trajectories (§III-B) and 12.2 K real-world teleoperated episodes (released on HuggingFace), all in the LeRobot v2.1 standard.
- Discriminator-guided quality-aware training — an offline discriminator (PU loss, §III-C, Eq. 7) scores each demonstration clip; the Diffusion Transformer policy is post-trained with a weighted loss (§III-D, Eq. 8) that down-weights low-quality demonstrations.
Dexora-VLA/
├── configs/ # YAML / JSON configuration
│ ├── base_400m.yaml # 400M paper spec (28 / 1024 / 16)
│ ├── base.yaml # 1B variant (legacy)
│ ├── scoring.yaml # 30M discriminator
│ ├── cross_embodiment/ # EC-1 / EC-2 / EC-3 fine-tune configs
│ ├── zero2.json # DeepSpeed stage-2 (optional)
│ ├── dataset_control_freq.json # per-dataset control freq (paper: 20 Hz)
│ ├── finetune_datasets.json # dataset names visible to the loader
│ └── ...
├── models/
│ ├── rdt/ # Diffusion-Transformer backbone blocks
│ ├── rdt_runner.py # Stage-1/3 policy (Eq. 8 weighted MSE)
│ ├── scoring_model.py # Stage-2 30M discriminator (Eq. 7)
│ ├── sample_weighting.py # DWBC score → weight + Eq. 8 helper
│ ├── ema_model.py / hub_mixin.py
│ └── multimodal_encoder/ # SigLIP + T5 thin wrappers
├── train/
│ ├── train.py / main.py # Stage-1 pretrain
│ ├── train_scoring.py / main_scoring.py # Stage-2 discriminator (PU)
│ ├── train_posttrain.py / main_posttrain.py # Stage-3 quality-aware post-train
│ ├── dataset.py # VLAConsumerDataset
│ ├── sample.py / image_corrupt.py # eval sampler + light augs
├── data/ # Dexora dataset adapters
│ ├── lerobot_vla_dataset.py # LeRobot v2.1 (HF release)
│ ├── lerobot_vla_dataset_with_logpi.py # + per-chunk log-π attach
│ ├── bson_vla_dataset.py / *_new.py / *_with_logpi.py # legacy in-house BSON
│ ├── hdf5_vla_dataset.py # legacy HDF5 (RDT-era)
│ └── filelock.py
├── scripts/ # Pipeline + eval scripts
│ ├── analyze_episode_quality.py # §III-C Eq. (1)-(3) pre-screening → Spre
│ ├── replay_validate.py # Spre → Shigh post-validation
│ ├── compute_logpi.py # §III-C Eq. (4)-(5) log-π proxy
│ ├── eval_smoothness.py # Tab. III Acc. / Jerk metrics
│ ├── eval_action_curves.py # Fig. 11 per-joint curves
│ ├── encode_lang(_batch).py # Optional T5 language pre-encoding
│ └── run_eval_example.sh
├── dataprocess/ # BSON → LeRobot v2.1 conversion
│ ├── airbot.py / airbot_config.py # AIRBOT BSON reader + config
│ ├── airbot_lerobot.py # BSON → LeRobot v2.1 converter
│ ├── lerobot_split_merge_prcessor-main/ # LeRobot dataset surgery
│ ├── code/ # embodiment configs (aloha, realman)
│ └── README.md
├── teleop/ # Real-robot data collection + Vision-Pro teleop
│ ├── scripts/ # record_delete.py / replay.py launchers
│ ├── imitate_all/ # robot + 4-camera recorder (Imitate-All subset)
│ ├── teleop_pkg/ # Vision Pro → XHand teleop side
│ ├── data_tools/ # BSON ⇄ JSON, consistency checks
│ ├── video_tools/ # 2×2 review video generator
│ ├── camera_tools/ # USB / RealSense camera bring-up
│ └── README.md
├── deploy/ # Real-robot inference (ZMQ split: policy / arms / hands)
│ ├── dexora_inference_zmq.py # policy host (env: dexora, GPU)
│ ├── dexora_policy.py # RDTRunner + SigLIP + T5 runtime wrapper
│ ├── mmk_forwarder.py # arms forwarder (env: imitall)
│ ├── xhand_forwarder.py # hands forwarder (env: xhand_tele_env)
│ ├── mmk_xhand_config.yaml # shared runtime config
│ ├── mmk2_kdl_py-0.1.4/ # mmk2 KDL kinematics lib
│ ├── inference.sh # 3-process launcher
│ └── README.md
├── tests/ # CPU-only pytest suite
├── google/ # SigLIP / T5 download targets (see google/README.md)
├── new_lerobot_stats/ # Per-dim min/max stats (see new_lerobot_stats/README.md)
├── s1_pretrain.sh # Stage 1 launcher
├── s2a_analyze_jerk.sh # Stage 2a launcher
├── s2b_replay.sh # Stage 2b launcher
├── s2c_compute_logpi.sh # Stage 2c-1 launcher
├── s2c_train_scoring.sh # Stage 2c-2 launcher
├── s3_post_train.sh # Stage 3 launcher
├── run_all_stages.sh # End-to-end pipeline
├── pyproject.toml + requirements{,-dev}.txt
├── ICRA26_0209_FI.pdf # The paper
└── LICENSE + CITATION.cff + CONTRIBUTING.md + CODE_OF_CONDUCT.md
# 1. Conda env (Python 3.10 is required)
conda create -n dexora python=3.10 -y
conda activate dexora
# 2. PyTorch (pick your own CUDA from pytorch.org; 12.1 example here)
pip install torch==2.1.0 torchvision==0.16.0 \
--index-url https://download.pytorch.org/whl/cu121
# 3. The rest (see ``requirements.txt`` for the canonical pin list)
pip install -r requirements.txt
# 4. Editable install (registers ``configs`` / ``data`` / ``models`` / ``train``
# as importable packages and adds ``dexora-train`` console scripts).
pip install -e .
# 5. (Optional, dev only) lint + tests
pip install -r requirements-dev.txt
pre-commit install
pytest tests/ -q # 57 CPU-only tests, ~5 s
# 6. (Optional) flash-attn. The attention path falls back to PyTorch SDPA
# if this is absent, so this is purely a speed knob.
# pip install flash-attn --no-build-isolation

We pin transformers<5, huggingface_hub<0.26, diffusers<0.32,
accelerate<1.0, lerobot<0.4 and numpy<2.0. These pins are required: newer
versions break the is_offline_mode / LeRobot-v2.1 / imgaug interfaces
that the training stack depends on.
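The pin list above can be checked against an existing environment before training. The snippet below is a minimal sketch, not part of the repository: the helper names (`parse`, `violates`, `check_pins`) are hypothetical, and only the upper bounds are taken from the text above.

```python
from importlib.metadata import PackageNotFoundError, version
from itertools import takewhile

# Exclusive upper bounds copied from the pin list above.
PINS = {
    "transformers": (5,),
    "huggingface_hub": (0, 26),
    "diffusers": (0, 32),
    "accelerate": (1, 0),
    "lerobot": (0, 4),
    "numpy": (2, 0),
}


def parse(ver):
    """Turn '0.25.2' or '2.1.0+cu121' into a tuple of leading integers."""
    parts = []
    for piece in ver.split("+")[0].split("."):
        digits = "".join(takewhile(str.isdigit, piece))
        if not digits:
            break  # stop at suffixes such as 'dev0'
        parts.append(int(digits))
    return tuple(parts)


def violates(installed, upper):
    """True if the installed version is at or above the exclusive upper bound."""
    return parse(installed) >= upper


def check_pins(pins=PINS):
    """Return human-readable messages for installed packages that break a pin."""
    problems = []
    for name, upper in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            continue  # not installed yet; nothing to check
        if violates(installed, upper):
            bound = ".".join(str(p) for p in upper)
            problems.append(f"{name}=={installed} (need <{bound})")
    return problems


if __name__ == "__main__":
    for line in check_pins():
        print("pin violated:", line)
```

An empty result means no installed package exceeds a pin; it does not confirm that every package is installed.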
The Dexora real-world dataset is hosted on HuggingFace in the LeRobot v2.1 standard:
huggingface-cli download Dexora/Dexora_Real-World_Dataset \
--repo-type dataset \
--local-dir data/Dexora_Real-World_Dataset

Total ≈ 240 GB. The four task families
(airbot_pick_and_place / airbot_assemble / airbot_articulation / airbot_dexterous)
are released as separate LeRobot v2.1 datasets, so you can pick one to start
with. Each subdirectory has the standard layout:
data/Dexora_Real-World_Dataset/
└── airbot_pick_and_place/
├── data/ chunk-000/episode_000000.parquet ...
├── videos/ chunk-000/observation.images.{top,wrist_left,wrist_right,front}/episode_000000.mp4
└── meta/ info.json episodes.jsonl tasks.jsonl modality.json stats.json ...
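For scripts that need to locate the files of a single episode, the layout above can be turned into paths as sketched here. The camera names come from the tree above; the chunk size of 1000 episodes is an assumption (check `chunks_size` in `meta/info.json`), and `episode_files` is a hypothetical helper, not a function from the repository.

```python
from pathlib import Path

CAMERAS = ("top", "wrist_left", "wrist_right", "front")
CHUNK_SIZE = 1000  # assumption: verify against meta/info.json


def episode_files(root, episode_index, chunk_size=CHUNK_SIZE):
    """Return (parquet_path, {camera: video_path}) for one episode."""
    root = Path(root)
    chunk = f"chunk-{episode_index // chunk_size:03d}"
    name = f"episode_{episode_index:06d}"
    parquet = root / "data" / chunk / f"{name}.parquet"
    videos = {
        cam: root / "videos" / chunk / f"observation.images.{cam}" / f"{name}.mp4"
        for cam in CAMERAS
    }
    return parquet, videos


if __name__ == "__main__":
    pq, vids = episode_files(
        "data/Dexora_Real-World_Dataset/airbot_pick_and_place", 0
    )
    print(pq)
    print(vids["top"])
```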
State / action dimension. The HF release stores 39-D state and action vectors. The last 3 dims (head_joint_1, head_joint_2, spine_joint) are fixed values required by the AIRBOT SDK but are not modelled by the Dexora policy. The training loaders slice to the first 36 dims by default ([left_arm(6) | right_arm(6) | left_hand(12) | right_hand(12)]), matching paper §III-A. Set --state_dim_keep 0 to keep the full 39 dims.
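The slicing described above can be illustrated on a single vector. This is a minimal sketch under the stated layout; `split_state` is a hypothetical helper, and treating `keep=0` as "keep all 39 dims" mirrors the `--state_dim_keep 0` description rather than the loader's actual code.

```python
# Segment order and widths as given in the text above.
LAYOUT = (("left_arm", 6), ("right_arm", 6), ("left_hand", 12), ("right_hand", 12))
FIXED_TAIL = ("head_joint_1", "head_joint_2", "spine_joint")


def split_state(vec, keep=36):
    """Slice a 39-D vector to its first `keep` dims and name the 36 policy dims.

    keep=0 keeps the full vector, mirroring --state_dim_keep 0.
    """
    if len(vec) != 36 + len(FIXED_TAIL):
        raise ValueError(f"expected a 39-D vector, got {len(vec)}")
    kept = list(vec[:keep]) if keep else list(vec)
    parts, start = {}, 0
    for name, width in LAYOUT:
        parts[name] = kept[start:start + width]
        start += width
    return kept, parts


if __name__ == "__main__":
    kept, parts = split_state(list(range(39)))
    print(len(kept), {k: len(v) for k, v in parts.items()})
```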
| Asset | Size | Default path |
|---|---|---|
| SigLIP-SO400M (vision) | ~3.7 GB | google/siglip-so400m-patch14-384/ |
| T5-v1.1-XXL (language) | ~44 GB | google/t5-v1_1-xxl/ |
huggingface-cli download google/siglip-so400m-patch14-384 \
--local-dir google/siglip-so400m-patch14-384 --local-dir-use-symlinks False
huggingface-cli download google/t5-v1_1-xxl \
--local-dir google/t5-v1_1-xxl --local-dir-use-symlinks False

See google/README.md for symlink options if these
encoders already exist on your machine.
dataset_statistics.json is not in the HF release because it depends on
which subset you train on. Every shell launcher below auto-generates it
once if missing; or you can pre-compute it explicitly:
python -m data.lerobot_vla_dataset --stat \
--num_samples 5000 \
--repo_dir data/Dexora_Real-World_Dataset/airbot_pick_and_place \
--output_dir new_lerobot_stats

This writes a 36-D new_lerobot_stats/dataset_statistics.json plus
state_distributions.png / action_distributions.png for a quick sanity
check. See new_lerobot_stats/README.md.
Every stage is launched by a single shell script that reads its inputs from env vars (with sensible defaults). The minimal invocation for a fresh user who just downloaded the dataset:
# All paths can be overridden via env vars; defaults shown below match the
# repo's directory layout.
export DEXORA_LEROBOT_ROOT=data/Dexora_Real-World_Dataset/airbot_pick_and_place
export DEXORA_T5=google/t5-v1_1-xxl
export DEXORA_SIGLIP=google/siglip-so400m-patch14-384
export DEXORA_STATS=new_lerobot_stats/dataset_statistics.json

Stage 1 trains the Diffusion Transformer policy for 100 K steps on the real corpus
(or point DEXORA_LEROBOT_ROOT at your sim corpus to reproduce the
paper's sim-pretrain).
NUM_GPUS=8 MAX_TRAIN_STEPS=100000 \
OUTPUT_DIR=checkpoints/dexora-400m-pretrain \
bash s1_pretrain.sh

→ Writes checkpoints/dexora-400m-pretrain/checkpoint-*/{pytorch_model.bin,config.json,ema/}.
Stage 2a computes per-episode acceleration Aep (Eq. 2) and jerk Jep (Eq. 3)
under per-dim min-max normalization, then keeps
Spre = Low-20%(Aep) ∩ Low-20%(Jep) (≈ 18 % of episodes per paper).
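The pre-screening rule can be sketched in plain Python. The exact forms of Eq. 2 and Eq. 3 are not reproduced in this README, so the sketch assumes mean squared second and third finite differences of the normalized trajectory; all function names are hypothetical and the real logic lives in scripts/analyze_episode_quality.py.

```python
def minmax(traj, lo, hi):
    """Scale each dim of a [T][D] trajectory to [0, 1] with per-dim min/max."""
    return [
        [(x - l) / (h - l) if h > l else 0.0 for x, l, h in zip(row, lo, hi)]
        for row in traj
    ]


def diff(traj):
    """First finite difference along time."""
    return [[b - a for a, b in zip(r0, r1)] for r0, r1 in zip(traj, traj[1:])]


def mean_sq(traj):
    vals = [x * x for row in traj for x in row]
    return sum(vals) / len(vals) if vals else 0.0


def episode_scores(traj, lo, hi):
    """Return (A_ep, J_ep) under the finite-difference assumption above."""
    acc = diff(diff(minmax(traj, lo, hi)))
    jerk = diff(acc)
    return mean_sq(acc), mean_sq(jerk)


def lowest_fraction(scores, frac=0.2):
    """Episode ids whose score is in the lowest `frac` of all episodes."""
    k = max(1, int(len(scores) * frac))
    return set(sorted(scores, key=scores.get)[:k])


def select_spre(acc, jerk, frac=0.2):
    """S_pre = Low-20%(A_ep) ∩ Low-20%(J_ep)."""
    return lowest_fraction(acc, frac) & lowest_fraction(jerk, frac)
```

Because the two low-20% sets are intersected, the kept fraction can never exceed 20 %, which is consistent with the ≈ 18 % quoted above.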
SPRE_DIR=runs/spre bash s2a_analyze_jerk.sh
# → runs/spre/complete_analysis_results.json

Stage 2b replays each Spre episode open-loop in the MuJoCo digital twin and keeps
the episodes that complete the task without collisions.
SPRE_DIR=runs/spre SHIGH_FILE=runs/shigh.json \
REPLAY_VERIFIER=trust_spre \
bash s2b_replay.sh

The bundled --verifier trust_spre is a stub for smoke testing: it
accepts every Spre episode. Switch to --verifier energy for a cheap
kinematic heuristic, or to --verifier mujoco --twin_module path.to.your.replay
for the real MuJoCo replay. The plug-in module must expose
replay(states, actions, task_id) -> {"success": bool, "collision_free": bool}.
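A plug-in module satisfying the interface above might start from a skeleton like the following. Only the function signature and the two result keys come from the text; the argument types (nested lists of floats) and the placeholder check are assumptions, and a real module would step the MuJoCo twin where the comment indicates.

```python
import math


def replay(states, actions, task_id):
    """Skeleton for a --twin_module plug-in (placeholder logic only).

    states, actions: per-step sequences of floats (assumed shape [T][D]).
    task_id: identifier of the task being replayed.
    """
    if len(states) != len(actions):
        raise ValueError("states and actions must have the same length")

    # A real implementation would reset the digital twin to states[0],
    # apply each action, and query contacts and the task success condition.
    finite = all(math.isfinite(x) for row in actions for x in row)

    return {
        "success": bool(finite and len(actions) > 0),
        "collision_free": bool(finite),
    }


if __name__ == "__main__":
    print(replay([[0.0] * 36], [[0.0] * 36], task_id="demo"))
```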
STAGE1_CKPT=checkpoints/dexora-400m-pretrain \
LOGPI_FILE=runs/logpi/logpi.json \
bash s2c_compute_logpi.sh
# → runs/logpi/logpi.json (\hat{log π} proxy per chunk)
# → runs/logpi/logpi_raw_E.json (raw energies E_t)

The discriminator (models/scoring_model.py) ingests the scalar \hat{log π}_t
through a small sinusoidal positional-style encoding (8 freq bands + raw)
before the linear projection. Because the raw value is kept as a feature, this
is at least as expressive as Linear(1 → hidden_size), and it is more
numerically robust under bf16 when the z-scored proxy sits near zero.
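The scalar encoding can be sketched as below. The README only states "8 freq bands + raw"; the geometric frequency ladder (powers of two) and the use of both sine and cosine per band, giving 17 features, are assumptions of this sketch, and `encode_scalar` is a hypothetical name.

```python
import math


def encode_scalar(x, num_bands=8):
    """Map a scalar to [x, sin(f_0 x), cos(f_0 x), ..., sin(f_7 x), cos(f_7 x)]."""
    feats = [x]  # the raw value is kept as the first feature
    for k in range(num_bands):
        freq = 2.0 ** k  # assumption: frequencies 1, 2, 4, ..., 128
        feats.append(math.sin(freq * x))
        feats.append(math.cos(freq * x))
    return feats


if __name__ == "__main__":
    feats = encode_scalar(0.0)
    print(len(feats), feats[:3])
```

Near x = 0 the cosine features stay close to 1, so the projection still receives inputs of usable magnitude even when the raw scalar is tiny.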
OUTPUT_DIR=checkpoints/dexora-scoring \
LOGPI_FILE=runs/logpi/logpi.json \
SPRE_FILE=runs/spre/complete_analysis_results.json \
SHIGH_FILE=runs/shigh.json \
bash s2c_train_scoring.sh

→ Writes checkpoints/dexora-scoring/{checkpoint-*,final_model}/pytorch_model.bin.
Stage 3 loads the Stage-1 policy and the frozen Stage-2 discriminator, then fine-tunes the policy on the real corpus with the weighted MSE loss of Eq. 8 (§III-D),
where w_i = DWBC(d(ξ_i)) is the per-demonstration weight computed online from the discriminator score
via the DWBC mapping (with a short linear warm-up).
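The shape of a per-sample weighted noise-prediction loss with a linear warm-up can be sketched as follows. This is not the repository's Eq. 8 implementation: the DWBC score-to-weight mapping (models/sample_weighting.py) is not reproduced, so the weights are taken as given, and the warm-up from uniform weight 1.0 and the batch-mean normalization are assumptions.

```python
def warmup_weight(w, step, warmup_steps):
    """Blend linearly from a uniform weight of 1.0 to the target weight w."""
    if warmup_steps <= 0:
        return w
    alpha = min(1.0, step / warmup_steps)
    return (1.0 - alpha) * 1.0 + alpha * w


def weighted_mse(eps_pred, eps_true, weights):
    """Batch mean of w_i * mean((eps_pred_i - eps_true_i)^2)."""
    total = 0.0
    for pred, true, w in zip(eps_pred, eps_true, weights):
        sq_err = [(p - t) ** 2 for p, t in zip(pred, true)]
        total += w * sum(sq_err) / len(sq_err)
    return total / len(weights)


if __name__ == "__main__":
    raw_weights = [1.0, 0.2]  # e.g. one clean and one low-quality clip
    step, warmup_steps = 50, 100
    w = [warmup_weight(x, step, warmup_steps) for x in raw_weights]
    print(weighted_mse([[0.1, 0.0], [1.0, 1.0]], [[0.0, 0.0], [0.0, 0.0]], w))
```

With all weights equal to 1.0 the function reduces to the plain MSE, which is what the `--no_quality_weights` baseline is described as using.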
STAGE1_CKPT=checkpoints/dexora-400m-pretrain \
SCORING_CKPT=checkpoints/dexora-scoring/final_model/pytorch_model.bin \
OUTPUT_DIR=checkpoints/dexora-400m-posttrain \
bash s3_post_train.sh

The vanilla baseline (Tab. III "w/o discriminator" row) is reproduced by
adding EXTRA_FLAGS="--no_quality_weights".
RUN_DIR=./runs/dexora-paper-rep \
DEXORA_LEROBOT_ROOT=data/Dexora_Real-World_Dataset/airbot_pick_and_place \
bash run_all_stages.sh
# Chain stages with START_STAGE / END_STAGE, e.g.
# START_STAGE=4 END_STAGE=6 RUN_DIR=./runs/... bash run_all_stages.sh

deploy/ runs a trained Dexora policy on the physical robot. The integration
is split into three single-purpose processes that talk over loopback ZMQ, so
the conflicting Python environments for the policy (GPU + torch), the
arms SDK (airbot_py) and the hands SDK (xhand_tele_ops, Python 3.8) can
coexist without dependency hell:
+-----------------------------+ ZMQ tcp://*:5556 +------------------------+
| dexora_inference_zmq.py | <------------------> | mmk_forwarder.py |
| (env: dexora, GPU) | arms, 12-D radians | (env: imitall, 3.10) |
| | ZMQ tcp://*:5557 +------------------------+
| | <------------------> | xhand_forwarder.py |
| | hands, 2×12-D rad | (env: xhand_tele_env) |
+-----------------------------+ +------------------------+
deploy/dexora_policy.py wraps RDTRunner.from_pretrained(...) plus
SigLIP-SO400M and T5-XXL into a single policy.get_action(obs) -> [L, 36]
call. The inference loop follows the paper's chunk-and-replay scheme: every
chunk_size (= L) control ticks we sample a length-L action sequence then
play it back with action_buffer[t % L].
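The chunk-and-replay loop can be sketched as follows. The callables `get_obs`, `get_action` and `send` are hypothetical stand-ins for the camera/state reader, `policy.get_action(obs)` and the ZMQ forwarders; only the `action_buffer[t % L]` indexing is taken from the text.

```python
def run_chunked(get_action, get_obs, send, num_ticks, chunk_size):
    """Sample a length-L action chunk every L ticks and play it back step by step."""
    action_buffer = []
    for t in range(num_ticks):
        if t % chunk_size == 0:
            # One policy call per chunk; expected shape [L, 36].
            action_buffer = get_action(get_obs())
            if len(action_buffer) != chunk_size:
                raise ValueError("policy returned a chunk of unexpected length")
        send(action_buffer[t % chunk_size])


if __name__ == "__main__":
    calls = []

    def fake_policy(obs):
        calls.append(obs)
        return [[float(len(calls))] * 36 for _ in range(4)]

    sent = []
    run_chunked(fake_policy, lambda: "obs", sent.append, num_ticks=10, chunk_size=4)
    print(len(calls), len(sent))
```

The policy runs once per chunk, so the per-tick cost between chunk boundaries is a buffer lookup; the trade-off is that the robot acts open-loop within each chunk.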
# Terminal A — XHand forwarder (env: xhand_tele_env, Python 3.8)
conda activate xhand_tele_env
python deploy/xhand_forwarder.py --config deploy/mmk_xhand_config.yaml
# Terminal B — MMK forwarder (env: imitall, Python 3.10)
conda activate imitall
python deploy/mmk_forwarder.py --config deploy/mmk_xhand_config.yaml
# Terminal C — Dexora policy (env: dexora, GPU)
conda activate dexora
python deploy/dexora_inference_zmq.py \
--model-path checkpoints/dexora-400m-posttrain \
--config-path deploy/mmk_xhand_config.yaml \
--task-description "Pick the apple and put it on the plate." \
--save-logs --monitor-interval 1

# Alternatively, start all three processes with the bundled launcher:
TASK_DESCRIPTION="Pick the apple and put it on the plate." \
MODEL_PATH=checkpoints/dexora-400m-posttrain \
bash deploy/inference.sh

The wire protocol, joint limits, RealSense fallback and the full troubleshooting
checklist are in deploy/README.md.
Training (Stage-1 / Stage-3) uses a 1000-step DDPM forward process with a
cosine squaredcos_cap_v2 beta schedule, predicting the action noise
\hat{ε}_θ (paper §III-C). At inference we swap DDPM for DPMSolver++
and run only num_inference_timesteps = 5 solver steps — the setting used
to produce every number in Tab. I / II / III. Increasing it to 10–20
marginally improves smoothness on the dexterous tasks (Tab. III Acc / Jerk)
at a proportional latency cost.
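For readers who want to see what the training-time noise schedule looks like, the sketch below computes a 1000-step cosine schedule in the form commonly associated with the `squaredcos_cap_v2` name (betas derived from a squared-cosine cumulative alpha and capped at 0.999), plus the forward noising step. It is a standalone illustration with hypothetical helper names, not the repository's scheduler, and it does not cover the DPMSolver++ sampler.

```python
import math


def alpha_bar(u):
    """Cumulative signal level at normalized time u in [0, 1]."""
    return math.cos((u + 0.008) / 1.008 * math.pi / 2) ** 2


def cosine_betas(num_steps=1000, max_beta=0.999):
    """Per-step betas from ratios of alpha_bar, capped at max_beta."""
    betas = []
    for i in range(num_steps):
        t1, t2 = i / num_steps, (i + 1) / num_steps
        betas.append(min(1.0 - alpha_bar(t2) / alpha_bar(t1), max_beta))
    return betas


def cumulative_alphas(betas):
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out


def add_noise(x0, eps, t, alphas_cumprod):
    """Forward process: x_t = sqrt(a_t) * x0 + sqrt(1 - a_t) * eps."""
    a = alphas_cumprod[t]
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]


if __name__ == "__main__":
    betas = cosine_betas()
    acp = cumulative_alphas(betas)
    print(betas[0], betas[-1], acp[0], acp[-1])
```

With epsilon prediction, the network is trained to recover `eps` from the output of `add_noise` at a randomly drawn step `t`.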
Backward compatibility. Earlier Dexora checkpoints were saved with
prediction_type=sample. RDTRunner.compute_loss and scripts/compute_logpi.py both still handle the sample branch even though new training defaults to epsilon.
The on-robot recording stack (paper §III-A) lives in
teleop/. It is the same kit we used to capture the
released Dexora_Real-World_Dataset, with paths anchored at
PROJECT_ROOT so it ports cleanly to a new robot.
- teleop/scripts/record_delete.py: top-level orchestrator that forks the robot recorder and the Vision-Pro teleop simultaneously, then archives the episode under a configurable root (ARCHIVE_ROOT constant at the top of the script).
- teleop/imitate_all/record_4_rgb_cam.py: robot + 4-camera recorder (4× USB / RealSense → BSON), lifted from airbot Imitate-All.
- teleop/teleop_pkg/receive_from_vision_pro.py: pulls the Apple Vision Pro hand skeleton, retargets it to the 12-DoF XHand joints, drives the hands and logs xhand_control_data.bson.
- teleop/scripts/replay.py: synchronized playback of a recorded episode on both arms + hands.
- teleop/data_tools/, teleop/video_tools/, teleop/camera_tools/: episode consistency checks, 2×2 review-video generator, USB-camera bring-up.
Two conda envs are required (the same ones the deploy/ stack uses):
imitall (Python 3.10, AIRBOT SDK) for the robot side and
xhand_tele_env (Python 3.8, xhand_tele_ops) for the Vision-Pro hand side.
See teleop/README.md for the full setup (udev rules
for the four USB cameras, Vision-Pro IP configuration, secrets layout), and
dataprocess/airbot_lerobot.py for the BSON → LeRobot v2.1 conversion that
turns a freshly recorded session into the exact layout consumed by
data/lerobot_vla_dataset.py and s1_pretrain.sh.
| Table / Figure | How to run | Knob |
|---|---|---|
| Tab. I: Basic tasks (12) | Stage-1 → Stage-3 on each task; 20 rollouts | default |
| Tab. II: Dexterous tasks (6) | Same, on the 6 dexterous tasks | default |
| Tab. III: Discriminator ablation | Run Stage-3 with and without the discriminator | EXTRA_FLAGS="--no_quality_weights" |
| Fig. 10: Data composition | Stage-3 with sim-only / sim+50% real / sim+all real | REAL_DATA_FRACTION={0.0, 0.5, 1.0} |
| Fig. 9, Tab. II EC rows: Cross-embodiment | Stage-3 ckpt + fine-tune under each EC config | CONFIG_PATH=configs/cross_embodiment/{ec1_franka,ec2_aloha,ec3_g1_inspire}.yaml |
| Fig. 11: Per-joint trajectories | scripts/eval_action_curves.py after rollouts | n/a |
| Tab. III smoothness (Acc.↓ / Jerk↓) | scripts/eval_smoothness.py rollouts/*.json --stats_file new_lerobot_stats/dataset_statistics.json | n/a |
| Component | Used for | Link |
|---|---|---|
| LeRobot v2.1 | Real-world data format | github.com/huggingface/lerobot |
| DexMimicGen | Synthetic trajectory synthesis (§III-B) | github.com/NVlabs/DexMimicGen |
| Objaverse / Objaverse-XL | Source of 3D assets for sim | objaverse.allenai.org |
| Qwen2.5-VL | VLM-driven asset mining + physical-property assignment | huggingface.co/Qwen |
| MuJoCo | Digital twin + replay post-validation | mujoco.org |
| RDT-1B | Architectural reference for the Diffusion-Transformer policy | github.com/thu-ml/RoboticsDiffusionTransformer |
| DWBC (Xu et al., ICML'22) | Score → weight mapping (§III-D, ref. [41]) | github.com/ryanxhr/DWBC |
@inproceedings{dexora2026,
title = {Dexora: Open-source VLA for High-DoF Bimanual Dexterity},
author = {Zhang, Zongzheng and Pang, Jingrui and others},
booktitle = {ICRA},
year = {2026}
}

MIT (see LICENSE). Third-party components (SigLIP, T5, LeRobot,
RDT-1B reference) keep their original licenses.