RoboGate: 68-Scenario Adversarial Safety Benchmark + 50K Failure Dictionary + 5-Model VLA Leaderboard #5121

liveplex-cpu · 2026-03-29T12:02:20Z

Hi Isaac Lab community! Following the suggestion from @kellyguo11 on PR #5077, we're sharing RoboGate here.

What is RoboGate?

An open-source pre-deployment safety validation tool for robot manipulation policies, built on NVIDIA Isaac Sim + Newton Physics. It answers: "Is this learned policy safe to deploy on a real production line?"

Key Numbers

68 adversarial scenarios across 4 difficulty tiers (Nominal, Edge Cases, Adversarial, Domain Randomization)
50,000+ experiments across 4 robots (Franka Panda 7-DOF, UR5e 6-DOF, UR3e 6-DOF, UR10e 6-DOF)
5-metric Deployment Confidence Score (0-100) — not just binary pass/fail
Two-Stage Adaptive Sampling — boundary-focused failure discovery (Risk Model AUC 0.780)
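To make the two-stage idea concrete, here is a minimal sketch of boundary-focused sampling. It is not RoboGate's implementation: `simulate_failure` is a hypothetical toy stand-in for an Isaac Sim rollout, the k-nearest-neighbour estimator stands in for the actual risk model, and the parameter ranges are made up for illustration.

```python
import random

def simulate_failure(mass, friction):
    """Hypothetical stand-in for a simulator rollout: True means the grasp failed."""
    # Toy rule (assumption): heavy objects on slippery surfaces fail.
    return mass > 4.0 * friction

def knn_risk(point, history, k=5):
    """Estimate failure probability as the failure fraction among the k nearest past trials."""
    nearest = sorted(
        history,
        key=lambda h: (h[0] - point[0]) ** 2 + (h[1] - point[1]) ** 2,
    )[:k]
    return sum(1 for h in nearest if h[2]) / len(nearest)

def two_stage_sample(n_uniform=200, n_boundary=200, n_candidates=2000, seed=0):
    rng = random.Random(seed)

    def draw():
        # (mass in kg, friction coefficient); ranges are illustrative only.
        return rng.uniform(0.1, 2.0), rng.uniform(0.1, 1.0)

    # Stage 1: uniform coverage of the parameter space.
    stage1 = []
    for _ in range(n_uniform):
        m, f = draw()
        stage1.append((m, f, simulate_failure(m, f)))

    # Stage 2: score a candidate pool with the risk model and run only the
    # candidates whose predicted failure probability is closest to 0.5.
    candidates = [draw() for _ in range(n_candidates)]
    candidates.sort(key=lambda p: abs(knn_risk(p, stage1) - 0.5))
    stage2 = [(m, f, simulate_failure(m, f)) for m, f in candidates[:n_boundary]]
    return stage1, stage2

stage1, stage2 = two_stage_sample()
```

In this toy setup the second-stage trials cluster around the line `mass = 4 * friction`, so the same simulation budget yields many more samples near the success/failure boundary than uniform sampling alone.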

5-Model VLA Leaderboard

Model	Params	SR	Confidence	Failure Pattern
Scripted Controller (IK)	—	100% (68/68)	76/100	—
GR00T N1.6 (NVIDIA)	3B	0% (0/68)	1/100	grasp_miss + collision
OpenVLA (Stanford + TRI)	7B	0% (0/68)	27/100	grasp_miss only, 0 collision
Octo-Base (UC Berkeley)	93M	0% (0/68)	1/100	grasp_miss 79%, collision 21%
Octo-Small (UC Berkeley)	27M	0% (0/68)	1/100	grasp_miss 79.4%, collision 20.6%

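All four VLA models, including NVIDIA's GR00T N1.6, score 0% (0/68) on the same scenarios that a scripted IK controller solves 100% of the time. Because the models span 27M to 7B parameters and all fail, the results point to training-deployment distribution mismatch, not model size, as the bottleneck.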
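For readers who want to check the arithmetic behind the SR and Failure Pattern columns, here is a small sketch. The `summarize` helper and the outcome labels as Python strings are illustrative assumptions, not RoboGate's API. The 54/14 split is inferred from the Octo-Small row, assuming the percentages are taken over all 68 failed scenarios.

```python
from collections import Counter

def summarize(outcomes):
    """Return (success_rate_percent, {failure_label: percent_of_failures})."""
    total = len(outcomes)
    failures = Counter(o for o in outcomes if o != "success")
    n_fail = sum(failures.values())
    success_rate = 100.0 * (total - n_fail) / total
    breakdown = {
        label: round(100.0 * n / n_fail, 1) for label, n in failures.items()
    } if n_fail else {}
    return success_rate, breakdown

# Hypothetical per-scenario outcomes chosen to reproduce the Octo-Small row:
# 54 grasp misses and 14 collisions over 68 scenarios.
octo_small = ["grasp_miss"] * 54 + ["collision"] * 14
print(summarize(octo_small))  # (0.0, {'grasp_miss': 79.4, 'collision': 20.6})
```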

Isaac Lab-Arena Integration

We've submitted a benchmark contribution to Isaac Lab-Arena:

PR: isaac-sim/IsaacLab-Arena#506
Discussion: isaac-sim/IsaacLab-Arena#508
Integrates with ArenaEnvBuilder, supports --mock mode for CI/CD

Links

Resource	Link
Paper	arXiv:2603.22126
Website + Leaderboard	robogate.io/vla
GitHub	liveplex-cpu/robogate
50K+ Dataset	HuggingFace: liveplex/robogate-failure-dictionary
HF Leaderboard Space	liveplex/robogate-vla-leaderboard
Failure Explorer	robogate.io/failures

Why This Matters for the Isaac Sim Ecosystem

RoboGate is designed to complement task-diversity benchmarks (like Lightwheel RoboFinals):

Existing benchmarks ask: "Can the policy do this task?"
RoboGate asks: "Under what exact conditions does it fail, and how dangerous is the failure?"

The 50K+ failure dictionary, built with boundary-focused sampling, maps the conditions (mass, friction, lighting, clutter) under which policies transition from success to failure, with the goal of enabling safer real-world deployment.
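As a rough illustration of how such a dictionary can be read, the sketch below bins trial records along one condition and reports the first bin where the failure rate reaches a threshold. The record fields (`mass_kg`, `failed`) and the sample values are hypothetical; the real dataset schema may differ.

```python
def failure_rate_by_bin(records, key, edges):
    """Failure rate per half-open bin [edges[i], edges[i+1]) for one condition."""
    counts = [[0, 0] for _ in range(len(edges) - 1)]  # [failures, trials] per bin
    for rec in records:
        for i in range(len(edges) - 1):
            if edges[i] <= rec[key] < edges[i + 1]:
                counts[i][0] += int(rec["failed"])
                counts[i][1] += 1
                break
    # Bins with no trials report None rather than a misleading 0.0.
    return [f / n if n else None for f, n in counts]

def transition_bin(rates, threshold=0.5):
    """Index of the first bin whose failure rate reaches the threshold, or None."""
    for i, rate in enumerate(rates):
        if rate is not None and rate >= threshold:
            return i
    return None

# Made-up records for illustration only.
records = [
    {"mass_kg": 0.2, "failed": False},
    {"mass_kg": 0.4, "failed": False},
    {"mass_kg": 0.8, "failed": False},
    {"mass_kg": 0.9, "failed": True},
    {"mass_kg": 1.3, "failed": True},
    {"mass_kg": 1.4, "failed": True},
]
edges = [0.0, 0.5, 1.0, 1.5]
rates = failure_rate_by_bin(records, "mass_kg", edges)
print(rates, transition_bin(rates))  # [0.0, 0.5, 1.0] 1
```

With these made-up records, failures start to dominate in the 0.5 to 1.0 kg bin, which is the kind of success-to-failure transition the dictionary is meant to expose.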

Built on a single RTX 4090 by AgentAI Co., Ltd. We welcome feedback and leaderboard submissions!
