Recent visuo-tactile grasping research has increasingly emphasized the value of combining tactile sensing with vision to enhance in-contact manipulation. However, most existing approaches:
- rely on 2D feature-space fusion, lacking explicit spatial alignment,
- rely on datasets with limited annotation density, and
- do not provide ground-truth visuo-tactile correspondence in 3D.
To overcome these limitations, we propose a complete, explicitly aligned 3D visuo-tactile learning setup consisting of:
- a large-scale multimodal grasp dataset,
- a unified 3D sensory alignment & reconstruction pipeline,
- a shape completion module to recover occluded geometry,
- a geometry-aware stability prediction network (SGA-GSN).
| Component | Contents |
|---|---|
| 📁 3DA-VTG Dataset | sensory/pose/stability data collected in simulation |
| 🧰 Dataset APIs | aligned data loading, tactile depth tools, 3D reconstruction scripts |
| 🧠 Full Framework Code | unified 3D pipeline + shape completion integration |
| 🌀 SGA-GSN Network | training, testing, inference, weights |
🔔 All components will be released progressively following paper acceptance. ⭐ Please Star & Watch the repository for updates.
The 3DA-VTG dataset is constructed using a simplified robot handover scenario in simulation to provide dense, explicitly aligned visuo-tactile data.
- 440K grasp trials, ~5K per object
- 88 objects from GraspNet-1Billion (YCB, DexNet 2.0, and self-collected objects)
- Each grasp sample includes:
  - RGB-D visual observation
  - Dual GelSight tactile RGB-D images
  - 6-DoF extrinsic parameters (camera, GelSight sensors, object)
  - Unified visuo-tactile 3D point cloud (assembly sketched below)
  - Stable / unstable grasp outcome
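As an illustration of how these per-sample fields combine, the following is a minimal sketch that back-projects the visual and tactile depth maps through their 6-DoF extrinsics into one world-frame cloud. The field names, the intrinsics layout, and the assumption that each extrinsic maps its sensor frame to the world frame are illustrative; the released dataset API may differ.

```python
# Hedged sketch: assembling the unified visuo-tactile point cloud of one grasp
# sample from its stored fields. Field names and conventions are assumptions,
# not the released dataset API.
import numpy as np

def depth_to_points(depth, K, T_world_sensor):
    """Back-project a depth map (meters) into a world-frame point cloud.

    depth:          (H, W) float array, 0 where invalid.
    K:              (3, 3) pinhole intrinsic matrix.
    T_world_sensor: (4, 4) extrinsic transform (sensor frame -> world frame).
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = depth > 0
    z = depth[valid]
    x = (u[valid] - K[0, 2]) * z / K[0, 0]
    y = (v[valid] - K[1, 2]) * z / K[1, 1]
    pts = np.stack([x, y, z], axis=-1)                                 # (N, 3)
    pts_h = np.concatenate([pts, np.ones((pts.shape[0], 1))], axis=1)  # (N, 4)
    return (T_world_sensor @ pts_h.T).T[:, :3]

def unified_cloud(sample):
    """Merge the camera view and both GelSight views into one world-frame cloud.

    Treating each GelSight height map as a small depth camera is a simplification.
    """
    clouds = [depth_to_points(sample["rgbd_depth"], sample["K_cam"],
                              sample["T_world_cam"])]
    for side in ("left", "right"):
        clouds.append(depth_to_points(sample[f"gel_depth_{side}"],
                                      sample[f"K_gel_{side}"],
                                      sample[f"T_world_gel_{side}"]))
    return np.concatenate(clouds, axis=0)
```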
The proposed framework performs explicit spatial fusion, rather than feature-space concatenation, through three stages:
- SAM-prompted visual segmentation (see the sketch after this list)
- Transformer-based tactile depth estimation
- Reconstruction of all modalities into a common world frame
- Output: aligned sensory 3D point clouds
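As a concrete illustration of the first stage, below is a minimal sketch of SAM-prompted object segmentation using the public `segment-anything` package. The checkpoint path, the single point prompt, and the way the mask is applied to the depth map are illustrative assumptions, not the released pipeline.

```python
# Minimal sketch of stage 1 (SAM-prompted visual segmentation), assuming the
# public segment-anything API; prompt strategy and paths are illustrative.
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

def segment_object_depth(rgb, depth, prompt_uv, checkpoint="sam_vit_h.pth"):
    """Return the depth map restricted to the SAM-segmented object.

    rgb:       (H, W, 3) uint8 RGB image.
    depth:     (H, W) float depth map in meters.
    prompt_uv: (u, v) pixel coordinate on the target object.
    """
    sam = sam_model_registry["vit_h"](checkpoint=checkpoint)
    predictor = SamPredictor(sam)
    predictor.set_image(rgb)
    masks, scores, _ = predictor.predict(
        point_coords=np.array([prompt_uv], dtype=np.float32),
        point_labels=np.array([1]),        # 1 = foreground prompt
        multimask_output=False,
    )
    object_mask = masks[0]                 # (H, W) boolean mask
    # The masked depth is then back-projected and merged with the estimated
    # tactile depth maps (stages 2-3) into the common world frame.
    return np.where(object_mask, depth, 0.0)
```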
The shape completion module:
- Based on AdaPoinTr
- Balanced sampling between visual & tactile inputs (sampling sketched after this list)
- Completes occluded geometry that is invisible to the RGB camera
- Output: a full object point cloud
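One possible reading of the balanced-sampling step is sketched below: draw equal point budgets from the visual and tactile clouds so the small contact patches are not drowned out by the much larger visual cloud. The 1:1 ratio, the point budget, and the function name are illustrative assumptions, not the released implementation.

```python
# Hedged sketch of balanced sampling between visual and tactile point clouds
# before shape completion; ratio and budget are illustrative assumptions.
import numpy as np

def balanced_sample(visual_pts, tactile_pts, n_points=2048, rng=None):
    """Draw an equal number of points from each modality.

    visual_pts, tactile_pts: (N, 3) arrays in the common world frame.
    Samples with replacement when a modality has fewer points than its share.
    """
    if rng is None:
        rng = np.random.default_rng()

    def draw(pts, k):
        idx = rng.choice(len(pts), size=k, replace=len(pts) < k)
        return pts[idx]

    half = n_points // 2
    # Tactile patches are tiny but encode the contact geometry; giving them the
    # same budget as the visual cloud keeps the completion input balanced.
    return np.concatenate([draw(visual_pts, half),
                           draw(tactile_pts, n_points - half)], axis=0)
```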
The SGA-GSN stability prediction network:
- Dual-branch feature extraction (contact vs. shape)
- Geometry-aware cross-attention fusion
- Multi-resolution spatial reasoning
- Output: binary stability label
📌 SGA-GSN serves as the stability prediction module of the 3DA-VTG framework; a minimal architecture sketch follows.
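For intuition, here is a hedged PyTorch sketch of the dual-branch, cross-attention design described above. The PointNet-style per-point encoders, the layer sizes, and the use of `nn.MultiheadAttention` are illustrative assumptions, not the released SGA-GSN architecture.

```python
# Hedged sketch of a dual-branch, cross-attention stability predictor in the
# spirit of SGA-GSN; all layer choices are illustrative assumptions.
import torch
import torch.nn as nn

class PointBranch(nn.Module):
    """Per-point MLP encoder (PointNet-style) producing token features."""
    def __init__(self, dim=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(),
                                 nn.Linear(64, dim), nn.ReLU())

    def forward(self, pts):            # pts: (B, N, 3)
        return self.mlp(pts)           # (B, N, dim)

class SGAGSNSketch(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.contact_branch = PointBranch(dim)   # tactile / contact points
        self.shape_branch = PointBranch(dim)     # completed object shape
        # Geometry-aware fusion: contact tokens attend to shape tokens.
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(),
                                  nn.Linear(64, 1))  # stability logit

    def forward(self, contact_pts, shape_pts):
        q = self.contact_branch(contact_pts)          # (B, Nc, dim)
        kv = self.shape_branch(shape_pts)             # (B, Ns, dim)
        fused, _ = self.cross_attn(q, kv, kv)         # (B, Nc, dim)
        pooled = fused.max(dim=1).values              # global max pooling
        return self.head(pooled).squeeze(-1)          # (B,) logits

# Usage: logits = SGAGSNSketch()(contact_pts, shape_pts); stable = logits > 0
```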
Grasp stability prediction results:

| Split | Accuracy | F1 Score |
|---|---|---|
| Seen Objects | 81.9% | 79.3% |
| Unseen Objects | 80.0% | 80.7% |
If you find this framework or dataset useful, please cite:

