
DexLite: Replicating State-of-the-Art Dexterous Grasping (On a Budget)

Note: This project is an implementation of the grasp synthesis methodology presented in the paper Dex1B: Learning with 1B Demonstrations for Dexterous Manipulation. The neural network architecture and energy functions described herein are based on their published work.

Abstract: This post details my journey building DexLite, a learning-based system for synthesizing dexterous grasps on a Shadow Hand. By adapting the massive-scale Dex1B pipeline for a standard laptop GPU, I explore the intersection of generative deep learning and physics-based optimization.


The "Fine, I'll Do It Myself" Moment

A few months ago, researchers from UC San Diego released a fascinating paper titled Dex1B: Learning with 1B Demonstrations for Dexterous Manipulation. It proposed a massive-scale approach to learning dexterous manipulation, utilizing a dataset of one billion demonstrations to solve complex grasping and articulation tasks.

I read it, thought it was brilliant, and immediately went hunting for the code. Result: No code online.

So I decided to implement the paper myself, or at least as much of it as I could in simulation. I wanted to understand the nuts and bolts of how they achieved such high-quality results by combining optimization with generative models. I call my implementation DexLite—a lightweight, accessible version of their massive pipeline (with some corners cut due to hardware constraints).


The Challenge: High-DOF Grasping

Why is this hard? Unlike a simple parallel-jaw gripper, the Shadow Hand is essentially a robotic human hand: 22 actuated finger joints plus a 6-DoF wrist pose, which makes it incredibly challenging to control effectively.

Mathematically, given an object represented by a point cloud $\mathcal{P} \in \mathbb{R}^{N \times 3}$, we need to find a hand configuration $\mathbf{q} \in \mathbb{R}^{28}$ that results in a stable, force-closure grasp:

$$\mathbf{q} = [\mathbf{t}, \boldsymbol{\theta}_{\text{rot}}, \boldsymbol{\theta}_{\text{joints}}]$$

where $\mathbf{t} \in \mathbb{R}^3$ is the wrist translation, $\boldsymbol{\theta}_{\text{rot}} \in \mathbb{R}^3$ is the wrist rotation (Euler angles), and $\boldsymbol{\theta}_{\text{joints}} \in \mathbb{R}^{22}$ contains the 22 joint angles for the Shadow Hand's fingers.

Finding a valid setting of these 28 parameters that yields a stable grasp is an optimization problem, and pure optimization is far too slow for generating large datasets. The Dex1B paper instead turns to generative models, and identifies the two key issues with them: feasibility (generated grasps have lower success rates than optimized ones) and diversity (generative models tend to interpolate existing data rather than expand it).
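
For concreteness, here is a tiny sketch of how that 28-dimensional configuration vector can be handled in code. The [translation | rotation | joints] ordering and the helper name are assumptions for illustration only; the real layout follows the hand model's convention.

```python
import numpy as np

def split_config(q: np.ndarray):
    """Split a 28-D Shadow Hand configuration into its three blocks (illustrative ordering)."""
    assert q.shape == (28,)
    t = q[:3]             # wrist translation (x, y, z), meters
    theta_rot = q[3:6]    # wrist rotation as Euler angles, radians
    theta_joints = q[6:]  # 22 finger joint angles, radians
    return t, theta_rot, theta_joints
```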


How DexLite Works

My implementation follows the core philosophy of the paper, integrating geometric constraints into a generative model. The pipeline consists of three main stages:

  1. The Neural Network: The heart of the system is a conditional generative model. It starts with PointNet, which processes the object's point cloud to extract global geometric features and local features for specific surface points.

    PointNet uses a hierarchical architecture with 1D convolutions (3→64→128→1024→256 channels) followed by symmetric max pooling to achieve permutation invariance:

    $$\mathbf{f}_{\text{global}} = \max_{i \in [1,N]} \text{MLP}(\mathbf{p}_i) \in \mathbb{R}^{256}$$

    These features are fed into a Conditional Variational Autoencoder (CVAE), which learns the conditional distribution $p(\mathbf{q}|\mathcal{P})$ of grasps given object geometry.

    Network Architecture

    The CVAE uses the reparameterization trick so gradients can flow through the sampling step:

    Encoder (training only): $q_\phi(\mathbf{z}|\mathbf{q}, \mathbf{c}) = \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\sigma}^2)$

    Sampling: $\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon}$, where $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$

    Decoder: $p_\theta(\mathbf{q}|\mathbf{z}, \mathbf{c})$

    The CVAE structure allows for two distinct modes of operation (a minimal code sketch of both modes appears after this pipeline overview):

    • Dataset Expansion: During training, we can feed existing valid grasps along with the object features into the Encoder to map them to the latent space. By slightly varying the "associated point" (the target point on the object that the grasp approaches) or the latent vector, we can decode variations of known successful grasps, effectively multiplying our dataset.
    • Pure Synthesis: For generating completely new grasps at inference time, we bypass the Encoder entirely. We sample random noise $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and feed it into the Decoder along with the object features. The Decoder then "hallucinates" a valid grasp configuration from scratch.
  2. The Losses: You can't train this on reconstruction error (MSE) alone. To ensure the generated hands don't look like spaghetti or clip through the object, I implemented a comprehensive set of energy functions (a sketch of a few of them appears after this pipeline overview):

    • Reconstruction Loss: $\mathcal{L}_{\text{recon}} = \frac{1}{B} \sum_{b=1}^{B} \|\mathbf{q}_b - \hat{\mathbf{q}}_b\|_2^2$ - Keeps the generated grasp close to the ground truth during training.

    • KL Divergence: $\mathcal{L}_{\text{KL}} = -\frac{1}{2B} \sum_{b=1}^{B} \sum_{i=1}^{256} \left(1 + \log \sigma_{b,i}^2 - \mu_{b,i}^2 - \sigma_{b,i}^2\right)$ - Regularizes the latent space so we can sample from it later.

    • Force Closure: $E_{\text{fc}} = \|\mathbf{G}\mathbf{n}\|_2^2$ - Ensures the grasp is physically stable and resists external wrenches. Here $\mathbf{n} \in \mathbb{R}^{3K}$ stacks the contact normals and $\mathbf{G}$ is the grasp matrix that maps contact forces to object wrenches:

    $$\mathbf{G} = \begin{bmatrix} \mathbf{I}_3 & \mathbf{I}_3 & \cdots & \mathbf{I}_3 \\ [\mathbf{c}_1]_\times & [\mathbf{c}_2]_\times & \cdots & [\mathbf{c}_K]_\times \end{bmatrix}$$

    where $[\mathbf{c}_k]_\times$ is the skew-symmetric matrix for the cross product at contact point $\mathbf{c}_k$.

    • Penetration Penalty: $E_{\text{pen}} = \sum_{m=1}^{M} \max(0, d_{\text{hand}}(\mathbf{p}_m) + 1)$ - Uses Signed Distance Functions (SDF) to punish fingers for clipping inside the object mesh. The +1 provides a safety margin.

    • Contact Distance: $E_{\text{dis}} = \sum_{k=1}^{K} |\phi(\mathbf{c}_k)|$ - Acts as a magnet, pulling fingertips towards the object surface to ensure contact. $\phi(\mathbf{c}_k)$ is the signed distance from contact point $k$ to the nearest object surface.

    • Self-Penetration: $E_{\text{spen}} = \sum_{(i,j) \in \mathcal{C}} \max(0, -d_{ij})$ - Prevents the hand from colliding with itself, where $d_{ij}$ is the minimum (signed) distance between finger links $i$ and $j$.

    • Joint Limits: $E_{\text{joints}} = \sum_{i=1}^{22} \max(0, q_i - q_i^{+}) + \max(0, q_i^{-} - q_i)$ - Ensures the hand doesn't bend in physically impossible ways.

    The total training loss combines all these terms:

    $$\mathcal{L}_{\text{total}} = w_{\text{recon}} \mathcal{L}_{\text{recon}} + w_{\text{KL}} \mathcal{L}_{\text{KL}} + w_{\text{fc}} E_{\text{fc}} + w_{\text{dis}} E_{\text{dis}} + w_{\text{pen}} E_{\text{pen}} + w_{\text{spen}} E_{\text{spen}} + w_{\text{joints}} E_{\text{joints}}$$

    I used weights of $w_{\text{recon}}=1.0$, $w_{\text{KL}}=0.001$ (to prevent posterior collapse), $w_{\text{fc}}=1.0$, $w_{\text{dis}}=100.0$, $w_{\text{pen}}=100.0$, $w_{\text{spen}}=10.0$, and $w_{\text{joints}}=1.0$.

  3. Post-Optimization: A final optimization step that fine-tunes the fingers to ensure solid contact, minimizing the energy function $E_{\text{post}}$.
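
To make the generative stage concrete, here is a minimal PyTorch sketch of the conditional VAE with the layer sizes from the appendix tables. The class and method names are mine, and the 512-dimensional conditioning vector is assumed to be the concatenated global and local PointNet features; `forward` illustrates the dataset-expansion mode and `sample` illustrates pure synthesis.

```python
import torch
import torch.nn as nn

class GraspCVAE(nn.Module):
    """Minimal sketch of the conditional VAE (layer sizes follow the appendix tables)."""
    def __init__(self, q_dim=28, cond_dim=512, latent_dim=256):
        super().__init__()
        # Encoder: (grasp, object features) -> latent Gaussian parameters
        self.encoder = nn.Sequential(
            nn.Linear(q_dim + cond_dim, 256), nn.ReLU(),
            nn.Linear(256, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
        )
        self.fc_mu = nn.Linear(256, latent_dim)
        self.fc_logvar = nn.Linear(256, latent_dim)
        # Decoder: (latent, object features) -> grasp configuration
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + cond_dim, 256), nn.ReLU(),
            nn.Linear(256, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, q_dim),
        )

    def forward(self, q, cond):
        """Training / dataset-expansion mode: encode a known grasp, then decode it."""
        h = self.encoder(torch.cat([q, cond], dim=-1))
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        std = torch.exp(0.5 * logvar)
        z = mu + std * torch.randn_like(std)      # reparameterization trick
        q_hat = self.decoder(torch.cat([z, cond], dim=-1))
        return q_hat, mu, logvar

    @torch.no_grad()
    def sample(self, cond):
        """Pure-synthesis mode: bypass the encoder and decode from random noise."""
        z = torch.randn(cond.shape[0], self.fc_mu.out_features, device=cond.device)
        return self.decoder(torch.cat([z, cond], dim=-1))
```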
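
Likewise, a hedged sketch of a few of the energy terms as differentiable PyTorch functions. The grasp matrix follows the definition above, but the SDF queries and contact-point selection are assumed to happen elsewhere, batching is omitted, and the function names are purely illustrative.

```python
import torch

def skew(v):
    """Skew-symmetric matrix [v]_x such that [v]_x u = v x u."""
    zero = torch.zeros((), dtype=v.dtype, device=v.device)
    return torch.stack([
        torch.stack([zero, -v[2], v[1]]),
        torch.stack([v[2], zero, -v[0]]),
        torch.stack([-v[1], v[0], zero]),
    ])

def force_closure_energy(contacts, normals):
    """E_fc = ||G n||^2 with G = [[I ... I], [[c_1]_x ... [c_K]_x]]; inputs are (K, 3)."""
    K = contacts.shape[0]
    top = torch.eye(3, dtype=contacts.dtype, device=contacts.device).repeat(1, K)  # (3, 3K)
    skews = torch.stack([skew(c) for c in contacts], dim=0)                        # (K, 3, 3)
    bottom = skews.permute(1, 0, 2).reshape(3, 3 * K)                              # (3, 3K)
    G = torch.cat([top, bottom], dim=0)                                            # (6, 3K)
    n = normals.reshape(3 * K, 1)
    return (G @ n).pow(2).sum()

def joint_limit_energy(q_joints, q_min, q_max):
    """E_joints: hinge penalty for exceeding the 22 joint limits."""
    return (torch.relu(q_joints - q_max) + torch.relu(q_min - q_joints)).sum()

def contact_distance_energy(contact_sdf_values):
    """E_dis: pull contact points onto the object surface, |phi(c_k)| summed over contacts."""
    return contact_sdf_values.abs().sum()
```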


Key Implementation Differences (The "Lite" Part)

The original Dex1B pipeline is designed to generate one billion demonstrations using massive compute clusters. My constraints were slightly different: I am running this on a laptop with an RTX 4050.

To make this feasible, I had to be smarter about my data:

  • Curated Data vs. Raw Generation: The paper generates a seed dataset of ~5 million poses using pure optimization. Instead of burning my GPU for weeks, I curated a high-quality subset from the existing DexGraspNet dataset.

  • Rigorous Filtering: I built a custom validation pipeline using PyBullet and MuJoCo (a sketch of the stability check appears after this list):

    • Stability Test (PyBullet): Objects must remain stable on a table after 2 seconds of simulation. Criteria: lateral displacement $< 0.05$ m and angular change $< 7°$. This filtered out ~45% of objects (spheres, thin plates, etc.), leaving ~3,000 stable objects from the original ~5,500.
    • Lift Test (MuJoCo): Each grasp is tested by lifting the object 1 m upward. Success requires a lift ratio $> 0.9$:

    $$r_{\text{lift}} = \frac{z_{\text{object}}^{\text{end}} - z_{\text{object}}^{\text{start}}}{z_{\text{hand}}^{\text{end}} - z_{\text{hand}}^{\text{start}}} > 0.9$$

    This achieved a ~90% pass rate on ~550,000 grasps, yielding ~495,000 high-quality training examples.
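
For reference, here is a minimal sketch of the kind of PyBullet stability check described above, assuming each object ships as a URDF. The thresholds mirror the 0.05 m / 7° criteria; the loading and table setup in the actual pipeline differ in detail.

```python
import numpy as np
import pybullet as p
import pybullet_data

def object_is_stable(urdf_path, sim_seconds=2.0, max_disp=0.05, max_angle_deg=7.0):
    """Drop an object onto a plane and check that it stays put (stability-test sketch)."""
    p.connect(p.DIRECT)                                    # headless physics server
    p.setAdditionalSearchPath(pybullet_data.getDataPath())
    p.setGravity(0, 0, -9.81)
    p.loadURDF("plane.urdf")                               # stand-in for the table
    obj = p.loadURDF(urdf_path, basePosition=[0, 0, 0.05])

    pos0, orn0 = p.getBasePositionAndOrientation(obj)
    for _ in range(int(sim_seconds * 240)):                # default timestep is 1/240 s
        p.stepSimulation()
    pos1, orn1 = p.getBasePositionAndOrientation(obj)

    # Lateral displacement in the table plane
    disp = np.linalg.norm(np.array(pos1[:2]) - np.array(pos0[:2]))
    # Relative rotation angle between start and end orientations (quaternions are [x, y, z, w])
    dq = p.getDifferenceQuaternion(orn0, orn1)
    angle_deg = np.degrees(2 * np.arccos(np.clip(abs(dq[3]), 0.0, 1.0)))
    p.disconnect()

    return disp < max_disp and angle_deg < max_angle_deg
```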


The Secret Sauce: Post-Optimization

Here was the big takeaway from this project: no matter what, the neural network alone is not enough.

The raw output from the CVAE is good, but it suffers from lower success rates than deterministic models. It gets close to a successful grasp without quite achieving it: maybe a finger penetrates the object, or maybe it never makes contact. Tiny adjustments are often all that separate failure from success.

I implemented the post-optimization step suggested in the paper. It takes the sampled hand poses and refines them using gradient-based optimization of the same energy function:

$$E_{\text{post}} = E_{\text{fc}} + w_{\text{dis}} E_{\text{dis}} + w_{\text{pen}} E_{\text{pen}} + w_{\text{spen}} E_{\text{spen}} + w_{\text{joints}} E_{\text{joints}}$$

The optimization uses RMSProp with simulated annealing to escape local minima:

$$\mathbf{v}_t = \mu \mathbf{v}_{t-1} + (1-\mu)(\nabla E)^2, \qquad \mathbf{q}_{t+1} = \mathbf{q}_t - \frac{\alpha}{\sqrt{\mathbf{v}_t} + \epsilon} \nabla E$$

with acceptance probability $P(\text{accept}) = \exp\left(\frac{E_{\text{old}} - E_{\text{new}}}{T}\right)$ and temperature decay $T_k = T_0 \cdot \gamma^{\lfloor k/n_{\text{period}} \rfloor}$ (parameters: $T_0=18$, $\gamma=0.95$, $n_{\text{period}}=30$).
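
A minimal sketch of that refinement loop is below, assuming a differentiable `total_energy(q)` that evaluates $E_{\text{post}}$ for a single grasp. The update and accept/reject rule mirror the equations above, but the helper itself is illustrative; a real implementation would batch this across many grasps at once.

```python
import math
import torch

def post_optimize(q_init, total_energy, iters=200, alpha=0.005, mu=0.98,
                  eps=1e-8, t0=18.0, gamma=0.95, n_period=30):
    """Refine one grasp with RMSProp-style updates plus simulated annealing (sketch)."""
    q = q_init.detach().clone()
    v = torch.zeros_like(q)                                # running average of squared gradients
    e_old = float(total_energy(q))
    for k in range(iters):
        q_var = q.clone().requires_grad_(True)
        grad, = torch.autograd.grad(total_energy(q_var), q_var)

        # RMSProp-style step
        v = mu * v + (1 - mu) * grad.pow(2)
        q_new = (q - alpha / (v.sqrt() + eps) * grad).detach()
        e_new = float(total_energy(q_new))

        # Metropolis-style accept/reject with a decaying temperature
        temperature = t0 * gamma ** (k // n_period)
        if e_new < e_old or torch.rand(1).item() < math.exp((e_old - e_new) / temperature):
            q, e_old = q_new, e_new                        # accept; otherwise keep the old pose
    return q
```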

I run 200 iterations of this optimization, which takes only ~2 seconds per grasp compared to minutes for optimization from scratch. The results speak for themselves:

  • Raw Network Output: ~55% Success Rate (Grasps often loose or clipping).
  • With Post-Optimization: ~79% Success Rate (Tight, physically valid grasps).

As the paper notes, this hybrid approach leverages the best of both worlds: optimization ensures physical plausibility, while the generative model enables efficiency and provides semantically meaningful initializations that converge ~3x faster than random starts.


Conclusion & Future Work

Replicating Dex1B was a lesson in the importance of hybrid approaches. Deep learning provides the intuition, and classical optimization provides the precision.

I’m planning to extend this work by incorporating "Graspness" (learning which parts of an object are graspable) and potentially moving to dual-hand manipulation.



Appendix: Hyperparameters

Network Architecture

PointNet:

| Layer | Input Channels | Output Channels | Activation |
|---|---|---|---|
| Conv1 | 3 | 64 | ReLU + BN |
| Conv2 | 64 | 128 | ReLU + BN |
| Conv3 | 128 | 1024 | ReLU + BN |
| Conv4 | 1024 | 256 | BN (no activation) |
| Max Pool | 256 | 256 | Global max over points |

CVAE Encoder:

| Layer | Input Dim | Output Dim | Activation |
|---|---|---|---|
| FC1 | 540 (28 + 512) | 256 | ReLU |
| FC2 | 256 | 512 | ReLU |
| FC3 | 512 | 256 | ReLU |
| FC_μ | 256 | 256 | None |
| FC_σ | 256 | 256 | None |

CVAE Decoder:

| Layer | Input Dim | Output Dim | Activation |
|---|---|---|---|
| FC1 | 768 (256 + 512) | 256 | ReLU |
| FC2 | 256 | 512 | ReLU |
| FC3 | 512 | 256 | ReLU |
| FC_out | 256 | 28 | None |

Training Hyperparameters

| Parameter | Value | Description |
|---|---|---|
| Batch Size | 64 | Number of samples per batch |
| Learning Rate | 0.0001 | Adam optimizer learning rate |
| Epochs | 100 | Total training epochs |
| Latent Dimension | 256 | Dimension of latent space $\mathbf{z}$ |
| Point Cloud Size | 2048 | Number of points sampled from object |

Loss Weights

| Weight | Value | Purpose |
|---|---|---|
| $w_{\text{recon}}$ | 1.0 | Reconstruction fidelity |
| $w_{\text{KL}}$ | 0.001 | Latent space regularization |
| $w_{\text{fc}}$ | 1.0 | Force closure constraint |
| $w_{\text{dis}}$ | 100.0 | Contact distance penalty |
| $w_{\text{pen}}$ | 100.0 | Object penetration penalty |
| $w_{\text{spen}}$ | 10.0 | Self-penetration penalty |
| $w_{\text{joints}}$ | 1.0 | Joint limit violation penalty |

Post-Optimization Parameters

| Parameter | Value | Description |
|---|---|---|
| Iterations | 200 | Number of optimization steps |
| Step Size ($\alpha$) | 0.005 | Initial learning rate |
| RMSProp Momentum ($\mu$) | 0.98 | Momentum for squared gradient |
| RMSProp Epsilon ($\epsilon$) | $10^{-8}$ | Numerical stability constant |
| Initial Temperature ($T_0$) | 18 | Starting annealing temperature |
| Temperature Decay ($\gamma$) | 0.95 | Decay factor per period |
| Annealing Period ($n_{\text{period}}$) | 30 | Steps between temperature updates |
| Contact Switch Probability | 0.5 | Probability of switching contact points |