Go2 Quadruped Locomotion
Abstract: This report details the implementation of a locomotion control pipeline for the Unitree Go2 quadruped. By leveraging the Genesis physics engine for high-fidelity simulation and the RSL-RL library for efficient training, I developed a policy capable of robust walking, running, and jumping. The system uses Proximal Policy Optimization (PPO) to learn complex gaits in parallelized environments.
1. Introduction: Learning to Walk
Some time ago, I started delving into quadrupeds and stumbled upon the rsl_rl library from ETH Zurich's Robotic Systems Lab (RSL). I wanted to see if I could use that framework to implement teleoperated locomotion on the Unitree Go2, purely as a learning exercise.
The goal was straightforward: use the existing tools to train, from scratch, a neural network that pilots the 12-DOF robot, enabling it to track velocity commands and perform dynamic maneuvers like jumping.
1.1 System Architecture
The project leverages existing robust tools to build a specific application:
| Component | Role |
|---|---|
| Genesis | The physics simulation backend. Handles rigid body dynamics, collisions, and contacts. |
| RSL-RL | The RL framework. I used this library to handle the PPO algorithm and training loop. |
| QuadrupedEnv | The task implementation. A custom environment I wrote to interface Genesis with the RSL-RL framework. |
Training is parallelized. Instead of running one robot at a time, I spawn 4096 environments on the GPU. This allows the agent to collect experience steps much faster.
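As a rough sketch of what this setup looks like with Genesis's Python API (the URDF path and simulation options below are illustrative, not the exact project configuration):

```python
import genesis as gs

gs.init(backend=gs.gpu)

scene = gs.Scene(
    sim_options=gs.options.SimOptions(dt=0.02),  # 50 Hz simulation step (illustrative)
    show_viewer=False,
)
scene.add_entity(gs.morphs.Plane())                                      # ground plane
robot = scene.add_entity(gs.morphs.URDF(file="urdf/go2/urdf/go2.urdf"))  # path is illustrative

# Building with n_envs replicates the scene on the GPU, so a single
# scene.step() advances all 4096 robots in parallel.
scene.build(n_envs=4096)
```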
2. The Simulation World
2.1 The Robot Model
The Unitree Go2 is a 12-DOF robot (3 actuators per leg). It is loaded into the Genesis simulation from its URDF description.
The joint configuration follows a standard quadruped layout:
- Abduction/Adduction (Hip)
- Hip Flexion/Extension (Thigh)
- Knee Flexion/Extension (Calf)
I initialize the robot in a standing pose using default joint angles that I chose by hand.
Control is achieved via PD position control at the joint level. The policy outputs residual target angles, which are added to the default standing pose.
PD Control Law
The torque applied to each joint follows a standard PD controller:
$$\tau = K_p \left(q^{*} - q\right) - K_d \,\dot{q}$$

where:
- $K_p$ is the proportional gain
- $K_d$ is the derivative gain
- $q^{*}$ is the target joint position from the policy
- $q$ and $\dot{q}$ are the current joint position and velocity
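In tensor form, the same law can be written as a small batched helper; the gains here are placeholders rather than the values used in training:

```python
import torch

def pd_torque(q_target, q, dq, kp=20.0, kd=0.5):
    """Batched PD law: tau = Kp * (q* - q) - Kd * dq.

    q_target, q, dq have shape (num_envs, 12); kp and kd are placeholder gains.
    """
    return kp * (q_target - q) - kd * dq

# The policy outputs residuals around the default standing pose:
# q_target = default_joint_angles + action_scale * action
```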
3. Observation & Action Space
3.1 Observations ($o_t \in \mathbb{R}^{48}$)
The policy receives a 48-dimensional vector containing proprioceptive data and task commands:
- Body State: Base angular velocity, projected gravity vector (to know which way is down).
- Joint State: Positions (error relative to default) and velocities.
- History: The previous action taken (crucial for smoothness).
- Commands: Target velocities ($v_x^{\text{cmd}}$, $v_y^{\text{cmd}}$, $\omega_z^{\text{cmd}}$), the target base height, and jump flags (see the assembly sketch below).
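A minimal sketch of how this 48-dimensional vector might be assembled; the ordering, the absence of observation scaling, and the 6-dimensional command layout are assumptions chosen so the dimensions add up:

```python
import torch

def build_observation(base_ang_vel, projected_gravity, commands,
                      dof_pos, default_dof_pos, dof_vel, last_action):
    """Concatenate proprioception and commands into a 48-dim observation.

    Per-environment shapes (N = num_envs):
      base_ang_vel      (N, 3)
      projected_gravity (N, 3)
      commands          (N, 6)   # vx, vy, wz, target height, jump flags (assumed layout)
      dof_pos, dof_vel  (N, 12)
      last_action       (N, 12)
    Total: 3 + 3 + 6 + 12 + 12 + 12 = 48
    """
    return torch.cat([
        base_ang_vel,
        projected_gravity,
        commands,
        dof_pos - default_dof_pos,   # joint positions as error relative to default
        dof_vel,
        last_action,                 # previous action, kept for smoothness
    ], dim=-1)
```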
3.2 Actions ($a_t \in \mathbb{R}^{12}$)
The network outputs 12 values corresponding to target joint positions.
To bridge the "sim-to-real" gap, I implemented a one-step action delay. The action computed at time $t$ is not applied until $t+1$, mimicking the communication latency of real hardware.
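A minimal sketch of the delay, assuming it is implemented as a one-slot buffer inside the environment (the class name is hypothetical):

```python
import torch

class ActionDelayBuffer:
    """Holds the most recent action so it is applied one control step late."""

    def __init__(self, num_envs, num_actions, device="cuda"):
        self.pending = torch.zeros(num_envs, num_actions, device=device)

    def push(self, action):
        # Return last step's action for execution and store the new one for t+1.
        delayed = self.pending
        self.pending = action.clone()
        return delayed
```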
4. Rewards
4.1 Tracking Rewards
The primary objective is to follow the user's command.
Linear Velocity Tracking
$$r_{\text{lin}} = w_{\text{lin}} \exp\!\left(-\frac{\lVert \mathbf{v}^{\text{cmd}}_{xy} - \mathbf{v}_{xy} \rVert^2}{\sigma_{\text{lin}}}\right)$$

where $w_{\text{lin}}$ is the reward weight and $\sigma_{\text{lin}}$ sets the tracking tolerance.
Angular Velocity Tracking
$$r_{\text{ang}} = w_{\text{ang}} \exp\!\left(-\frac{\left(\omega^{\text{cmd}}_{z} - \omega_{z}\right)^2}{\sigma_{\text{ang}}}\right)$$

where $w_{\text{ang}}$ is the reward weight and $\sigma_{\text{ang}}$ sets the tracking tolerance.
Height Tracking
$$r_{\text{height}} = w_{h} \exp\!\left(-\frac{\left(h^{\text{cmd}} - h\right)^2}{\sigma_{h}}\right)$$

where $w_{h}$ is the reward weight and $\sigma_{h}$ sets the tracking tolerance.
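All three tracking terms share the same exponential kernel, so they can be expressed with one helper; the weights and sigmas stay symbolic, as above:

```python
import torch

def tracking_reward(cmd, actual, sigma, weight):
    """weight * exp(-||cmd - actual||^2 / sigma), batched over environments.

    cmd and actual have shape (N, k); scalar targets (yaw rate, height)
    are passed as (N, 1) tensors."""
    err = torch.sum(torch.square(cmd - actual), dim=-1)
    return weight * torch.exp(-err / sigma)

# Example: linear velocity tracking over the xy components of the base velocity.
# r_lin = tracking_reward(commands[:, :2], base_lin_vel[:, :2], sigma_lin, w_lin)
```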
4.2 Survival & Style Penalties
To prevent the robot from flailing or damaging itself, I added:
Action Smoothness
$$r_{\text{smooth}} = -w_{\text{smooth}} \,\lVert a_t - a_{t-1} \rVert^2$$

where $w_{\text{smooth}}$ is the penalty weight. This penalizes jerky motions.
Nominal Pose Regularization
$$r_{\text{pose}} = -w_{\text{pose}} \,\lVert q - q_{\text{default}} \rVert^2$$

where $w_{\text{pose}}$ is the penalty weight. This keeps the robot near its natural standing configuration.
Vertical Stability (Non-Jump)
$$r_{\text{vert}} = -w_{\text{vert}} \, v_z^2$$

where $w_{\text{vert}}$ is the penalty weight and $v_z$ is the vertical velocity of the base.
Joint Torque Penalty
$$r_{\tau} = -w_{\tau} \,\lVert \boldsymbol{\tau} \rVert^2$$

where $w_{\tau}$ is the penalty weight. This encourages energy-efficient gaits.
Base Orientation Penalty
$$r_{\text{ori}} = -w_{\text{ori}} \,\lVert \mathbf{g}_{xy} \rVert^2$$

where $w_{\text{ori}}$ is the penalty weight and $\mathbf{g}_{xy}$ is the lateral (xy) component of the projected gravity vector. This keeps the body level.
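Taken together, the style terms are just weighted quadratic penalties; a compact sketch, with the weights left to the training configuration:

```python
import torch

def style_penalties(action, last_action, dof_pos, default_dof_pos,
                    base_lin_vel, torques, projected_gravity, w):
    """Sum of the quadratic style penalties above; `w` is a dict of positive weights."""
    return -(
        w["smooth"] * torch.sum(torch.square(action - last_action), dim=-1)
        + w["pose"] * torch.sum(torch.square(dof_pos - default_dof_pos), dim=-1)
        + w["vert"] * torch.square(base_lin_vel[:, 2])
        + w["torque"] * torch.sum(torch.square(torques), dim=-1)
        + w["orient"] * torch.sum(torch.square(projected_gravity[:, :2]), dim=-1)
    )
```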
4.3 The Jump Logic
Jumping is handled via a Finite State Machine (FSM) inside the reward function. When the "jump" command is active:
State 1: Preparation (Crouch)
Penalties are relaxed by 90% to allow the robot to crouch.
State 2: Takeoff (Peak)
$$r_{\text{takeoff}} = w_{\text{peak}} \,\mathbb{1}\!\left[h \geq h^{\text{jump}}\right] + w_{\text{up}} \max(v_z, 0)$$

where:
- $w_{\text{peak}}$ (massive reward for reaching the target height)
- $w_{\text{up}}$ (reward for upward velocity)
- $h^{\text{jump}}$ is the commanded jump height
- $\mathbb{1}\!\left[h \geq h^{\text{jump}}\right]$ is an indicator function (1 if $h \geq h^{\text{jump}}$, else 0)
State 3: Landing
$$r_{\text{land}} = -w_{\text{impact}} \, v_z^2 - w_{\text{tilt}} \,\lVert \mathbf{g}_{xy} \rVert^2$$

where $w_{\text{impact}}$ weights the vertical impact velocity and $w_{\text{tilt}}$ weights the deviation from an upright orientation. This rewards soft landings and penalizes tipping over.
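A sketch of how the phase-dependent terms might be selected per environment; the integer phase encoding and the way the FSM transitions between phases are assumptions:

```python
import torch

def jump_reward(phase, base_height, v_z, jump_height_cmd, projected_gravity, w):
    """Select the jump reward by FSM phase (0 = crouch, 1 = takeoff, 2 = landing)."""
    takeoff = (
        w["peak"] * (base_height >= jump_height_cmd).float()  # indicator on target height
        + w["up"] * torch.clamp(v_z, min=0.0)                 # reward upward velocity
    )
    landing = (
        -w["impact"] * torch.square(v_z)
        - w["tilt"] * torch.sum(torch.square(projected_gravity[:, :2]), dim=-1)
    )
    reward = torch.zeros_like(base_height)   # crouch phase: no bonus, relaxed penalties elsewhere
    reward = torch.where(phase == 1, takeoff, reward)
    reward = torch.where(phase == 2, landing, reward)
    return reward
```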
Total Reward
The final reward is the sum of all active reward components at each timestep.
5. Training with PPO
I trained the policy using Proximal Policy Optimization (PPO) provided by the RSL-RL library. The actor and critic networks are simple Multi-Layer Perceptrons (MLP) with ELU activations:
Input (48) -> 512 -> 256 -> 128 -> Output (12)
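For reference, the actor half of this architecture written out in PyTorch; rsl_rl builds an equivalent network internally from its hidden-dimension and activation settings:

```python
import torch.nn as nn

actor = nn.Sequential(
    nn.Linear(48, 512), nn.ELU(),
    nn.Linear(512, 256), nn.ELU(),
    nn.Linear(256, 128), nn.ELU(),
    nn.Linear(128, 12),   # 12 target joint positions
)
```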
5.1 PPO Objective Function
PPO optimizes a clipped surrogate objective to prevent destructively large policy updates:
Clipped Surrogate Loss
$$L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]$$

where:
- $r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$ is the probability ratio
- $\hat{A}_t$ is the estimated advantage (see GAE below)
- $\epsilon$ is the clipping parameter
Value Function Loss
$$L^{\text{VF}}(\phi) = \hat{\mathbb{E}}_t\!\left[\left(V_\phi(s_t) - \hat{R}_t\right)^2\right]$$

where $V_\phi(s_t)$ is the critic's value estimate and $\hat{R}_t$ is the empirical return.
Entropy Bonus
$$L^{\text{ENT}} = \hat{\mathbb{E}}_t\!\left[\mathcal{H}\!\left[\pi_\theta\right]\!(s_t)\right]$$

where the term is weighted by the entropy coefficient $c_2$, encouraging exploration.
Total Loss
$$L(\theta, \phi) = -L^{\text{CLIP}}(\theta) + c_1\, L^{\text{VF}}(\phi) - c_2\, L^{\text{ENT}}(\theta)$$

where $c_1$ is the value function coefficient.
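Putting the three terms together, a single update step looks roughly like the following (the numeric defaults for $\epsilon$, $c_1$, $c_2$ are illustrative, and rsl_rl's PPO implementation handles this internally):

```python
import torch

def ppo_loss(log_prob, old_log_prob, advantage, value, returns,
             entropy, clip_eps=0.2, c1=1.0, c2=0.01):
    """Clipped surrogate + value + entropy loss, mirroring the equations above."""
    ratio = torch.exp(log_prob - old_log_prob)                       # r_t(theta)
    surr1 = ratio * advantage
    surr2 = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    policy_loss = -torch.min(surr1, surr2).mean()                    # -L^CLIP
    value_loss = torch.square(value - returns).mean()                # L^VF
    entropy_bonus = entropy.mean()                                   # L^ENT
    return policy_loss + c1 * value_loss - c2 * entropy_bonus
```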
5.2 Generalized Advantage Estimation (GAE)
Advantages are computed using GAE-$\lambda$ for a bias-variance tradeoff:

$$\hat{A}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^l \,\delta_{t+l}$$

where the TD-error is:

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

Parameters:
- $\gamma$ (discount factor)
- $\lambda$ (GAE parameter)
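The recursion is easiest to see in code; a sketch of the backward pass over a rollout of shape (horizon, num_envs):

```python
import torch

def compute_gae(rewards, values, dones, last_value, gamma, lam):
    """GAE-lambda advantages and returns via a backward recursion."""
    advantages = torch.zeros_like(rewards)
    next_value, next_adv = last_value, 0.0
    for t in reversed(range(rewards.shape[0])):
        not_done = 1.0 - dones[t].float()
        delta = rewards[t] + gamma * next_value * not_done - values[t]   # TD-error
        next_adv = delta + gamma * lam * not_done * next_adv
        advantages[t] = next_adv
        next_value = values[t]
    returns = advantages + values   # empirical returns used by the value loss
    return advantages, returns
```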
5.3 Training Hyperparameters
| Parameter | Value | Description |
|---|---|---|
| Learning Rate | | Adam optimizer |
| Num Epochs | 5 | PPO updates per batch |
| Mini-batch Size | 204,800 | 4096 envs × 50 steps |
| Clip Range | 0.2 | PPO clipping |
| Max Grad Norm | 1.0 | Gradient clipping |
| Environments | 4096 | Parallel simulations |
| Horizon | 50 | Steps before update |
| Iterations | 1500 | Total training iterations |
Training Duration: Approximately 50 minutes on an RTX 4050.
The massive batch size creates a very stable gradient estimate, allowing for aggressive learning rates without destabilizing the policy.
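For completeness, here is roughly how these settings feed into rsl_rl's OnPolicyRunner. The dictionary schema varies between rsl_rl versions, so treat the keys below as a sketch rather than the exact project configuration; QuadrupedEnv is the custom environment from Section 1.1, and its module path is assumed.

```python
from rsl_rl.runners import OnPolicyRunner
from quadruped_env import QuadrupedEnv   # module path assumed

# Illustrative config; keys follow a common rsl_rl layout and only reuse
# values already listed in the table above.
train_cfg = {
    "algorithm": {
        "clip_param": 0.2,
        "num_learning_epochs": 5,
        "max_grad_norm": 1.0,
    },
    "policy": {
        "actor_hidden_dims": [512, 256, 128],
        "critic_hidden_dims": [512, 256, 128],
        "activation": "elu",
    },
    "runner": {
        "num_steps_per_env": 50,
        "max_iterations": 1500,
    },
}

env = QuadrupedEnv(num_envs=4096)   # constructor signature assumed
runner = OnPolicyRunner(env, train_cfg, log_dir="logs/go2", device="cuda")
runner.learn(num_learning_iterations=1500)
```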
6. Evaluation & Teleoperation
After training, I validated the policy using a custom teleoperation script. This maps keyboard inputs to the command vector, allowing me to drive the robot around the simulation in real-time.
| Key | Command | Effect |
|---|---|---|
| W / S | $\pm v_x^{\text{cmd}}$ | Move forward / backward |
| A / D | $\pm v_y^{\text{cmd}}$ | Strafe left / right |
| Q / E | $\pm \omega_z^{\text{cmd}}$ | Turn left / right |
| J | Jump | Trigger jump maneuver |
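A sketch of how such a key mapping might be wired into the command vector; the velocity magnitudes and command indices are illustrative.

```python
# Maps a key to (command index, value); magnitudes are illustrative.
KEY_TO_COMMAND = {
    "w": (0, +1.0), "s": (0, -1.0),   # forward / backward velocity (m/s)
    "a": (1, +0.5), "d": (1, -0.5),   # lateral (strafe) velocity (m/s)
    "q": (2, +1.0), "e": (2, -1.0),   # yaw rate (rad/s)
}

def update_commands(commands, pressed_keys):
    """Write the active key bindings into the command tensor each frame."""
    commands[:, :3] = 0.0                      # commands decay to zero when no key is held
    for key in pressed_keys:
        if key in KEY_TO_COMMAND:
            idx, value = KEY_TO_COMMAND[key]
            commands[:, idx] = value
        elif key == "j":
            commands[:, 4] = 1.0               # raise the jump flag (index assumed)
    return commands
```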
The policy proved robust to external disturbances, recovering quickly from pushes and maintaining balance even when transitioning rapidly between forward running and lateral strafing.
7. Conclusion
This project successfully demonstrated a full locomotion pipeline for the Go2: simulation in Genesis, reward design, PPO training with RSL-RL, and keyboard teleoperation of the learned policy. It was a valuable learning experience for me.