Sim-to-Real Transfer
Learning Outcomes
By the end of this chapter, you should be able to:
- Understand the reality gap and why sim-to-real transfer is challenging
- Implement domain randomization to improve model robustness
- Train reinforcement learning policies in simulation
- Deploy simulation-trained models to real robots
- Evaluate and debug sim-to-real transfer failures
The Physics (Why)
The reality gap is the difference between simulated and real-world physics. Simulators make approximations:
- Friction: Real friction is complex (static vs. dynamic, surface variations)
- Contact dynamics: Real collisions involve deformation, vibration
- Actuator dynamics: Real motors have delays, backlash, temperature effects
- Sensor noise: Real sensors have noise patterns simulators can't perfectly model
A policy trained in simulation may fail catastrophically on a real robot because it learned to exploit simulator quirks rather than robust physical principles.
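A useful first diagnostic is to replay a sequence of actions logged on the real robot inside the simulator and measure how quickly the simulated state drifts from the recorded one. A minimal sketch, assuming you supply small adapter functions for your simulator and state logs:
import numpy as np

def measure_reality_gap(reset_sim, step_sim, real_states, real_actions):
    """Replay logged real-robot actions in simulation and track state divergence.

    reset_sim(state) sets the simulator to a recorded state and
    step_sim(action) returns the next simulated state (assumed adapters).
    """
    reset_sim(real_states[0])
    errors = []
    for real_next, action in zip(real_states[1:], real_actions):
        sim_next = step_sim(action)
        # Per-step error between the simulated and recorded states
        errors.append(np.linalg.norm(np.asarray(sim_next) - np.asarray(real_next)))
    return np.array(errors)  # rapid growth points to unmodeled dynamics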
The Analogy (Mental Model)
Think of sim-to-real transfer like learning to drive in a video game, then driving a real car:
| Video Game | Real World |
|---|---|
| Perfect traction | Ice, gravel, wet roads |
| Instant response | Steering delay, brake fade |
| No consequences | Accidents cause damage |
| Reset on failure | No respawns |
To succeed, you need to:
- Train with variety: Different weather, road conditions
- Add noise: Imperfect controls, sensor lag
- Be conservative: Don't rely on perfect physics
The Visualization (Sim-to-Real Pipeline)
graph TB
subgraph "Simulation Training"
A[Isaac Sim] --> B[Domain Randomization]
B --> C[RL Training]
C --> D[Policy Network]
end
subgraph "Transfer Techniques"
D --> E[System Identification]
D --> F[Fine-tuning]
D --> G[Residual Learning]
end
subgraph "Real Robot"
E --> H[Calibrated Sim]
F --> I[Real Data]
G --> J[Hybrid Policy]
H --> K[Deployment]
I --> K
J --> K
end
K --> L[Performance Evaluation]
L -->|Gap detected| B
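The System Identification branch in the diagram means fitting simulator parameters so that simulated rollouts reproduce logged real trajectories, yielding a calibrated simulator to train or fine-tune in. Below is a minimal sketch using a coarse grid search; simulate_trajectory is an assumed helper that replays the recorded actions under the candidate parameters.
import itertools
import numpy as np

def identify_sim_parameters(simulate_trajectory, real_trajectory, real_actions):
    """Grid-search friction and mass scale that best reproduce a real trajectory."""
    best_params, best_error = None, np.inf
    for friction, mass_scale in itertools.product(
        np.linspace(0.4, 1.6, 7), np.linspace(0.8, 1.2, 5)
    ):
        sim_trajectory = simulate_trajectory(real_actions, friction, mass_scale)
        # Mean per-step state error over the rollout
        error = np.mean(np.linalg.norm(sim_trajectory - real_trajectory, axis=-1))
        if error < best_error:
            best_params, best_error = (friction, mass_scale), error
    return best_params, best_error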
The Code (Implementation)
Domain Randomization
#!/usr/bin/env python3
"""
domain_randomization.py - Randomize simulation parameters for robust training.
"""
import numpy as np
from dataclasses import dataclass
from typing import Dict, Tuple
import random
@dataclass
class RandomizationConfig:
"""Configuration for domain randomization."""
# Physics parameters
friction_range: Tuple[float, float] = (0.5, 1.5)
mass_scale_range: Tuple[float, float] = (0.8, 1.2)
motor_strength_range: Tuple[float, float] = (0.9, 1.1)
# Sensor noise
observation_noise_std: float = 0.01
action_delay_range: Tuple[int, int] = (0, 3) # frames
# Visual randomization
lighting_range: Tuple[float, float] = (0.5, 1.5)
texture_randomization: bool = True
class DomainRandomizer:
"""Apply domain randomization to simulation."""
def __init__(self, config: RandomizationConfig):
self.config = config
self.current_params = {}
def randomize_physics(self) -> Dict[str, float]:
"""Randomize physics parameters."""
params = {
'friction': np.random.uniform(*self.config.friction_range),
'mass_scale': np.random.uniform(*self.config.mass_scale_range),
'motor_strength': np.random.uniform(*self.config.motor_strength_range),
}
self.current_params.update(params)
return params
def add_observation_noise(self, obs: np.ndarray) -> np.ndarray:
"""Add noise to observations."""
noise = np.random.normal(0, self.config.observation_noise_std, obs.shape)
return obs + noise
    def apply_action_delay(self, action: np.ndarray, action_buffer: list) -> np.ndarray:
        """Apply a random action delay by buffering recent actions."""
        # The delay is resampled every step (a simplification; sampling once per
        # episode is also common). np.random.randint's upper bound is exclusive,
        # so (0, 3) yields delays of 0-2 frames.
        delay = np.random.randint(*self.config.action_delay_range)
        action_buffer.append(action)
        if len(action_buffer) > delay:
            return action_buffer.pop(0)
        # Not enough history yet: output a zero action until the buffer fills
        return np.zeros_like(action)
def randomize_visual(self) -> Dict[str, float]:
"""Randomize visual parameters."""
params = {
'lighting_intensity': np.random.uniform(*self.config.lighting_range),
'texture_id': random.randint(0, 100) if self.config.texture_randomization else 0,
}
self.current_params.update(params)
return params
class RobustTrainingEnv:
"""Training environment with domain randomization."""
def __init__(self, base_env, randomizer: DomainRandomizer):
self.base_env = base_env
self.randomizer = randomizer
self.action_buffer = []
self.episode_count = 0
def reset(self):
"""Reset environment with new randomization."""
self.episode_count += 1
self.action_buffer = []
# Randomize at start of each episode
physics_params = self.randomizer.randomize_physics()
visual_params = self.randomizer.randomize_visual()
# Apply to simulation (implementation depends on simulator)
self.apply_randomization(physics_params, visual_params)
obs = self.base_env.reset()
return self.randomizer.add_observation_noise(obs)
def step(self, action):
"""Step with randomization applied."""
# Apply action delay
delayed_action = self.randomizer.apply_action_delay(
action, self.action_buffer
)
obs, reward, done, info = self.base_env.step(delayed_action)
# Add observation noise
noisy_obs = self.randomizer.add_observation_noise(obs)
return noisy_obs, reward, done, info
def apply_randomization(self, physics_params, visual_params):
"""Apply randomization to simulator."""
# Implementation depends on specific simulator
pass
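To put the two classes together, a thin helper can wrap whatever simulator environment you are using; base_env is assumed to expose a Gym-style reset/step interface. Starting with narrow ranges and widening them as training progresses is a common curriculum.
def make_randomized_env(base_env):
    """Wrap a Gym-style simulator environment with per-episode randomization."""
    config = RandomizationConfig(
        friction_range=(0.4, 1.6),      # widen these as the policy improves
        observation_noise_std=0.02,
    )
    return RobustTrainingEnv(base_env, DomainRandomizer(config))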
Reinforcement Learning Training
#!/usr/bin/env python3
"""
rl_training.py - Train walking policy with PPO.
"""
import numpy as np
import torch
import torch.nn as nn
from typing import Tuple
class WalkingPolicy(nn.Module):
"""Neural network policy for humanoid walking."""
def __init__(self, obs_dim: int, action_dim: int, hidden_dim: int = 256):
super().__init__()
# Actor network (policy)
self.actor = nn.Sequential(
nn.Linear(obs_dim, hidden_dim),
nn.ReLU(),
nn.Linear(hidden_dim, hidden_dim),
nn.ReLU(),
nn.Linear(hidden_dim, action_dim),
nn.Tanh() # Actions in [-1, 1]
)
# Critic network (value function)
self.critic = nn.Sequential(
nn.Linear(obs_dim, hidden_dim),
nn.ReLU(),
nn.Linear(hidden_dim, hidden_dim),
nn.ReLU(),
nn.Linear(hidden_dim, 1)
)
# Log standard deviation for action distribution
self.log_std = nn.Parameter(torch.zeros(action_dim))
def forward(self, obs: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
"""Get action mean and value estimate."""
action_mean = self.actor(obs)
value = self.critic(obs)
return action_mean, value
    def get_action(self, obs: torch.Tensor, deterministic: bool = False):
        """Sample an action from the policy.

        Returns only the mean action when deterministic (used at deployment);
        otherwise returns (action, log_prob) for the training update.
        """
        action_mean, _ = self.forward(obs)
        if deterministic:
            return action_mean
        std = torch.exp(self.log_std)
        dist = torch.distributions.Normal(action_mean, std)
        # Sampling around the tanh mean can leave [-1, 1]; clip the action
        # before sending it to the environment or a safety layer.
        action = dist.sample()
        return action, dist.log_prob(action).sum(-1)
class WalkingReward:
"""Reward function for humanoid walking."""
def __init__(self):
# Reward weights
self.forward_weight = 1.0
self.alive_bonus = 0.1
self.energy_penalty = 0.01
self.stability_weight = 0.5
def compute(
self,
velocity: np.ndarray,
joint_torques: np.ndarray,
orientation: np.ndarray,
is_fallen: bool
) -> float:
"""Compute reward for current state."""
if is_fallen:
return -10.0 # Large penalty for falling
# Forward velocity reward
forward_reward = self.forward_weight * velocity[0]
# Alive bonus
alive_reward = self.alive_bonus
# Energy penalty (encourage efficient movement)
energy_penalty = self.energy_penalty * np.sum(joint_torques ** 2)
# Stability reward (upright orientation)
# orientation is roll, pitch, yaw
stability_reward = self.stability_weight * (
1.0 - abs(orientation[0]) - abs(orientation[1])
)
total_reward = forward_reward + alive_reward - energy_penalty + stability_reward
return total_reward
# Training loop (simplified)
def train_walking_policy(env, policy, num_episodes=1000):
    """Train walking policy with PPO (rollout collection shown; update sketched below)."""
    optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)
    reward_fn = WalkingReward()
    for episode in range(num_episodes):
        obs = env.reset()
        episode_reward = 0
        observations, actions, log_probs, rewards = [], [], [], []
        while True:
            obs_tensor = torch.FloatTensor(obs).unsqueeze(0)
            action, log_prob = policy.get_action(obs_tensor)
            next_obs, _, done, info = env.step(action.numpy().flatten())
            # Compute custom reward
            reward = reward_fn.compute(
                velocity=info.get('velocity', np.zeros(3)),
                joint_torques=info.get('torques', np.zeros(12)),
                orientation=info.get('orientation', np.zeros(3)),
                is_fallen=info.get('fallen', False)
            )
            episode_reward += reward
            # Store the transition for the policy update
            observations.append(obs)
            actions.append(action)
            log_probs.append(log_prob)
            rewards.append(reward)
            obs = next_obs
            if done:
                break
        # The PPO update goes here: compute advantages with the critic, then
        # take gradient steps on the clipped surrogate loss (see the sketch
        # below) using the Adam optimizer created above.
        if episode % 100 == 0:
            print(f"Episode {episode}: Reward = {episode_reward:.2f}")
Deployment to Real Robot
#!/usr/bin/env python3
"""
real_robot_deployment.py - Deploy trained policy to real robot.
"""
import rclpy
from rclpy.node import Node
from sensor_msgs.msg import JointState, Imu
from std_msgs.msg import Float64MultiArray
import torch
import numpy as np
class RealRobotController(Node):
"""Deploy simulation-trained policy to real robot."""
def __init__(self, policy_path: str):
super().__init__('real_robot_controller')
        # Load trained policy (the full module was saved with torch.save;
        # map_location keeps this working on a CPU-only onboard computer)
        self.policy = torch.load(policy_path, map_location='cpu')
        self.policy.eval()
# State buffers
self.joint_positions = np.zeros(12)
self.joint_velocities = np.zeros(12)
self.imu_orientation = np.zeros(4)
self.imu_angular_vel = np.zeros(3)
# Subscribers
self.joint_sub = self.create_subscription(
JointState, '/joint_states', self.joint_callback, 10
)
self.imu_sub = self.create_subscription(
Imu, '/imu', self.imu_callback, 10
)
# Publisher for joint commands
self.cmd_pub = self.create_publisher(
Float64MultiArray, '/joint_commands', 10
)
# Control loop at 100 Hz
self.timer = self.create_timer(0.01, self.control_loop)
# Safety limits
self.max_joint_velocity = 2.0 # rad/s
self.max_torque = 50.0 # Nm
self.get_logger().info('Real robot controller started')
def joint_callback(self, msg: JointState):
"""Update joint state from robot."""
self.joint_positions = np.array(msg.position[:12])
self.joint_velocities = np.array(msg.velocity[:12])
def imu_callback(self, msg: Imu):
"""Update IMU state from robot."""
self.imu_orientation = np.array([
msg.orientation.x, msg.orientation.y,
msg.orientation.z, msg.orientation.w
])
self.imu_angular_vel = np.array([
msg.angular_velocity.x,
msg.angular_velocity.y,
msg.angular_velocity.z
])
def build_observation(self) -> np.ndarray:
"""Build observation vector matching training."""
obs = np.concatenate([
self.joint_positions,
self.joint_velocities,
self.imu_orientation,
self.imu_angular_vel
])
return obs
def control_loop(self):
"""Main control loop."""
# Build observation
obs = self.build_observation()
obs_tensor = torch.FloatTensor(obs).unsqueeze(0)
# Get action from policy
with torch.no_grad():
action = self.policy.get_action(obs_tensor, deterministic=True)
action = action.numpy().flatten()
# Apply safety limits
action = self.apply_safety_limits(action)
# Publish command
cmd_msg = Float64MultiArray()
cmd_msg.data = action.tolist()
self.cmd_pub.publish(cmd_msg)
    def apply_safety_limits(self, action: np.ndarray) -> np.ndarray:
        """Apply safety limits to actions."""
        # Clip to the normalized range the policy was trained with
        action = np.clip(action, -1.0, 1.0)
        # Scale to actual joint limits and enforce self.max_joint_velocity and
        # self.max_torque here (depends on robot configuration)
        return action
def main(args=None):
rclpy.init(args=args)
controller = RealRobotController('walking_policy.pt')
rclpy.spin(controller)
controller.destroy_node()
rclpy.shutdown()
if __name__ == '__main__':
main()
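Before running this node on hardware, it is worth a quick offline check that the saved policy loads and accepts the observation layout build_observation produces; the dimensions below mirror that method, and the file name is the same assumption used in main.
import torch

policy = torch.load('walking_policy.pt', map_location='cpu')
policy.eval()

# 12 joint positions + 12 joint velocities + 4 quaternion + 3 angular velocity = 31
dummy_obs = torch.zeros(1, 31)
with torch.no_grad():
    action = policy.get_action(dummy_obs, deterministic=True)
print(action.shape)  # expect (1, action_dim); an error here means an observation mismatch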
The Hardware Reality (Warning)
Sim-to-Real Failures
Common failure modes when deploying to real robots:
- Actuator saturation: Real motors can't match simulated torques
- Sensor delays: Real sensors have latency simulation ignores
- Unmodeled dynamics: Cable drag, joint friction, thermal effects
- Environmental differences: Floor texture, lighting, obstacles
Always start with conservative policies and gradually increase aggressiveness.
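One concrete way to start conservative is to scale the policy's output toward zero and only ramp the scale up as each stage of the testing protocol below passes. A small sketch; the schedule values are illustrative, not a standard:
import numpy as np

class ConservativeActionScaler:
    """Scale policy actions down during early real-robot trials."""
    def __init__(self, initial_scale: float = 0.3, max_scale: float = 1.0, step: float = 0.1):
        self.scale = initial_scale
        self.max_scale = max_scale
        self.step = step

    def __call__(self, action: np.ndarray) -> np.ndarray:
        # Clip first, then attenuate (a gentler version of the same policy)
        return np.clip(action, -1.0, 1.0) * self.scale

    def promote(self):
        """Raise the scale after a successful supervised trial."""
        self.scale = min(self.scale + self.step, self.max_scale)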
Testing Protocol
Before full deployment:
- Desk test: Run policy with robot suspended
- Harness test: Robot on safety harness
- Slow motion: Run at 50% speed
- Supervised: Human ready to catch/stop
- Full speed: Only after all above pass
Debugging Sim-to-Real Gaps
| Symptom | Likely Cause | Solution |
|---|---|---|
| Falls immediately | Friction mismatch | Increase friction randomization |
| Drifts sideways | IMU calibration | Calibrate real IMU, add bias randomization |
| Jerky motion | Action delay | Add delay randomization in training |
| Overheats motors | Torque limits | Add energy penalty, reduce action scale |
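When the table above does not point to an obvious cause, a one-at-a-time parameter sweep in simulation can reveal which mismatch the trained policy is most sensitive to. A sketch, with evaluate_policy and set_sim_parameter as assumed helpers for your simulator:
import numpy as np

def sensitivity_sweep(env, policy, evaluate_policy, set_sim_parameter):
    """Vary one simulation parameter at a time and record how much returns change."""
    sweeps = {
        'friction': np.linspace(0.3, 1.8, 6),
        'mass_scale': np.linspace(0.7, 1.3, 6),
        'motor_strength': np.linspace(0.8, 1.2, 6),
    }
    sensitivity = {}
    for name, values in sweeps.items():
        returns = []
        for value in values:
            set_sim_parameter(env, name, value)
            returns.append(evaluate_policy(env, policy, num_episodes=5))
        # A large spread across the sweep means the policy is sensitive to this parameter
        sensitivity[name] = max(returns) - min(returns)
    return sensitivity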
Assessment
Recall
- What is the reality gap and why does it exist?
- Name three types of domain randomization.
- What is the purpose of adding observation noise during training?
- Why should you start with conservative policies on real robots?
Apply
- Implement domain randomization for friction and mass in a simple simulation.
- Design a reward function for a humanoid robot that encourages stable walking.
- Write a safety wrapper that limits joint velocities and torques.
Analyze
- Compare the trade-offs between domain randomization and system identification.
- Why might a policy that works well in simulation with randomization still fail on a real robot?
- Design an experiment to identify which simulation parameters most affect real-world performance.