Model-based Reinforcement Learning
Model-based reinforcement learning (MBRL) aims to improve sample efficiency by explicitly learning a model of the environment dynamics and using it for planning, policy optimization, or both. Compared to model-free RL, MBRL introduces inductive bias and structure, at the cost of model bias and potential instability. This post surveys modern MBRL methods with an emphasis on practical algorithmic designs and optimization choices that make them work in real systems.
[Papers: World Models, MuZero, EfficientZero, PETS, MBPO, TD-MPC, Dreamer]
1. Motivation: Why Model-Based RL?
In many real-world domains (robotics, autonomous driving, manipulation), environment interactions are expensive. MBRL attempts to answer:
Can we learn a sufficiently accurate world model and exploit it to reduce real-world samples?
Benefits:
- Higher data efficiency
- Explicit dynamics modeling
- Natural integration with classical control
Challenges:
- Compounding model errors
- Distribution shift when policies exploit model inaccuracies
- Optimization instability when planning over learned dynamics
2. Taxonomy of Model-Based RL
2.1 Planning with Learned Models
Learn a dynamics model: $s_{t+1} = f_\theta(s_t, a_t)$
Then use online planning (e.g., Model Predictive Control) to select actions:
- Shooting methods
- Cross-Entropy Method (CEM)
- Random sampling with trajectory scoring
Representative methods:
- PILCO
- PETS (Probabilistic Ensembles with Trajectory Sampling)
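Key idea: do not train a policy directly; re-plan at every step. (PILCO is the exception in this list: it optimizes a parametric policy through a Gaussian process dynamics model rather than re-planning online.)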
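To make the re-plan-at-every-step idea concrete, here is a minimal random-shooting MPC sketch. The 1-D `model`, the quadratic `cost`, and all hyperparameters are hypothetical stand-ins chosen for illustration; in practice `model` would be a learned $f_\theta$ and the real environment would differ from it.

```python
import numpy as np

def model(s, a):
    # Hypothetical stand-in for a learned dynamics model f_theta (1-D system).
    return s + 0.1 * a

def cost(s):
    # Hypothetical task: drive the state to zero.
    return s ** 2

def plan_random_shooting(s0, rng, horizon=10, n_candidates=256):
    # Sample candidate action sequences uniformly in [-1, 1].
    actions = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon))
    s = np.full(n_candidates, s0, dtype=float)
    total = np.zeros(n_candidates)
    # Score every candidate by rolling it out in the model.
    for t in range(horizon):
        s = model(s, actions[:, t])
        total += cost(s)
    best = np.argmin(total)
    return actions[best, 0]  # execute only the first action of the best plan

def run_mpc(s0=1.0, steps=60, seed=0):
    rng = np.random.default_rng(seed)
    s = s0
    for _ in range(steps):
        a = plan_random_shooting(s, rng)  # re-plan from the current state
        s = model(s, a)  # assumption: the "real" system equals the model here
    return s

final_state = run_mpc()
```

Only the first action of each plan is executed, so planning errors late in the horizon are corrected at the next step rather than committed to.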
2.2 Model-Based Policy Optimization
Instead of pure planning, the learned model is used to generate synthetic data for training a policy.
Typical loop:
- Collect real transitions
- Train a dynamics model
- Roll out the policy in the model (short horizon)
- Update the policy using real + imagined data
Representative methods:
- MBPO
- STEVE
Key tradeoff: rollout horizon vs. model bias.
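The loop above can be sketched on a toy problem. Everything here is an assumption made for illustration: a hypothetical 1-D linear environment, a least-squares dynamics model, a linear policy `a = -gain * s`, and a grid search over `gain` standing in for the SAC-style policy updates that MBPO actually uses.

```python
import numpy as np

rng = np.random.default_rng(0)

def real_step(s, a):
    # Hypothetical "real" environment, unknown to the agent.
    return 0.9 * s + 0.5 * a + 0.01 * rng.normal()

def collect(gain, n_steps=50):
    # Step 1: collect real transitions with the current policy plus exploration noise.
    s, data = rng.uniform(-1, 1), []
    for _ in range(n_steps):
        a = -gain * s + 0.3 * rng.normal()
        s_next = real_step(s, a)
        data.append((s, a, s_next))
        s = s_next
    return data

def fit_model(buffer):
    # Step 2: least-squares fit of s' ~ A*s + B*a.
    d = np.array(buffer)
    coef, *_ = np.linalg.lstsq(d[:, :2], d[:, 2], rcond=None)
    return coef  # array([A, B])

def imagined_cost(gain, coef, start_states, horizon=3):
    # Step 3: short rollouts in the model, branched from real states.
    s, total = start_states.copy(), 0.0
    for _ in range(horizon):
        a = -gain * s
        s = coef[0] * s + coef[1] * a
        total += np.mean(s ** 2 + 0.1 * a ** 2)
    return total

gain, buffer = 0.0, []
candidates = np.linspace(0.0, 2.0, 21)
for _ in range(5):
    buffer += collect(gain)
    coef = fit_model(buffer)
    starts = np.array([t[0] for t in buffer])
    # Step 4: policy update (grid search as a stand-in for gradient-based RL).
    gain = min(candidates, key=lambda k: imagined_cost(k, coef, starts))
```

The `horizon=3` argument is where the rollout-horizon tradeoff shows up: longer imagined rollouts give the policy more signal per real sample but lean harder on the model's accuracy.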
2.3 Latent World Models
High-dimensional observations (images) are difficult to model directly.
Latent world models:
- Learn an encoder $z_t = e(s_t)$
- Learn latent dynamics $z_{t+1} = g(z_t, a_t)$
- Train policy entirely in latent space
Representative methods:
- World Models
- Dreamer / DreamerV2 / DreamerV3
Advantages:
- Compact representations
- Improved generalization
- Scales to vision-based RL
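As a rough illustration of the encoder-plus-latent-dynamics recipe, the sketch below uses a linear PCA projection as the encoder and a least-squares fit as the latent dynamics. Both are simplifying stand-ins for the neural networks used by World Models and Dreamer, and the 20-D linear "observation" process (`A`, `B`, `C`) is a hypothetical toy. Policy training on the imagined latents is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground truth: a 2-D state seen through a 20-D linear observation map.
A = np.array([[0.9, 0.1], [0.0, 0.9]])
B = np.array([0.0, 0.5])
C = rng.normal(size=(20, 2))

def rollout_env(n_steps):
    x, obs, acts = np.zeros(2), [], []
    for _ in range(n_steps):
        a = rng.normal()
        obs.append(C @ x + 0.01 * rng.normal(size=20))
        acts.append(a)
        x = A @ x + B * a
    obs.append(C @ x + 0.01 * rng.normal(size=20))
    return np.array(obs), np.array(acts)

obs, acts = rollout_env(500)  # obs: (501, 20), acts: (500,)

# Encoder e: project observations onto the top-2 principal directions.
_, _, vt = np.linalg.svd(obs, full_matrices=False)
W = vt[:2].T  # (20, 2)

def encode(o):
    return o @ W

# Latent dynamics g: least-squares fit of z_{t+1} from [z_t, a_t].
z = encode(obs)
inputs = np.column_stack([z[:-1], acts])
M, *_ = np.linalg.lstsq(inputs, z[1:], rcond=None)  # (3, 2)

def latent_step(z_t, a_t):
    return np.concatenate([z_t, [a_t]]) @ M

# Imagine 5 steps purely in latent space, then compare with the encoded real outcome.
z_img = z[100]
for k in range(5):
    z_img = latent_step(z_img, acts[100 + k])
multi_step_err = np.linalg.norm(z_img - z[105])
one_step_rel_mse = np.mean((inputs @ M - z[1:]) ** 2) / np.mean(z[1:] ** 2)
```

The imagined rollout never touches the 20-D observations after the initial encoding, which is the property that makes latent-space policy training cheap.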
3. Handling Model Uncertainty
3.1 Ensemble Models
Train multiple models $\{f_{\theta_i}\}$ and:
- Sample a model per rollout
- Penalize uncertainty during planning
- Avoid overconfident predictions
Ensembles are one of the most effective and widely used tricks in practical MBRL.
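A minimal sketch of these ideas, assuming a bootstrap ensemble of small polynomial regressors in place of the neural network ensembles used by PETS and MBPO. The data-generating function, the feature map, the reward, and the penalty weight `beta` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def features(s, a):
    # Quadratic features of (state, action); works for scalars and arrays.
    s, a = np.asarray(s, float), np.asarray(a, float)
    return np.stack([np.ones_like(s), s, a, s * a, s ** 2, a ** 2], axis=-1)

# Hypothetical training data that only covers s, a in [-1, 1].
s = rng.uniform(-1, 1, 200)
a = rng.uniform(-1, 1, 200)
s_next = np.sin(s) + 0.1 * a + 0.05 * rng.normal(size=200)

def fit_member():
    idx = rng.integers(0, len(s), len(s))  # bootstrap resample
    w, *_ = np.linalg.lstsq(features(s[idx], a[idx]), s_next[idx], rcond=None)
    return w

ensemble = np.array([fit_member() for _ in range(8)])  # (8, 6)

def predict_all(s_q, a_q):
    # Next-state prediction of every ensemble member for one query.
    return ensemble @ features(s_q, a_q)

def disagreement(s_q, a_q):
    # Spread across members as a proxy for model uncertainty.
    return predict_all(s_q, a_q).std()

def penalized_reward(s_q, a_q, beta=1.0):
    preds = predict_all(s_q, a_q)
    reward = -(preds.mean() ** 2)  # hypothetical task: drive the state to 0
    return reward - beta * preds.std()

def sample_rollout(s0, actions):
    # Sample one member and use it for the whole rollout.
    member = ensemble[rng.integers(len(ensemble))]
    s_t, traj = s0, []
    for a_t in actions:
        s_t = float(member @ features(s_t, a_t))
        traj.append(s_t)
    return traj
```

Calling `disagreement(0.0, 0.0)` and `disagreement(3.0, 0.0)` compares a query inside the training range with one far outside it; the members agree much less on the second, which is the signal a planner can penalize.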
3.2 Short-Horizon Rollouts
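Long rollouts amplify small errors. If the model's one-step error is at most $\epsilon$ and the learned dynamics are $L$-Lipschitz in the state, the error after $k$ model steps satisfies
$$\epsilon_{t+k} \le \epsilon \sum_{i=0}^{k-1} L^i,$$
which grows linearly in $k$ for $L = 1$ and exponentially in $k$ for $L > 1$.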
MBPO argues that many short rollouts, branched from previously visited real states, outperform a few long ones.
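A small numeric check makes the amplification concrete. The sketch assumes a hypothetical scalar system $s_{t+1} = L s_t$ and a model that is off by a constant `eps` at every step; the values `L = 1.2` and `eps = 0.01` are illustrative only.

```python
def k_step_errors(L=1.2, eps=0.01, horizon=10, s0=1.0):
    s_true, s_model, errors = s0, s0, []
    for _ in range(horizon):
        s_true = L * s_true          # hypothetical true dynamics
        s_model = L * s_model + eps  # model with a constant one-step error
        errors.append(abs(s_model - s_true))
    return errors

errors = k_step_errors()

# Geometric-sum value eps * (1 + L + ... + L^(k-1)) for k = 1..10.
bounds = [0.01 * sum(1.2 ** i for i in range(k)) for k in range(1, 11)]
```

In this toy case the k-step error equals the geometric sum exactly, so a one-step error of 0.01 becomes more than twenty times larger after ten model steps.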
4. Optimization and Planning
Common planners:
- Random shooting
- CEM (iteratively refines action distributions)
- Gradient-based optimization (backpropagates through the learned model; often less stable)
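As a sketch of how CEM refines its action distribution, the planner below keeps a Gaussian over action sequences, samples candidates, and refits the Gaussian to the lowest-cost "elite" samples. The 1-D `model`, the quadratic cost, the action bounds, and all hyperparameters are hypothetical choices for illustration.

```python
import numpy as np

def model(s, a):
    # Hypothetical stand-in for a learned dynamics model f_theta (1-D system).
    return s + 0.1 * a

def sequence_cost(s0, actions):
    # actions: (n_candidates, horizon); returns the summed cost s_t^2 per candidate.
    s = np.full(actions.shape[0], s0, dtype=float)
    total = np.zeros(actions.shape[0])
    for t in range(actions.shape[1]):
        s = model(s, actions[:, t])
        total += s ** 2
    return total

def cem_plan(s0, rng, horizon=10, n_candidates=64, n_elites=8, n_iters=5):
    mean, std = np.zeros(horizon), np.ones(horizon)
    for _ in range(n_iters):
        # Sample action sequences from the current distribution and clip to bounds.
        samples = rng.normal(mean, std, size=(n_candidates, horizon))
        samples = np.clip(samples, -1.0, 1.0)
        # Keep the lowest-cost sequences and refit the distribution to them.
        elites = samples[np.argsort(sequence_cost(s0, samples))[:n_elites]]
        mean = elites.mean(axis=0)
        std = elites.std(axis=0) + 1e-3  # small floor to avoid a collapsed distribution
    return mean[0]  # first action of the refined plan

rng = np.random.default_rng(0)
first_action = cem_plan(1.0, rng)
```

Compared with plain random shooting, the same sampling budget is concentrated on progressively better regions of action space, which tends to matter more as the horizon or action dimension grows.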
Hybrid designs:
- Learned value function + MPC
- Learned policy as proposal distribution for planning
Quick Summary
| Issue | Mitigation |
|---|---|
| Compounding error | Short rollouts, ensembles |
| Model exploitation | Conservative policy updates |
| High-dimensional inputs | Latent dynamics |
| Training instability | Regularization, replay mixing |
References
- Chua et al., Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models (PETS)
- Janner et al., When to Trust Your Model: Model-Based Policy Optimization (MBPO)
- Hafner et al., Dream to Control: Learning Behaviors by Latent Imagination (Dreamer)
Changelog
2026-01-28: create the page