Model-based Reinforcement Learning

Model-based reinforcement learning (MBRL) aims to improve sample efficiency by explicitly learning a model of the environment dynamics and using it for planning, policy optimization, or both. Compared to model-free RL, MBRL introduces inductive bias and structure, at the cost of model bias and potential instability. This post surveys modern MBRL methods with an emphasis on practical algorithmic designs and optimization choices that make them work in real systems.

[Papers: World Models, MuZero, EfficientZero, PETS, MBPO, TD-MPC, Dreamer]

1. Motivation: Why Model-Based RL?

In many real-world domains (robotics, autonomous driving, manipulation), environment interactions are expensive. MBRL attempts to answer:

Can we learn a sufficiently accurate world model and exploit it to reduce real-world samples?

Benefits:

Higher data efficiency
Explicit dynamics modeling
Natural integration with classical control

Challenges:

Compounding model errors
Distribution shift when policies exploit model inaccuracies
Optimization instability when planning over learned dynamics

2. Taxonomy of Model-Based RL

2.1 Planning with Learned Models

Learn a dynamics model: $s_{t+1} = f_\theta(s_t, a_t)$

Then use online planning (e.g., Model Predictive Control) to select actions:

Shooting methods
Cross-Entropy Method (CEM)
Random sampling with trajectory scoring

Representative methods:

PILCO
PETS (Probabilistic Ensembles with Trajectory Sampling)

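Key idea: do not train a policy directly; re-plan at every step and execute only the first planned action. (PILCO is the exception in this list: it uses its learned model to optimize a parametric policy rather than re-planning online.)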
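To make the re-planning idea concrete, here is a minimal random-shooting MPC sketch. Everything in it is an assumption for illustration: `dynamics` is a hand-written 1-D point mass standing in for a learned $f_\theta$, `reward` is an arbitrary quadratic cost, and the "real" environment is simply the same function as the model.

```python
import numpy as np

def dynamics(state, action):
    # Hypothetical stand-in for a learned model f_theta: a 1-D point mass.
    # state = [position, velocity]; also works on batches (N, 2) with actions (N,).
    pos, vel = state[..., 0], state[..., 1]
    vel = vel + 0.1 * action
    pos = pos + 0.1 * vel
    return np.stack([pos, vel], axis=-1)

def reward(state, action):
    # Illustrative reward: stay near the origin, move slowly, use small actions.
    return -(state[..., 0] ** 2) - 0.1 * state[..., 1] ** 2 - 0.01 * action ** 2

def random_shooting_action(state, horizon=15, n_candidates=500, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    # Sample candidate action sequences uniformly in [-1, 1].
    actions = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon))
    states = np.repeat(state[None, :], n_candidates, axis=0)
    returns = np.zeros(n_candidates)
    # Score every candidate by rolling it out through the model.
    for t in range(horizon):
        states = dynamics(states, actions[:, t])
        returns += reward(states, actions[:, t])
    best = np.argmax(returns)
    return actions[best, 0]  # keep only the first action; re-plan next step

def run_mpc(initial_state, steps=50, seed=0):
    rng = np.random.default_rng(seed)
    state = np.asarray(initial_state, dtype=float)
    for _ in range(steps):
        action = random_shooting_action(state, rng=rng)
        state = dynamics(state, action)  # assumption: environment == model
    return state

final_state = run_mpc([1.0, 0.0])
```

Note that no policy is ever stored: all decision making happens inside `random_shooting_action`, which is why planning cost at every step is the main price of this family of methods.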

2.2 Model-Based Policy Optimization

Instead of pure planning, the learned model is used to generate synthetic data for training a policy.

Typical loop:

Collect real transitions
Train a dynamics model
Roll out the policy in the model (short horizon)
Update the policy using real + imagined data

Representative methods:

MBPO
STEVE

Key tradeoff: rollout horizon vs. model bias.
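The loop in this section can be sketched as follows. This is a simplified illustration rather than the MBPO algorithm itself: the environment is an assumed toy linear system, the model is a least-squares fit instead of a neural network ensemble, `policy` is a fixed placeholder controller, and the actual policy update (step 4) is left out, since any off-policy learner could consume the mixed batch.

```python
import numpy as np

rng = np.random.default_rng(0)
A_TRUE = np.array([[1.0, 0.1], [0.0, 1.0]])
B_TRUE = np.array([[0.0], [0.1]])

def env_step(s, a):
    # Toy "real" environment (assumed linear, for illustration only).
    return A_TRUE @ s + B_TRUE @ a

def policy(s):
    # Placeholder policy: fixed linear controller plus exploration noise.
    return np.array([-0.5 * s[0] - 0.5 * s[1]]) + 0.1 * rng.standard_normal(1)

def collect_real(n_steps):
    # Step 1: collect real transitions (s, a, s').
    s, data = np.array([1.0, 0.0]), []
    for _ in range(n_steps):
        a = policy(s)
        s_next = env_step(s, a)
        data.append((s, a, s_next))
        s = s_next
    return data

def fit_model(data):
    # Step 2: fit s' = [s, a] @ W by least squares.
    X = np.array([np.concatenate([s, a]) for s, a, _ in data])
    Y = np.array([s_next for _, _, s_next in data])
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W

def imagine(W, real_data, n_starts, horizon):
    # Step 3: short model rollouts, each branched from a real state.
    imagined = []
    for i in rng.integers(len(real_data), size=n_starts):
        s = real_data[i][0]
        for _ in range(horizon):
            a = policy(s)
            s_next = np.concatenate([s, a]) @ W
            imagined.append((s, a, s_next))
            s = s_next
    return imagined

real = collect_real(200)
W = fit_model(real)
imagined = imagine(W, real, n_starts=100, horizon=3)
batch = real + imagined  # step 4 would train the policy on this mixed batch
```

The `horizon` argument is where the tradeoff above shows up: larger values yield more synthetic data per real state but let model error accumulate over more steps.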

2.3 Latent World Models

High-dimensional observations (images) are difficult to model directly.

Latent world models:

Learn an encoder $z_t = e(s_t)$
Learn latent dynamics $z_{t+1} = g(z_t, a_t)$
Train policy entirely in latent space

Representative methods:

World Models
Dreamer / DreamerV2 / DreamerV3

Advantages:

Compact representations
Improved generalization
Scales to vision-based RL
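The data flow of a latent world model (encode once, then roll forward purely in latent space) can be sketched as below. This is not how World Models or Dreamer are implemented: the weights are random and untrained, the single-layer `tanh` maps are placeholders for learned networks, and `reward_head` is an assumed extra component that predicts reward from the latent state.

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, LATENT_DIM, ACTION_DIM = 64, 8, 2

# Randomly initialised (untrained) parameters standing in for learned networks.
W_enc = rng.standard_normal((LATENT_DIM, OBS_DIM)) / np.sqrt(OBS_DIM)
W_z = rng.standard_normal((LATENT_DIM, LATENT_DIM)) / np.sqrt(LATENT_DIM)
W_a = rng.standard_normal((LATENT_DIM, ACTION_DIM)) / np.sqrt(ACTION_DIM)
W_pi = rng.standard_normal((ACTION_DIM, LATENT_DIM)) / np.sqrt(LATENT_DIM)
w_r = rng.standard_normal(LATENT_DIM) / np.sqrt(LATENT_DIM)

def encode(obs):
    # z_t = e(s_t)
    return np.tanh(W_enc @ obs)

def latent_step(z, a):
    # z_{t+1} = g(z_t, a_t)
    return np.tanh(W_z @ z + W_a @ a)

def policy(z):
    # The policy acts on latents, never on raw observations.
    return np.tanh(W_pi @ z)

def reward_head(z):
    # Assumed component: predicts reward directly from the latent state.
    return float(w_r @ z)

def imagine(obs, horizon):
    # Encode the observation once, then roll out entirely in latent space.
    z = encode(obs)
    latents, rewards = [z], []
    for _ in range(horizon):
        a = policy(z)
        z = latent_step(z, a)
        latents.append(z)
        rewards.append(reward_head(z))
    return np.array(latents), np.array(rewards)

obs = rng.standard_normal(OBS_DIM)
latents, rewards = imagine(obs, horizon=5)
```

The point of the sketch is that the high-dimensional observation is touched only once per imagined trajectory; every later step costs a small latent-sized computation.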

3. Handling Model Uncertainty

3.1 Ensemble Models

Train multiple models $\{f_{\theta_i}\}$ and:

Sample a model per rollout
Penalize uncertainty during planning
Avoid overconfident predictions

Ensembles are one of the most effective and widely used tricks in practical MBRL.

3.2 Short-Horizon Rollouts

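Long rollouts amplify small errors. If the one-step model error is at most $\epsilon$ and the true dynamics are $L$-Lipschitz in the state, then for a fixed action sequence the error after $k$ model steps satisfies $\epsilon_{t+k} \le \epsilon \sum_{i=0}^{k-1} L^i$, which grows linearly in $k$ when $L = 1$ and exponentially when $L > 1$.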

MBPO reports that many short rollouts, each branched from a state in the real replay buffer, outperform a few long rollouts.
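A tiny numerical example makes the compounding effect visible. The two scalar systems below are made up for illustration: the "model" differs from the "true" dynamics by about 1% per step, and both are rolled out open-loop from the same start state.

```python
import numpy as np

def rollout(step_fn, s0, horizon):
    # Apply step_fn repeatedly and return all visited states.
    states = [np.asarray(s0, dtype=float)]
    for _ in range(horizon):
        states.append(step_fn(states[-1]))
    return np.array(states)

def true_step(s):
    return 1.02 * s   # assumed "real" dynamics

def model_step(s):
    return 1.03 * s   # assumed learned model, roughly 1% off per step

horizon = 30
err = np.abs(rollout(model_step, 1.0, horizon) - rollout(true_step, 1.0, horizon))
# err[k] is the gap between model and reality after k steps.
```

Here `err[1]` is 0.01, while `err[30]` is more than ten times larger, even though the per-step error never changed. Truncating rollouts to a handful of steps keeps the imagined data in the regime where `err` is still small.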

4. Optimization and Planning

Common planners:

Random shooting
CEM (iteratively refines action distributions)
Gradient-based optimization (less stable)

Hybrid designs:

Learned value function + MPC
Learned policy as proposal distribution for planning
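The CEM planner listed above can be sketched in a few lines. This is a generic version under stated assumptions: `score_fn` is a hypothetical callable that would, in practice, roll each action sequence through the learned model and return its predicted return; here it is replaced by a toy quadratic so the example runs on its own. Action bounds of [-1, 1] are also assumed.

```python
import numpy as np

def cem_plan(score_fn, horizon, n_samples=200, n_elite=20, n_iters=10, seed=0):
    rng = np.random.default_rng(seed)
    # Start from a broad Gaussian over action sequences.
    mean, std = np.zeros(horizon), np.ones(horizon)
    for _ in range(n_iters):
        samples = mean + std * rng.standard_normal((n_samples, horizon))
        samples = np.clip(samples, -1.0, 1.0)  # assumed action bounds
        scores = score_fn(samples)
        # Keep the top-scoring sequences and refit the Gaussian to them.
        elite = samples[np.argsort(scores)[-n_elite:]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean

def toy_score(action_sequences):
    # Stand-in for "predicted return under the model": best at 0.5 everywhere.
    return -np.sum((action_sequences - 0.5) ** 2, axis=1)

plan = cem_plan(toy_score, horizon=5)
```

Two of the hybrid designs fit naturally into this skeleton: a learned policy can supply the initial `mean` instead of zeros, and a learned value function can be added to the score of each sequence's final state so that a short horizon still accounts for long-term return.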

Quick Summary

Issue	Mitigation
Compounding error	Short rollouts, ensembles
Model exploitation	Conservative policy updates
High-dimensional inputs	Latent dynamics
Training instability	Regularization, replay mixing

References

Chua et al., Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models
Janner et al., When to Trust Your Model: Model-Based Policy Optimization
Hafner et al., Dream to Control: Learning Behaviors by Latent Imagination

Changelog

2026-01-28: create the page