Inverse Reinforcement Learning
Date: 2026-01-30 | Estimated Reading Time: 30 min | Author: Yuan Zhang
Imitation Learning (IL) and Inverse Reinforcement Learning (IRL) address the problem of learning behavior from expert demonstrations, especially when defining an explicit reward function is difficult.
[Papers: HLFH, GAIL]
1. Problem Setup
Given expert trajectories: \[ \tau_E = \{(s_t, a_t)\} \]
Goal:
- Learn a policy \( \pi(a \mid s) \) (IL)
- Or infer a reward function \( r(s, a) \) (IRL)
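As a concrete reference point, demonstrations of this form are often stored as flat arrays of state-action pairs. A minimal sketch; the class and field names are illustrative, not from any particular library:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class ExpertDataset:
    """Expert demonstrations tau_E, flattened into (s_t, a_t) pairs."""
    states: np.ndarray   # shape (N, state_dim)
    actions: np.ndarray  # shape (N, action_dim)

    def sample(self, batch_size: int, rng: np.random.Generator):
        """Draw a random minibatch of expert (s, a) pairs."""
        idx = rng.integers(0, len(self.states), size=batch_size)
        return self.states[idx], self.actions[idx]
```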
2. Behavioral Cloning (BC)
BC treats imitation as supervised learning: \[ \min_\theta \; \mathbb{E}_{(s,a) \sim \tau_E} \left[ \| \pi_\theta(s) - a \|^2 \right] \]
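A minimal sketch of this objective in PyTorch, assuming a deterministic policy network over continuous actions; the architecture and hyperparameters below are illustrative only:

```python
import torch
import torch.nn as nn

def bc_update(policy: nn.Module, optimizer: torch.optim.Optimizer,
              states: torch.Tensor, expert_actions: torch.Tensor) -> float:
    """One supervised step of behavioral cloning:
    minimize E_{(s,a)~tau_E} || pi_theta(s) - a ||^2."""
    pred_actions = policy(states)                                 # pi_theta(s)
    loss = ((pred_actions - expert_actions) ** 2).sum(dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage with illustrative dimensions: a small MLP policy trained on an expert batch.
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)
states, expert_actions = torch.randn(256, 4), torch.randn(256, 2)
print(bc_update(policy, optimizer, states, expert_actions))
```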
Pros
- Simple: reduces imitation to standard supervised learning
- Stable: no environment interaction is needed during training
Cons
- Covariate shift: the learned policy drifts into states not covered by the demonstrations
- Error accumulation: small per-step errors compound over long horizons
3. Inverse Reinforcement Learning
IRL assumes:
The expert is (near-)optimal under an unknown reward.
Classic IRL:
- Maximum entropy IRL
- Feature matching
Main issue:
- Reward ambiguity: many reward functions (e.g., those related by potential-based shaping) explain the same expert behavior
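To make the maximum-entropy formulation concrete, here is a minimal tabular sketch in the spirit of Ziebart et al.: a soft value-iteration backward pass yields a stochastic policy under the current reward estimate, a forward pass computes its state-visitation frequencies, and linear reward weights are updated toward matching expert feature counts. Shapes, hyperparameters, and the uniform start-state assumption are illustrative.

```python
import numpy as np

def maxent_irl(P, features, expert_svf, horizon=50, lr=0.1, iters=100):
    """Tabular MaxEnt IRL sketch.
    P:          transition tensor, shape (S, A, S)
    features:   state features phi(s), shape (S, F)
    expert_svf: expert state-visitation frequencies, shape (S,), summing to 1
    Returns weights theta such that r(s) = features @ theta."""
    S, A, _ = P.shape
    theta = np.zeros(features.shape[1])
    for _ in range(iters):
        r = features @ theta                               # current reward estimate
        # Backward pass: soft value iteration -> maximum-entropy policy.
        V = np.zeros(S)
        for _ in range(horizon):
            Q = r[:, None] + P @ V                         # shape (S, A)
            Qmax = Q.max(axis=1)
            V = Qmax + np.log(np.exp(Q - Qmax[:, None]).sum(axis=1))
        policy = np.exp(Q - V[:, None])                    # pi(a|s)
        # Forward pass: state-visitation frequencies under that policy.
        d = np.full(S, 1.0 / S)                            # uniform start states (assumption)
        svf = d.copy()
        for _ in range(horizon - 1):
            d = np.einsum("s,sa,sat->t", d, policy, P)
            svf += d
        svf /= horizon
        # Gradient: expert feature counts minus policy feature counts.
        theta += lr * features.T @ (expert_svf - svf)
    return theta
```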
4. Adversarial Imitation Learning
4.1 GAIL
GAIL learns a discriminator \( D(s, a) \) that distinguishes expert state-action pairs from those generated by the current policy.
The policy is trained to fool the discriminator, similar to the generator in a GAN.
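A minimal sketch of the adversarial update in PyTorch, assuming continuous states and actions. The labeling convention (expert = 1), the network sizes, and the −log(1 − D) surrogate reward are illustrative choices; the policy itself is updated with a separate RL algorithm (TRPO in the original paper), which is not shown here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Discriminator(nn.Module):
    """D(s, a): probability that a state-action pair comes from the expert."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1))

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))  # raw logits

def discriminator_step(disc, opt, expert_s, expert_a, policy_s, policy_a):
    """Binary cross-entropy: expert pairs labeled 1, policy pairs labeled 0."""
    expert_logits = disc(expert_s, expert_a)
    policy_logits = disc(policy_s, policy_a)
    loss = (F.binary_cross_entropy_with_logits(expert_logits, torch.ones_like(expert_logits))
            + F.binary_cross_entropy_with_logits(policy_logits, torch.zeros_like(policy_logits)))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

def imitation_reward(disc, s, a):
    """Surrogate reward for the policy (one common convention): -log(1 - D(s, a)).
    The policy maximizes this signal with any RL algorithm."""
    with torch.no_grad():
        d = torch.sigmoid(disc(s, a))
        return -torch.log(1.0 - d + 1e-8)
```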
Interpretation:
- Implicit reward learning
- Avoids explicit reward engineering
4.2 AIRL
AIRL introduces a structured reward: \[ r(s, a, s') = f_\theta(s, a) + \gamma\, h(s') - h(s) \]
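A minimal sketch of this reward structure in PyTorch; the class name, network sizes, and the logit helper (which follows the AIRL discriminator form D = exp(f) / (exp(f) + pi(a|s))) are illustrative.

```python
import torch
import torch.nn as nn

class AIRLReward(nn.Module):
    """Structured AIRL-style reward: f(s, a, s') = g(s, a) + gamma * h(s') - h(s).
    g captures the (transferable) reward, h a potential-based shaping term."""
    def __init__(self, state_dim: int, action_dim: int, gamma: float = 0.99, hidden: int = 64):
        super().__init__()
        self.gamma = gamma
        self.g = nn.Sequential(nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                               nn.Linear(hidden, 1))
        self.h = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                               nn.Linear(hidden, 1))

    def forward(self, s, a, s_next):
        return self.g(torch.cat([s, a], dim=-1)) + self.gamma * self.h(s_next) - self.h(s)

    def discriminator_logit(self, s, a, s_next, log_pi_a):
        """AIRL discriminator D = exp(f) / (exp(f) + pi(a|s)),
        expressed as a logit: f(s, a, s') - log pi(a|s)."""
        return self.forward(s, a, s_next) - log_pi_a
```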
Benefits:
- Reward transferability
- Better interpretability
5. Practical Training Strategies
- BC warm-start + GAIL fine-tuning
- Off-policy adversarial IL
- Hybrid IL + RL pipelines
6. When to Use IRL?
Good fit:
- Reward is ambiguous
- Transfer across environments is required
Poor fit:
- Dense, well-defined rewards
- Limited expert data
7. Key References
- Ho & Ermon, GAIL
- Fu et al., AIRL
- Ziebart, Maximum Entropy IRL
8. Open Problems
- Sample-efficient IL
- Multi-agent imitation
- Foundation models for IL