PPO vs PPO+RND on Minesweeper (6×6×6)
Pixel-based agent comparison on Minesweeper with invalid-action masking and shaped rewards.
- Goal: Learn to solve a 6×6 Minesweeper with 6 mines from raw pixels using PPO, and assess the impact of Random Network Distillation (RND) for intrinsic exploration.
- Result: Across matched evaluations, PPO+RND achieves higher win rates and better average rewards than PPO-only at comparable or fewer environment steps.
- Best observed: PPO+RND ~100k steps → win_rate=0.050 (avg_reward=3.17); ~1.1M steps → win_rate=0.049 (avg_reward=3.26). PPO-only ~800k steps → win_rate=0.032 (avg_reward=1.16).
- Takeaway: RND improves early sample-efficiency and maintains a lead at scale on this task.
Environment and observation
- Task: Minesweeper, board 6×6 with 6 mines.
- Observation: visual RGB frames rendered via pygame (no window), shape H×W×3, scaled to [0,255].
- Actions: Discrete, one action per cell (reveal).
- Invalid-action mask: already-revealed cells are masked from the policy logits at decision time (see the masking sketch after this list).
- Rewards (shaped): +1 for safe cell; −10 for mine; +50 (+efficiency bonus) on win; −2 for invalid action; additional small information bonuses for helpful reveals.
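Below is a minimal sketch of how such logit masking is typically implemented in PyTorch. It is illustrative only: the function name, tensor shapes, and the assumption that the policy emits one logit per cell are not taken from the repo's code.

```python
import torch
from torch.distributions import Categorical

def masked_action_distribution(logits: torch.Tensor, revealed: torch.Tensor) -> Categorical:
    """Build an action distribution with already-revealed cells masked out.

    logits:   (batch, n_cells) raw policy outputs, one entry per board cell.
    revealed: (batch, n_cells) boolean mask, True where a cell is already revealed.
    """
    # A large negative fill drives the softmax probability of invalid actions to ~0.
    masked_logits = logits.masked_fill(revealed, -1e9)
    return Categorical(logits=masked_logits)

# Example: sample only among hidden cells of a 6x6 board (36 reveal actions).
logits = torch.randn(1, 36)
revealed = torch.zeros(1, 36, dtype=torch.bool)
revealed[0, :10] = True  # pretend the first ten cells are already open
action = masked_action_distribution(logits, revealed).sample()
```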
Algorithms
- PPO (baseline): CNN encoder → policy logits + single value head; clipped objective, GAE, entropy bonus, value loss, grad clipping; invalid-action masking applied to logits.
- PPO+RND (intrinsic exploration): Adds RND module (frozen target CNN, trainable predictor CNN) on next observation; intrinsic reward ∝ prediction error. Two value heads in actor-critic (V_ext, V_int) trained with respective returns; policy advantage = sum of normalized ext+int advantages.
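The RND mechanics described above can be condensed into a short PyTorch sketch. This is not the repo's module: it uses small MLPs on a flattened observation where the actual agent uses CNN encoders on pixels, the names (`RND`, `combined_advantage`, `int_coef`) are hypothetical, and whether the intrinsic coefficient scales the reward or the normalized advantage is an implementation choice (here it scales the advantage).

```python
import torch
import torch.nn as nn

class RND(nn.Module):
    """Random Network Distillation: intrinsic reward = predictor error on a frozen random target."""

    def __init__(self, obs_dim: int, feat_dim: int = 128):
        super().__init__()
        self.target = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, feat_dim))
        self.predictor = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, feat_dim))
        for p in self.target.parameters():  # the target network is never trained
            p.requires_grad_(False)

    def intrinsic_reward(self, next_obs: torch.Tensor) -> torch.Tensor:
        """Per-sample squared prediction error; its mean also serves as the predictor's loss."""
        with torch.no_grad():
            tgt = self.target(next_obs)
        pred = self.predictor(next_obs)
        return ((pred - tgt) ** 2).mean(dim=-1)

def combined_advantage(adv_ext: torch.Tensor, adv_int: torch.Tensor, int_coef: float = 0.1) -> torch.Tensor:
    """Normalize each advantage stream, then combine extrinsic + scaled intrinsic."""
    norm = lambda a: (a - a.mean()) / (a.std() + 1e-8)
    return norm(adv_ext) + int_coef * norm(adv_int)
```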
Training setup (core)
- Board: 6×6, mines=6; `cell_size=20`.
- PPO hparams (typical): `rollout_len=128`, `epochs=4`, `minibatch_size=512`, `learning_rate=2.5e-4`, `clip_coef=0.2`, `ent_coef=0.003–0.01`, `vf_coef=0.5`, `max_grad_norm=0.5`.
- RND hparams: intrinsic coefficient 0.1→0.02 (optional decay), predictor Adam `lr=2.5e-4`.
- Determinism: seeds set for Python/NumPy/PyTorch; visual pipeline deterministic at inference.
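For reference, the settings above can be collected into a single config object. This is a minimal sketch with illustrative field names; the repo's actual argument names may differ.

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    # Board / rendering
    board_size: int = 6
    n_mines: int = 6
    cell_size: int = 20
    # PPO
    rollout_len: int = 128
    epochs: int = 4
    minibatch_size: int = 512
    learning_rate: float = 2.5e-4
    clip_coef: float = 0.2
    ent_coef: float = 0.01      # typical range 0.003-0.01
    vf_coef: float = 0.5
    max_grad_norm: float = 0.5
    # RND (unused by the PPO-only runs)
    int_coef_start: float = 0.1
    int_coef_end: float = 0.02  # optional linear decay target
    rnd_lr: float = 2.5e-4
```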
Evaluation protocol
- Each checkpoint evaluated for a fixed number of steps with the same environment settings and masks.
- Metrics: win_rate = wins / episodes; avg_reward = mean per-episode reward (see the evaluation loop sketched after this list).
- Latest settings: `eval_steps=10_000`, `seeds=2`, board 6×6×6, `cell_size=20`.
- Checkpoint folders: `ppo_only_minesweeper`, `ppo_rnd_minesweeper`.
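A minimal sketch of this evaluation loop, assuming a Gymnasium-style environment; the `info["action_mask"]` and `info["won"]` keys and the `select_action` callable are assumptions for illustration, not the repo's actual interface.

```python
import numpy as np

def evaluate(env, select_action, eval_steps: int = 10_000, seed: int = 0) -> dict:
    """Run a fixed step budget and report win_rate and avg_reward per episode."""
    obs, info = env.reset(seed=seed)
    wins, episode_rewards, ep_return = 0, [], 0.0
    for _ in range(eval_steps):
        # select_action is expected to apply the same invalid-action mask used in training.
        action = select_action(obs, info.get("action_mask"))
        obs, reward, terminated, truncated, info = env.step(action)
        ep_return += reward
        if terminated or truncated:
            wins += int(info.get("won", False))  # assumed win flag at episode end
            episode_rewards.append(ep_return)
            ep_return = 0.0
            obs, info = env.reset()
    n_episodes = max(len(episode_rewards), 1)
    return {
        "win_rate": wins / n_episodes,
        "avg_reward": float(np.mean(episode_rewards)) if episode_rewards else 0.0,
    }
```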
Results (latest evaluation)
PPO-only checkpoints
| steps | win_rate | avg_reward |
|---|---|---|
| ~400k | 0.012 | -1.10 |
| ~800k | 0.032 | 1.16 |
| ~1.0M | 0.017 | -0.55 |
PPO+RND checkpoints
| steps | win_rate | avg_reward |
|---|---|---|
| ~100k | 0.050 | 3.17 |
| ~800k | 0.043 | 2.45 |
| ~1.1M | 0.049 | 3.26 |
PPO+RND reaches win_rate ≈ 0.05 within ~100k steps, whereas PPO-only peaks at ≈ 0.032 around 800k. At ~1.1M steps, PPO+RND still leads (0.049 vs PPO-only's best of 0.032 in this run).
Analysis
- Sample-efficiency: Intrinsic signal accelerates exploration, yielding higher win rates earlier.
- Stability/convergence: RND keeps a consistent lead at higher step counts; PPO-only improves more slowly and non-monotonically (its win_rate dips again at ~1.0M steps).
- Avg_reward vs win_rate: Reward shaping is noisy; we prioritize win_rate on this sparse, combinatorial puzzle.
Qualitative behavior
- RND policies open larger zero-regions early, leveraging the mask to avoid futile clicks; PPO-only learns similar behaviors later.
- Inline rollouts and MP4/GIF clips can be exported from the notebooks for portfolio embedding.
Practical recommendations
- Portfolio demo: present the best RND checkpoint (e.g., around 100k or 1.1M steps) with a short clip, and show the comparison plot + tables with brief commentary on sample-efficiency.
- If training further: PPO-only → extend +300k–500k steps; anneal `ent_coef → 0.001` and `learning_rate → 1e-4` (see the annealing sketch after this list). PPO+RND → extend +200k–400k steps with intrinsic/entropy decay.
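A minimal annealing helper for the schedules suggested above; the function and its arguments are illustrative, not part of the repo.

```python
def linear_anneal(start: float, end: float, step: int, total_steps: int) -> float:
    """Linearly interpolate from start to end over total_steps environment steps."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + frac * (end - start)

# Example: over a 500k-step extension, anneal ent_coef 0.01 -> 0.001 and lr 2.5e-4 -> 1e-4.
step = 250_000
ent_coef = linear_anneal(0.01, 0.001, step, 500_000)        # -> 0.0055 at the midpoint
learning_rate = linear_anneal(2.5e-4, 1e-4, step, 500_000)  # -> 1.75e-4 at the midpoint
```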
Limitations and future work
- Pixel-only observations force implicit learning of game logic; structured features or hybrid inputs may boost sample-efficiency.
- Minesweeper’s stochasticity and sparse terminal rewards remain challenging; curriculum (e.g., 5×5→6×6) is promising.
- Additional baselines (Distributional DQN, IMPALA, PPO with curiosity variants) would further contextualize RND’s gains.
Appendix: key hyperparameters
- PPO: `rollout_len=128`, `epochs=4`, `minibatch_size=512`, `learning_rate=2.5e-4`, `clip_coef=0.2`, `ent_coef=0.003–0.01`, `vf_coef=0.5`, `max_grad_norm=0.5`.
- RND: predictor Adam `lr=2.5e-4`, intrinsic coefficient typically 0.1→0.02 (linear decay recommended), running mean/std normalization on intrinsic error (see the sketch after this list).
- Eval: `eval_steps=10_000`, `seeds=2`, board 6×6, mines=6, `cell_size=20`.
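A minimal sketch of the running mean/std normalization applied to the intrinsic error, following the common RND convention of dividing by the running std only (so intrinsic rewards stay non-negative); the class is illustrative, not the repo's implementation.

```python
import numpy as np

class RunningMeanStd:
    """Running mean/variance via the parallel (Chan et al.) update, batch by batch."""

    def __init__(self, eps: float = 1e-4):
        self.mean, self.var, self.count = 0.0, 1.0, eps

    def update(self, x: np.ndarray) -> None:
        batch_mean, batch_var, batch_count = x.mean(), x.var(), x.size
        delta = batch_mean - self.mean
        total = self.count + batch_count
        m2 = (self.var * self.count + batch_var * batch_count
              + delta ** 2 * self.count * batch_count / total)
        self.mean += delta * batch_count / total
        self.var = m2 / total
        self.count = total

    def normalize(self, x: np.ndarray) -> np.ndarray:
        return x / (np.sqrt(self.var) + 1e-8)

# Per rollout: rms.update(intrinsic_errors); r_int = rms.normalize(intrinsic_errors)
```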
Reproducibility
- Train (optional): PPO+RND → `experiments/ppo_rnd_minesweeper_colab.ipynb` (works locally; no Colab dependency required); PPO-only → `experiments/ppo_only_minesweeper_colab.ipynb`.
- Compare: `experiments/compare_ppo_vs_rnd_colab.ipynb` scans `ppo_only_minesweeper` and `ppo_rnd_minesweeper`, evaluates with shared settings, and plots win_rate vs steps.
- Demo videos: use the “record MP4” cells to export short rollouts for the portfolio.
Repository pointers
- Training notebooks: `experiments/ppo_rnd_minesweeper_colab.ipynb`, `experiments/ppo_only_minesweeper_colab.ipynb`.
- Comparison notebook: `experiments/compare_ppo_vs_rnd_colab.ipynb`.
- Checkpoints (user setup): `ppo_rnd_minesweeper/`, `ppo_only_minesweeper/`.