PPO vs PPO+RND on Minesweeper (6×6×6)
Pixel-based agent comparison on Minesweeper with invalid-action masking and shaped rewards.
- Goal: Learn to solve a 6×6 Minesweeper with 6 mines from raw pixels using PPO, and assess the impact of Random Network Distillation (RND) for intrinsic exploration.
- Result: Across matched evaluations, PPO+RND achieves higher win rates and better average rewards than PPO-only at comparable or fewer environment steps.
- Best observed: PPO+RND ~100k steps → win_rate=0.050 (avg_reward=3.17); ~1.1M steps → win_rate=0.049 (avg_reward=3.26). PPO-only ~800k steps → win_rate=0.032 (avg_reward=1.16).
- Takeaway: RND improves early sample-efficiency and maintains a lead at scale on this task.
Environment and observation
- Task: Minesweeper, board 6×6 with 6 mines.
- Observation: visual RGB frames rendered via pygame (no window), shape H×W×3, scaled to [0,255].
- Actions: Discrete, one action per cell (reveal).
- Invalid-action mask: already-revealed cells are masked from the policy logits at decision time (see the masking sketch after this list).
- Rewards (shaped): +1 for safe cell; −10 for mine; +50 (+efficiency bonus) on win; −2 for invalid action; additional small information bonuses for helpful reveals.
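Below is a minimal sketch of how such logit masking is typically implemented in PyTorch. It is illustrative only: the function name, tensor shapes, and the assumption that the policy emits one logit per cell are not taken from the repo's code.

```python
import torch
from torch.distributions import Categorical

def masked_action_distribution(logits: torch.Tensor, revealed: torch.Tensor) -> Categorical:
    """Build an action distribution with already-revealed cells masked out.

    logits:   (batch, n_cells) raw policy outputs, one entry per board cell.
    revealed: (batch, n_cells) boolean mask, True where a cell is already revealed.
    """
    # A large negative fill drives the softmax probability of invalid actions to ~0.
    masked_logits = logits.masked_fill(revealed, -1e9)
    return Categorical(logits=masked_logits)

# Example: sample only among hidden cells of a 6x6 board (36 reveal actions).
logits = torch.randn(1, 36)
revealed = torch.zeros(1, 36, dtype=torch.bool)
revealed[0, :10] = True  # pretend the first ten cells are already open
action = masked_action_distribution(logits, revealed).sample()
```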
Algorithms
- PPO (baseline): CNN encoder → policy logits + single value head; clipped objective, GAE, entropy bonus, value loss, grad clipping; invalid-action masking applied to logits.
- PPO+RND (intrinsic exploration): Adds RND module (frozen target CNN, trainable predictor CNN) on next observation; intrinsic reward ∝ prediction error. Two value heads in actor-critic (V_ext, V_int) trained with respective returns; policy advantage = sum of normalized ext+int advantages.
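The RND mechanics described above can be condensed into a short PyTorch sketch. This is not the repo's module: it uses small MLPs on a flattened observation where the actual agent uses CNN encoders on pixels, the names (`RND`, `combined_advantage`, `int_coef`) are hypothetical, and whether the intrinsic coefficient scales the reward or the normalized advantage is an implementation choice (here it scales the advantage).

```python
import torch
import torch.nn as nn

class RND(nn.Module):
    """Random Network Distillation: intrinsic reward = predictor error on a frozen random target."""

    def __init__(self, obs_dim: int, feat_dim: int = 128):
        super().__init__()
        self.target = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, feat_dim))
        self.predictor = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, feat_dim))
        for p in self.target.parameters():  # the target network is never trained
            p.requires_grad_(False)

    def intrinsic_reward(self, next_obs: torch.Tensor) -> torch.Tensor:
        """Per-sample squared prediction error; its mean also serves as the predictor's loss."""
        with torch.no_grad():
            tgt = self.target(next_obs)
        pred = self.predictor(next_obs)
        return ((pred - tgt) ** 2).mean(dim=-1)

def combined_advantage(adv_ext: torch.Tensor, adv_int: torch.Tensor, int_coef: float = 0.1) -> torch.Tensor:
    """Normalize each advantage stream, then combine extrinsic + scaled intrinsic."""
    norm = lambda a: (a - a.mean()) / (a.std() + 1e-8)
    return norm(adv_ext) + int_coef * norm(adv_int)
```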
Training setup (core)
- Board: 6×6, mines=6; `cell_size=20`.
- PPO hparams (typical): `rollout_len=128`, `epochs=4`, `minibatch_size=512`, `learning_rate=2.5e-4`, `clip_coef=0.2`, `ent_coef=0.003–0.01`, `vf_coef=0.5`, `max_grad_norm=0.5`.
- RND hparams: intrinsic coefficient 0.1→0.02 (optional decay), predictor Adam `lr=2.5e-4`.
- Determinism: seeds set for Python/NumPy/PyTorch; visual pipeline deterministic at inference.
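For reference, the settings above can be collected into a single config object. This is a minimal sketch with illustrative field names; the repo's actual argument names may differ.

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    # Board / rendering
    board_size: int = 6
    n_mines: int = 6
    cell_size: int = 20
    # PPO
    rollout_len: int = 128
    epochs: int = 4
    minibatch_size: int = 512
    learning_rate: float = 2.5e-4
    clip_coef: float = 0.2
    ent_coef: float = 0.01      # typical range 0.003-0.01
    vf_coef: float = 0.5
    max_grad_norm: float = 0.5
    # RND (unused by the PPO-only runs)
    int_coef_start: float = 0.1
    int_coef_end: float = 0.02  # optional linear decay target
    rnd_lr: float = 2.5e-4
```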
Evaluation protocol
- Each checkpoint evaluated for a fixed number of steps with the same environment settings and masks.
- Metrics: win_rate = wins / episodes; avg_reward = mean per-episode reward (see the evaluation loop sketched after this list).
- Latest settings: `eval_steps=10_000`, `seeds=2`, board 6×6×6, `cell_size=20`.
- Checkpoint folders: `ppo_only_minesweeper`, `ppo_rnd_minesweeper`.
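A minimal sketch of this evaluation loop, assuming a Gymnasium-style environment; the `info["action_mask"]` and `info["won"]` keys and the `select_action` callable are assumptions for illustration, not the repo's actual interface.

```python
import numpy as np

def evaluate(env, select_action, eval_steps: int = 10_000, seed: int = 0) -> dict:
    """Run a fixed step budget and report win_rate and avg_reward per episode."""
    obs, info = env.reset(seed=seed)
    wins, episode_rewards, ep_return = 0, [], 0.0
    for _ in range(eval_steps):
        # select_action is expected to apply the same invalid-action mask used in training.
        action = select_action(obs, info.get("action_mask"))
        obs, reward, terminated, truncated, info = env.step(action)
        ep_return += reward
        if terminated or truncated:
            wins += int(info.get("won", False))  # assumed win flag at episode end
            episode_rewards.append(ep_return)
            ep_return = 0.0
            obs, info = env.reset()
    n_episodes = max(len(episode_rewards), 1)
    return {
        "win_rate": wins / n_episodes,
        "avg_reward": float(np.mean(episode_rewards)) if episode_rewards else 0.0,
    }
```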
Results (latest evaluation)
PPO-only checkpoints
| steps | win_rate | avg_reward |
|---|---|---|
| ~400k | 0.012 | -1.10 |
| ~800k | 0.032 | 1.16 |
| ~1.0M | 0.017 | -0.55 |
PPO+RND checkpoints
| steps | win_rate | avg_reward |
|---|---|---|
| ~100k | 0.050 | 3.17 |
| ~800k | 0.043 | 2.45 |
| ~1.1M | 0.049 | 3.26 |
PPO+RND reaches win_rate ≈ 0.05 within ~100k steps, whereas PPO-only peaks at ≈ 0.032 around 800k. At ~1.1M steps, PPO+RND still leads (0.049 vs PPO-only's best of 0.032 in this run).
Analysis
- Sample-efficiency: Intrinsic signal accelerates exploration, yielding higher win rates earlier.
- Stability/convergence: RND keeps a consistent lead at higher step counts; PPO-only improves more slowly and non-monotonically (its win_rate dips again at ~1.0M steps).
- Avg_reward vs win_rate: Reward shaping is noisy; we prioritize win_rate on this sparse, combinatorial puzzle.
Qualitative behavior
- RND policies open larger zero-regions early, leveraging the mask to avoid futile clicks; PPO-only learns similar behaviors later.
- Inline rollouts and MP4/GIF clips can be exported from the notebooks for portfolio embedding.
Practical recommendations
- Portfolio demo: present the best RND checkpoint (e.g., around 100k or 1.1M steps) with a short clip, and show the comparison plot + tables with brief commentary on sample-efficiency.
- If training further: PPO-only → extend +300k–500k steps; anneal `ent_coef → 0.001` and `learning_rate → 1e-4` (see the annealing sketch after this list). PPO+RND → extend +200k–400k steps with intrinsic/entropy decay.
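A minimal annealing helper for the schedules suggested above; the function and its arguments are illustrative, not part of the repo.

```python
def linear_anneal(start: float, end: float, step: int, total_steps: int) -> float:
    """Linearly interpolate from start to end over total_steps environment steps."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + frac * (end - start)

# Example: over a 500k-step extension, anneal ent_coef 0.01 -> 0.001 and lr 2.5e-4 -> 1e-4.
step = 250_000
ent_coef = linear_anneal(0.01, 0.001, step, 500_000)        # -> 0.0055 at the midpoint
learning_rate = linear_anneal(2.5e-4, 1e-4, step, 500_000)  # -> 1.75e-4 at the midpoint
```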
Limitations and future work
- Pixel-only observations force implicit learning of game logic; structured features or hybrid inputs may boost sample-efficiency.
- Minesweeper’s stochasticity and sparse terminal rewards remain challenging; curriculum (e.g., 5×5→6×6) is promising.
- Additional baselines (Distributional DQN, IMPALA, PPO with curiosity variants) would further contextualize RND’s gains.
Appendix: key hyperparameters
- PPO: `rollout_len=128`, `epochs=4`, `minibatch_size=512`, `learning_rate=2.5e-4`, `clip_coef=0.2`, `ent_coef=0.003–0.01`, `vf_coef=0.5`, `max_grad_norm=0.5`.
- RND: predictor Adam `lr=2.5e-4`, intrinsic coefficient typically 0.1→0.02 (linear decay recommended), running mean/std normalization on intrinsic error (see the sketch after this list).
- Eval: `eval_steps=10_000`, `seeds=2`, board 6×6, mines=6, `cell_size=20`.
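A minimal sketch of the running mean/std normalization applied to the intrinsic error, following the common RND convention of dividing by the running std only (so intrinsic rewards stay non-negative); the class is illustrative, not the repo's implementation.

```python
import numpy as np

class RunningMeanStd:
    """Running mean/variance via the parallel (Chan et al.) update, batch by batch."""

    def __init__(self, eps: float = 1e-4):
        self.mean, self.var, self.count = 0.0, 1.0, eps

    def update(self, x: np.ndarray) -> None:
        batch_mean, batch_var, batch_count = x.mean(), x.var(), x.size
        delta = batch_mean - self.mean
        total = self.count + batch_count
        m2 = (self.var * self.count + batch_var * batch_count
              + delta ** 2 * self.count * batch_count / total)
        self.mean += delta * batch_count / total
        self.var = m2 / total
        self.count = total

    def normalize(self, x: np.ndarray) -> np.ndarray:
        return x / (np.sqrt(self.var) + 1e-8)

# Per rollout: rms.update(intrinsic_errors); r_int = rms.normalize(intrinsic_errors)
```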
Reproducibility
- Train (optional): PPO+RND → `experiments/ppo_rnd_minesweeper_colab.ipynb` (works locally; no Colab dependency required); PPO-only → `experiments/ppo_only_minesweeper_colab.ipynb`.
- Compare: `experiments/compare_ppo_vs_rnd_colab.ipynb` scans `ppo_only_minesweeper` and `ppo_rnd_minesweeper`, evaluates with shared settings, and plots win_rate vs steps.
- Demo videos: use the “record MP4” cells to export short rollouts for the portfolio.
Repository pointers
- Training notebooks: `experiments/ppo_rnd_minesweeper_colab.ipynb`, `experiments/ppo_only_minesweeper_colab.ipynb`.
- Comparison notebook: `experiments/compare_ppo_vs_rnd_colab.ipynb`.
- Checkpoints (user setup): `ppo_rnd_minesweeper/`, `ppo_only_minesweeper/`.