PPO vs PPO+RND on Minesweeper (6×6×6)

Pixel-based agent comparison on Minesweeper with invalid-action masking and shaped rewards.

Goal: Learn to solve a 6×6 Minesweeper board with 6 mines from raw pixels using PPO, and assess the impact of Random Network Distillation (RND) for intrinsic exploration.

Result: Across matched evaluations, PPO+RND achieves higher win rates and better average rewards than PPO-only at comparable or fewer environment steps.

Best observed: PPO+RND ~100k steps → win_rate=0.050 (avg_reward=3.17); ~1.1M steps → win_rate=0.049 (avg_reward=3.26). PPO-only's best, ~800k steps → win_rate=0.032 (avg_reward=1.16).

Takeaway: RND improves early sample efficiency and retains its lead at scale on this task.

Environment and observation
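
The summary above mentions invalid-action masking: in Minesweeper the natural invalid actions are clicks on already-revealed cells, and a common way to exclude them is to mask the policy logits before sampling. Below is a minimal sketch of that pattern in PyTorch; the names (masked_categorical, valid_mask) are illustrative, not this repo's actual code.

```python
import torch
from torch.distributions import Categorical

def masked_categorical(logits: torch.Tensor, valid_mask: torch.Tensor) -> Categorical:
    """Build an action distribution that assigns ~zero probability to invalid actions.

    logits:     (batch, n_actions) raw policy-head outputs
    valid_mask: (batch, n_actions) bool, True where the action is legal
                (e.g. the cell has not been revealed yet)
    """
    # A large negative logit drives the softmax probability of invalid actions to ~0.
    masked_logits = logits.masked_fill(~valid_mask, -1e9)
    return Categorical(logits=masked_logits)

# Usage during rollout collection:
# dist = masked_categorical(policy_logits, env_valid_mask)
# action = dist.sample()            # never an invalid cell
# logp = dist.log_prob(action)      # used in the PPO probability ratio
```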

Algorithms
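
For reference, RND (Burda et al., 2018) pairs a fixed, randomly initialized target network with a trained predictor; the predictor's error is high on rarely visited observations and serves as an intrinsic reward. A minimal sketch follows, assuming PyTorch and flattened observations; layer sizes and names are illustrative, not this project's architecture.

```python
import torch
import torch.nn as nn

class RND(nn.Module):
    """Random Network Distillation: intrinsic reward = ||predictor(obs) - target(obs)||^2,
    where the target network is frozen at its random initialization."""

    def __init__(self, obs_dim: int, feat_dim: int = 64):
        super().__init__()
        def mlp():  # illustrative sizes, not this project's actual architecture
            return nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                 nn.Linear(128, feat_dim))
        self.target, self.predictor = mlp(), mlp()
        for p in self.target.parameters():
            p.requires_grad_(False)  # the target network is never trained

    def intrinsic_reward(self, obs: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            target_feat = self.target(obs)
        pred_feat = self.predictor(obs)
        # Per-sample MSE: novel observations are predicted poorly -> larger bonus.
        # Detach before adding to the reward; keep the graph for the predictor loss.
        return (pred_feat - target_feat).pow(2).mean(dim=-1)
```

The predictor is trained on this same squared error, and the (typically normalized and scaled) bonus is added to the extrinsic reward before computing PPO advantages.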

Training setup (core)

Evaluation protocol

Results (latest evaluation)

PPO-only checkpoints

| steps | win_rate | avg_reward |
|-------|----------|------------|
| ~400k | 0.012    | -1.10      |
| ~800k | 0.032    | 1.16       |
| ~1.0M | 0.017    | -0.55      |

PPO+RND checkpoints

| steps | win_rate | avg_reward |
|-------|----------|------------|
| ~100k | 0.050    | 3.17       |
| ~800k | 0.043    | 2.45       |
| ~1.1M | 0.049    | 3.26       |

PPO+RND reaches win_rate ≈ 0.050 at ~100k steps; PPO-only needs ~800k steps to reach its run-best of ≈ 0.032. At ~1.1M steps, PPO+RND still leads (0.049 vs PPO-only's best of 0.032 at ~800k in this run).

Analysis

Qualitative behavior

Practical recommendations

Limitations and future work

Appendix: key hyperparameters

Reproducibility

  1. Train (optional):
    PPO+RND: experiments/ppo_rnd_minesweeper_colab.ipynb (works locally; no Colab dependency required).
    PPO-only: experiments/ppo_only_minesweeper_colab.ipynb.
  2. Compare: experiments/compare_ppo_vs_rnd_colab.ipynb scans ppo_only_minesweeper and ppo_rnd_minesweeper, evaluates both with shared settings, and plots win_rate vs steps (a minimal sketch of the evaluation loop appears after this list).
  3. Demo videos: Use the “record MP4” cells to export short rollouts for the portfolio.
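
For orientation, here is a minimal sketch of the kind of shared-settings evaluation the comparison notebook performs, assuming a Gymnasium-style reset/step API; make_env, policy.act, and the info["win"] flag are hypothetical placeholders, not the notebook's actual interface.

```python
import numpy as np

def evaluate(policy, make_env, n_episodes: int = 1000, seed: int = 0):
    """Return (win_rate, avg_reward) over n_episodes greedy rollouts.

    make_env, policy.act, and info["win"] are placeholders for this
    repo's actual environment/policy interface.
    """
    env = make_env()
    wins, returns = 0, []
    for ep in range(n_episodes):
        obs, info = env.reset(seed=seed + ep)  # fixed seeds -> matched evaluations
        done, ep_return = False, 0.0
        while not done:
            action = policy.act(obs, deterministic=True)  # greedy at eval time
            obs, reward, terminated, truncated, info = env.step(action)
            done = terminated or truncated
            ep_return += reward
        wins += int(info.get("win", False))
        returns.append(ep_return)
    return wins / n_episodes, float(np.mean(returns))
```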

Repository pointers