My Reinforcement Learning Journey
Interactive demonstrations, research notes, and experiments as I move from theory to deployed RL systems.
Introduction
A living lab notebook for everything I'm learning about RL — from Bellman backups to curiosity-driven exploration.
I'm documenting the path from foundational algorithms to production-grade RL agents. It starts with grid worlds and value iteration, then scales to policy gradients, model-based methods, and curiosity-driven exploration. Each experiment emphasizes intuition, visualization, and reproducibility.
Learning Resources
Hands-on tools that help me ground mathematical ideas in interactive intuition:
- Bellman Equation Visualizer — animate how value iteration and Q-learning update estimates across a grid world (a value-iteration sketch follows this list).
- Policy Gradient & PPO Intuition — explore the clipped objective, KL penalties, and sampling variance with interactive plots (the clipped loss is sketched further below).
- PPO vs PPO + RND — analyze how intrinsic motivation affects exploration on a 6×6×6 Minesweeper board.
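To ground the first resource, here is a minimal tabular value-iteration sketch in NumPy showing the Bellman backup the visualizer animates. The transition tensor `P`, reward matrix `R`, and the stopping threshold `theta` are illustrative assumptions, not the visualizer's actual interface.

```python
import numpy as np

def value_iteration(P, R, gamma=0.99, theta=1e-6):
    """Tabular value iteration: sweep the Bellman optimality backup
    until the value estimates stop changing.

    P: transition probabilities, shape (S, A, S)
    R: expected immediate rewards, shape (S, A)
    """
    n_states, _, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Q(s, a) = R(s, a) + gamma * sum_s' P(s, a, s') * V(s')
        Q = R + gamma * (P @ V)
        new_V = Q.max(axis=1)
        if np.abs(new_V - V).max() < theta:
            return new_V, Q.argmax(axis=1)  # converged values and a greedy policy
        V = new_V
```

Q-learning performs the same backup from sampled transitions rather than the full expectation over next states, which is the contrast worth watching in the animation.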
Each resource links to code, notes, and follow-up experiments so the journey remains transparent and replicable.
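For the second and third resources, a hedged sketch of the two quantities those demos revolve around: PPO's clipped surrogate (with an optional KL penalty) and an RND-style intrinsic bonus. The function names, the 0.2 clip default, and the assumption that both RND networks share an output size are mine, not values taken from the demos.

```python
import torch

def ppo_clipped_loss(new_log_probs, old_log_probs, advantages,
                     clip_eps=0.2, kl_coef=0.0):
    """Clipped PPO surrogate with an optional approximate-KL penalty."""
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()
    approx_kl = (old_log_probs - new_log_probs).mean()  # crude KL estimate
    return policy_loss + kl_coef * approx_kl

def rnd_intrinsic_reward(predictor, target, obs):
    """RND bonus: prediction error against a frozen, randomly initialized
    target network; unfamiliar states are predicted poorly, so they pay more."""
    with torch.no_grad():
        target_features = target(obs)
    predictor_features = predictor(obs)
    return (predictor_features - target_features).pow(2).mean(dim=-1)
```

In the PPO vs PPO + RND comparison, the intrinsic term is typically scaled by a small coefficient and added to the extrinsic reward before advantages are computed.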
Featured Hugging Face RL Models
A rotating gallery of RL checkpoints I've published to Hugging Face — demos range from curiosity in Atari to robotic manipulation.
Course Projects (Hugging Face Deep RL)
Assignments completed during the Hugging Face Deep RL course, tuned and annotated with post-course insights.
Gridworld Navigation
A sandbox for testing intuition around value propagation, exploration, and sample efficiency.
Deep Q-Network in Gridworld
This environment drops an agent into a stochastic grid with moving goals and fixed walls. The DQN learns to balance exploration and exploitation while tracking long-horizon rewards.
- Deep Q-Network (DQN) with experience replay and target network synchronization.
- Epsilon scheduling that starts fully exploratory and decays toward focused exploitation.
- Reward shaping to encourage faster convergence without destabilizing learning (the core update step is sketched after this list).
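A minimal sketch of that update loop in PyTorch, assuming a replay buffer that already yields `(states, actions, rewards, next_states, dones)` tensors; the names and the Huber loss are illustrative choices, not a verbatim excerpt from my implementation.

```python
import torch
import torch.nn.functional as F

def dqn_update(policy_net, target_net, optimizer, batch, gamma=0.99):
    """One gradient step on a replay minibatch with a frozen target network."""
    states, actions, rewards, next_states, dones = batch

    # Q(s, a) for the actions the agent actually took.
    q_values = policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Bootstrapped TD target; the target network stays fixed between syncs.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * next_q * (1.0 - dones)

    loss = F.smooth_l1_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Target network synchronization is then just `target_net.load_state_dict(policy_net.state_dict())` every fixed number of steps.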
Implementation Details
The current DQN configuration (also written out as code after this list):
- Learning rate: 1e-3 with Adam optimizer
- Discount factor (γ): 0.99
- Epsilon schedule: 1.0 → 0.01 with 0.997 decay
- Reward shaping: -0.01 per step, +1.0 for reaching the goal
- Network: two-layer MLP (64 units each, ReLU)
- Batch size: 64 sampled from replay buffer
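The same configuration written out as a sketch; the observation and action dimensions are placeholders rather than the grid encoding I actually use.

```python
import torch
import torch.nn as nn

# Hyperparameters from the list above.
GAMMA = 0.99
LEARNING_RATE = 1e-3
BATCH_SIZE = 64
EPS_START, EPS_END, EPS_DECAY = 1.0, 0.01, 0.997
STEP_PENALTY, GOAL_REWARD = -0.01, 1.0

def make_q_network(obs_dim: int, n_actions: int) -> nn.Module:
    """Two-layer MLP with 64 ReLU units per hidden layer."""
    return nn.Sequential(
        nn.Linear(obs_dim, 64), nn.ReLU(),
        nn.Linear(64, 64), nn.ReLU(),
        nn.Linear(64, n_actions),
    )

policy_net = make_q_network(obs_dim=2, n_actions=4)  # placeholder dimensions
optimizer = torch.optim.Adam(policy_net.parameters(), lr=LEARNING_RATE)

def decay_epsilon(eps: float) -> float:
    """Multiplicative decay toward the 0.01 floor."""
    return max(EPS_END, eps * EPS_DECAY)
```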
Upcoming experiments: prioritized replay, double Q-learning, and distributional value heads for richer uncertainty estimates.
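As a preview of the double Q-learning item, the only change to the update sketched earlier is how the bootstrap target is formed; this is a sketch of the planned variation, not code that has run in the gridworld yet.

```python
import torch

def double_dqn_targets(policy_net, target_net, rewards, next_states, dones,
                       gamma=0.99):
    """Double-DQN target: the online network picks the argmax action and
    the frozen target network evaluates it, which curbs overestimation."""
    with torch.no_grad():
        best_actions = policy_net(next_states).argmax(dim=1, keepdim=True)
        next_q = target_net(next_states).gather(1, best_actions).squeeze(1)
        return rewards + gamma * next_q * (1.0 - dones)
```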
Next Experiments
A roadmap of environments and papers I'm excited to implement next.
- LunarLander-v2 — applying PPO with reward shaping and trajectory visualizers.
- Multi-Agent Soccer — extending curiosity-driven exploration to cooperative settings.
I'll continue to publish checkpoints and write-ups as results mature. Suggestions are welcome — reach out if there's an environment or paper you'd like to see replicated.