My Reinforcement Learning Journey
Interactive demonstrations, research notes, and experiments as I move from theory to deployed RL systems.
Introduction
A living lab notebook for everything I'm learning about RL — from Bellman backups to curiosity-driven exploration.
I'm documenting the path from foundational algorithms to production-grade RL agents. It starts with grid worlds and value iteration, then scales to policy gradients, model-based methods, and curiosity-driven exploration. Each experiment emphasizes intuition, visualization, and reproducibility.
Learning Resources
Hands-on tools that help me ground mathematical ideas in interactive intuition:
- Bellman Equation Visualizer — animate how value iteration and Q-learning update estimates across a grid world (a value-iteration sketch follows this list).
- Policy Gradient & PPO Intuition — explore the clipped objective, KL penalties, and sampling variance with interactive plots (the clipped loss is sketched further below).
- PPO vs PPO + RND — analyze how intrinsic motivation affects exploration on a 6×6×6 Minesweeper board.
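To ground the first resource, here is a minimal tabular value-iteration sketch in NumPy showing the Bellman backup the visualizer animates. The transition tensor `P`, reward matrix `R`, and the stopping threshold `theta` are illustrative assumptions, not the visualizer's actual interface.

```python
import numpy as np

def value_iteration(P, R, gamma=0.99, theta=1e-6):
    """Tabular value iteration: sweep the Bellman optimality backup
    until the value estimates stop changing.

    P: transition probabilities, shape (S, A, S)
    R: expected immediate rewards, shape (S, A)
    """
    n_states, _, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Q(s, a) = R(s, a) + gamma * sum_s' P(s, a, s') * V(s')
        Q = R + gamma * (P @ V)
        new_V = Q.max(axis=1)
        if np.abs(new_V - V).max() < theta:
            return new_V, Q.argmax(axis=1)  # converged values and a greedy policy
        V = new_V
```

Q-learning performs the same backup from sampled transitions rather than the full expectation over next states, which is the contrast worth watching in the animation.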
Each resource links to code, notes, and follow-up experiments so the journey remains transparent and replicable.
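For the second and third resources, a hedged sketch of the two quantities those demos revolve around: PPO's clipped surrogate (with an optional KL penalty) and an RND-style intrinsic bonus. The function names, the 0.2 clip default, and the assumption that both RND networks share an output size are mine, not values taken from the demos.

```python
import torch

def ppo_clipped_loss(new_log_probs, old_log_probs, advantages,
                     clip_eps=0.2, kl_coef=0.0):
    """Clipped PPO surrogate with an optional approximate-KL penalty."""
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()
    approx_kl = (old_log_probs - new_log_probs).mean()  # crude KL estimate
    return policy_loss + kl_coef * approx_kl

def rnd_intrinsic_reward(predictor, target, obs):
    """RND bonus: prediction error against a frozen, randomly initialized
    target network; unfamiliar states are predicted poorly, so they pay more."""
    with torch.no_grad():
        target_features = target(obs)
    predictor_features = predictor(obs)
    return (predictor_features - target_features).pow(2).mean(dim=-1)
```

In the PPO vs PPO + RND comparison, the intrinsic term is typically scaled by a small coefficient and added to the extrinsic reward before advantages are computed.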
Featured Hugging Face RL Models
A rotating gallery of RL checkpoints I've published to Hugging Face — demos range from curiosity in Atari to robotic manipulation.
Course Projects (Hugging Face Deep RL)
Assignments completed during the Hugging Face Deep RL course, tuned and annotated with post-course insights.
Gridworld Navigation
A sandbox for testing intuition around value propagation, exploration, and sample efficiency.
Deep Q-Network in Gridworld
This environment drops an agent into a stochastic grid with moving goals and fixed walls. The DQN learns to balance exploration and exploitation while tracking long-horizon rewards.
- Deep Q-Network (DQN) with experience replay and target network synchronization.
- Epsilon scheduling that starts fully exploratory and decays toward focused exploitation.
- Reward shaping to encourage faster convergence without destabilizing learning (the core update step is sketched after this list).
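A minimal sketch of that update loop in PyTorch, assuming a replay buffer that already yields `(states, actions, rewards, next_states, dones)` tensors; the names and the Huber loss are illustrative choices, not a verbatim excerpt from my implementation.

```python
import torch
import torch.nn.functional as F

def dqn_update(policy_net, target_net, optimizer, batch, gamma=0.99):
    """One gradient step on a replay minibatch with a frozen target network."""
    states, actions, rewards, next_states, dones = batch

    # Q(s, a) for the actions the agent actually took.
    q_values = policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Bootstrapped TD target; the target network stays fixed between syncs.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * next_q * (1.0 - dones)

    loss = F.smooth_l1_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Target network synchronization is then just `target_net.load_state_dict(policy_net.state_dict())` every fixed number of steps.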
Implementation Details
The current DQN configuration (also written out as code after this list):
- Learning rate: 1e-3 with Adam optimizer
- Discount factor (γ): 0.99
- Epsilon schedule: 1.0 → 0.01 with 0.997 decay
- Reward shaping: -0.01 per step, +1.0 for reaching the goal
- Network: two-layer MLP (64 units each, ReLU)
- Batch size: 64 sampled from replay buffer
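The same configuration written out as a sketch; the observation and action dimensions are placeholders rather than the grid encoding I actually use.

```python
import torch
import torch.nn as nn

# Hyperparameters from the list above.
GAMMA = 0.99
LEARNING_RATE = 1e-3
BATCH_SIZE = 64
EPS_START, EPS_END, EPS_DECAY = 1.0, 0.01, 0.997
STEP_PENALTY, GOAL_REWARD = -0.01, 1.0

def make_q_network(obs_dim: int, n_actions: int) -> nn.Module:
    """Two-layer MLP with 64 ReLU units per hidden layer."""
    return nn.Sequential(
        nn.Linear(obs_dim, 64), nn.ReLU(),
        nn.Linear(64, 64), nn.ReLU(),
        nn.Linear(64, n_actions),
    )

policy_net = make_q_network(obs_dim=2, n_actions=4)  # placeholder dimensions
optimizer = torch.optim.Adam(policy_net.parameters(), lr=LEARNING_RATE)

def decay_epsilon(eps: float) -> float:
    """Multiplicative decay toward the 0.01 floor."""
    return max(EPS_END, eps * EPS_DECAY)
```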
Upcoming experiments: prioritized replay, double Q-learning, and distributional value heads for richer uncertainty estimates.
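As a preview of the double Q-learning item, the only change to the update sketched earlier is how the bootstrap target is formed; this is a sketch of the planned variation, not code that has run in the gridworld yet.

```python
import torch

def double_dqn_targets(policy_net, target_net, rewards, next_states, dones,
                       gamma=0.99):
    """Double-DQN target: the online network picks the argmax action and
    the frozen target network evaluates it, which curbs overestimation."""
    with torch.no_grad():
        best_actions = policy_net(next_states).argmax(dim=1, keepdim=True)
        next_q = target_net(next_states).gather(1, best_actions).squeeze(1)
        return rewards + gamma * next_q * (1.0 - dones)
```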
Next Experiments
A roadmap of environments and papers I'm excited to implement next.
- LunarLander-v2 — applying PPO with reward shaping and trajectory visualizers.
- Multi-Agent Soccer — extending curiosity-driven exploration to cooperative settings.
I'll continue to publish checkpoints and write-ups as results mature. Suggestions are welcome — reach out if there's an environment or paper you'd like to see replicated.