Bellman Equation Visualizer
An interactive guide to understanding reinforcement learning's fundamental equation
What is the Bellman Equation?
The Bellman equation forms the mathematical foundation of reinforcement learning. In simple terms, it states:
The value of your current state = immediate reward + discounted value of future states
Think of it like planning a route through a city. At each intersection (state), you decide which way to go (action) based on how good the immediate path looks (reward) and where it leads you in the long run (future value).
How to Use This Visualizer
- Learn the Basics: Start with the Bellman Equation section to understand the core formula.
- Adjust Parameters: Use the sliders to see how changing values (like discount factor) affects learning.
- Run Simulations: Step through algorithms to see how agents learn optimal policies.
- Observe Results: Watch the charts to understand convergence and reward patterns.
The Bellman Equations: Simplified
Think of the Bellman equation as a decision-making recipe. It helps an agent figure out the best actions to take in different situations.
Breaking Down the Bellman Equation
Key Insight:
The value of any state can be calculated by considering the immediate reward plus the discounted value of the next state.
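Written out in the standard notation of the RL literature (for a fixed policy π, where γ is the discount factor, R the reward function, and P the transition probabilities):

```latex
V^{\pi}(s) = \sum_{a} \pi(a \mid s) \Big[ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a) \, V^{\pi}(s') \Big]
```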
How the Equation Works:
1. For each possible action from the current state, calculate the immediate reward.
2. For each possible next state, calculate its value.
3. Discount the future values (multiply by the discount factor).
4. Add the immediate reward to the discounted future value.
5. Choose the action that gives the highest total value.
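The five steps above can be sketched in Python. The tiny corridor environment, reward values, and function names here are illustrative placeholders, not the visualizer's actual implementation:

```python
def best_action(state, actions, next_state, reward, V, gamma=0.9):
    """One Bellman backup: score each action by its immediate reward
    plus the discounted value of the state it leads to."""
    scores = {}
    for a in actions:                          # step 1: each possible action
        s_next = next_state(state, a)          # step 2: the resulting next state
        future = gamma * V[s_next]             # step 3: discount its value
        scores[a] = reward(state, a) + future  # step 4: add immediate reward
    return max(scores, key=scores.get)         # step 5: pick the highest total

# Tiny example: states 0..3 on a line, +1 for reaching state 3, -0.01 per step.
V = {0: 0.0, 1: 0.5, 2: 0.8, 3: 1.0}
step = lambda s, a: max(0, min(3, s + a))           # a is -1 (left) or +1 (right)
r = lambda s, a: 1.0 if step(s, a) == 3 else -0.01
print(best_action(2, [-1, 1], step, r, V))  # prints 1: moving right reaches the goal
```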
Grid World Example: See Bellman in Action
Goal: Navigate from start (top-left) to goal (bottom-right) avoiding obstacles.
Reward: +1 for reaching the goal, -0.01 for each step (encourages efficiency).
What's Happening:
1. The agent uses value iteration (based on Bellman equations) to calculate the value of each state.
2. Brighter colors indicate higher values (better states to be in).
3. The agent always moves to the neighboring state with the highest value. The orange arrow in each cell shows the current greedy action according to the values.
Value Iteration: Learning the Optimal Path
Value Iteration is an algorithm that finds the optimal value of every state in our environment. It repeatedly applies the Bellman equation until the values converge; the converged values then determine the optimal policy, and with it the best path.
How Value Iteration Works
1. Initialize: Start with random or zero values for all states (except goal state = 1).
2. Update: For each state, apply the Bellman equation to update its value.
3. Repeat: Keep updating values until they stop changing significantly.
4. Result: The final values tell us which action is best in each state.
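The loop above can be sketched as a minimal value-iteration routine. The 1-D corridor, the -0.01 step cost, and the convergence threshold are illustrative assumptions, not the visualizer's internals:

```python
GAMMA, THETA = 0.9, 1e-6    # discount factor, convergence threshold

def value_iteration(n_states=5, goal=4):
    """Illustrative value iteration on a 1-D corridor of cells 0..n-1."""
    V = [0.0] * n_states
    V[goal] = 1.0                       # initialize: goal state = 1
    while True:
        delta = 0.0
        for s in range(n_states):
            if s == goal:
                continue                # the goal's value stays fixed
            neighbors = (max(0, s - 1), min(n_states - 1, s + 1))
            # Bellman update: -0.01 step cost plus discounted next value;
            # the +1 goal reward is baked into V[goal] = 1.
            best = max(-0.01 + GAMMA * V[s2] for s2 in neighbors)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < THETA:               # repeat until values stop changing
            return V

values = value_iteration()
# Values rise monotonically toward the goal, so greedy moves head right.
```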
Why It Works:
Value iteration works because optimal decisions depend only on the immediate reward and the values of future states (the principle of optimality). Each sweep propagates accurate values outward from the goal, so repeatedly updating state values based on neighboring states is guaranteed to converge to the optimal solution.
The Bellman Optimality Equation
Unlike the regular Bellman equation, the optimality equation focuses on finding the maximum value across all possible actions. This is what helps us find the best possible policy.
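In symbols, using the same notation as before but taking the maximum over actions instead of averaging over a policy:

```latex
V^{*}(s) = \max_{a} \Big[ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a) \, V^{*}(s') \Big]
```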
Watch the Algorithm in Action:
The grid below shows how state values update with each iteration. Colors indicate value magnitude: brighter colors mean higher values.
As you press "Step" or "Run Algorithm", watch how the values spread from the goal (bottom-right) throughout the grid.
The Convergence Graph:
This chart plots how much the values change from one iteration to the next. When the line approaches zero, the values have stabilized: our solution has converged and we've found the optimal policy.
Q-Learning: Learning from Experience
Q-Learning is a powerful reinforcement learning algorithm that learns directly from experience, without needing a model of the environment. It's called "model-free" learning.
What Makes Q-Learning Special?
Key Differences:
- No Model Required: Unlike value iteration, Q-learning doesn't need to know transition probabilities.
- Learns from Experience: The agent learns by trying actions and observing outcomes.
- Action-Values: Instead of state values, we learn Q-values for state-action pairs.
The Q-Learning Update Equation
How Q-Learning Works:
1. Choose Action: Select an action using an exploration strategy (like ε-greedy).
2. Take Action: Perform the selected action and observe the reward and next state.
3. Update Q-Value: Adjust the Q-value based on the reward and estimated future value.
4. Repeat: Continue this process until the Q-values converge.
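The loop above can be sketched as follows. The five-cell corridor environment, learning rate, exploration rate, and episode count are illustrative assumptions, not the visualizer's actual settings:

```python
import random

ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1   # learning rate, discount, exploration
N, GOAL, ACTIONS = 5, 4, (-1, 1)        # 5-cell corridor, goal at the right end

Q = {(s, a): 0.0 for s in range(N) for a in ACTIONS}

def choose(s):
    """ε-greedy: usually exploit the best-known action, sometimes explore."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

random.seed(0)
for episode in range(500):
    s = 0
    while s != GOAL:
        a = choose(s)                                  # 1. choose an action
        s2 = max(0, min(N - 1, s + a))                 # 2. take it, observe r, s'
        r = 1.0 if s2 == GOAL else -0.01
        target = r + GAMMA * max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])      # 3. update the Q-value
        s = s2                                         # 4. repeat until done

# After training, the greedy action in every non-goal state is "move right".
```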
Q-Learning in Action
In this grid world, our agent has no prior knowledge. It learns by exploring the environment and updating Q-values based on rewards and observed transitions.
What to Watch For:
The agent initially makes random moves (exploration). As it learns, it starts making better decisions. The chart below shows how total reward increases as the agent learns.