Bellman Equation Visualizer

An interactive guide to understanding reinforcement learning's fundamental equation

What is the Bellman Equation?

The Bellman equation forms the mathematical foundation of reinforcement learning. In simple terms, it states:

The value of your current state = immediate reward + discounted value of future states
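Written with the symbols introduced later in this guide, and dropping the expectation for a deterministic world, that statement becomes:

```latex
V(s) \;=\; R \;+\; \gamma \, V(s')
```

where V(s) is the value of the current state, R the immediate reward, γ the discount factor, and s' the next state.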

Think of it like planning a route through a city. At each intersection (state), you decide which way to go (action) based on how good the immediate path looks (reward) and where it leads you in the long run (future value).

How to Use This Visualizer

  1. Learn the Basics: Start with the Bellman Equation section to understand the core formula.
  2. Adjust Parameters: Use the sliders to see how changing values (like discount factor) affects learning.
  3. Run Simulations: Step through algorithms to see how agents learn optimal policies.
  4. Observe Results: Watch the charts to understand convergence and reward patterns.

The Bellman Equations: Simplified

Think of the Bellman equation as a decision-making recipe. It helps an agent figure out the best actions to take in different situations.

Start with a State (Where am I?)
Consider all possible Actions (What can I do?)
Calculate Immediate Rewards (What do I get now?)
Estimate Future Value (What might I get later?)
Choose the Best Action (What maximizes my total value?)

Breaking Down the Bellman Equation

State (s)
The current situation of the agent (like a position on a grid).
Action (a)
Something the agent can do (like moving up, down, left, or right).
Reward (R)
Immediate feedback (positive or negative) after taking an action.
Next State (s')
Where the agent ends up after taking an action.
Discount Factor (γ)
How much future rewards are valued compared to immediate ones (0-1).
Value Function
The expected total reward starting from a state.
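Putting these pieces together, the value function can be written as the expected discounted sum of all future rewards (a standard formulation; the expectation disappears in a deterministic grid world):

```latex
V(s) \;=\; \mathbb{E}\!\left[\, R_0 + \gamma R_1 + \gamma^2 R_2 + \cdots \;\middle|\; s_0 = s \,\right]
```

The discount factor γ makes this sum finite: rewards k steps in the future are weighted by γᵏ.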

Key Insight:

The value of any state can be calculated by considering the immediate reward plus the discounted value of the next state.

How the Equation Works:

For each possible action from our current state, calculate the immediate reward.

For each possible next state, calculate its value.

Discount future values (multiply by the discount factor).

Add immediate reward to the discounted future value.

Choose the action that gives the highest total value.
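The five steps above can be sketched as a single "Bellman backup". This is an illustrative sketch, not the visualizer's actual code; the tiny 1-D corridor at the bottom, its rewards, and the helper names are all assumptions:

```python
GAMMA = 0.9  # discount factor

def bellman_backup(state, actions, reward, next_state, values):
    """One Bellman backup: pick the action maximizing
    immediate reward + discounted value of the next state."""
    best_action, best_value = None, float("-inf")
    for a in actions:
        total = reward(state, a) + GAMMA * values[next_state(state, a)]
        if total > best_value:
            best_action, best_value = a, total
    return best_action, best_value

# Tiny 1-D example: states 0..3, goal at 3 (value 1.0).
values = {0: 0.0, 1: 0.0, 2: 0.0, 3: 1.0}
step = lambda s, a: min(3, max(0, s + a))            # move left/right, clamped
reward = lambda s, a: 1.0 if step(s, a) == 3 else -0.01

action, value = bellman_backup(2, [-1, +1], reward, step, values)
# from state 2, moving right (+1) reaches the goal: 1.0 + 0.9 * values[3]
```

From state 2 the backup prefers the action that reaches the goal, because its immediate reward plus discounted future value beats stepping away.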

Discount Factor (γ): 0.9
Higher values (closer to 1) make future rewards more important. Lower values focus on immediate rewards.
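To see concretely what the slider changes, compare the present value of the same reward stream under a high and a low discount factor (the numbers here are illustrative, not tied to the visualizer):

```python
def discounted_return(rewards, gamma):
    """Present value of a reward sequence: r0 + γ·r1 + γ²·r2 + ..."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

rewards = [1.0] * 10                     # reward of 1 at each of 10 steps
high = discounted_return(rewards, 0.9)   # future steps still count heavily
low = discounted_return(rewards, 0.1)    # almost only the first reward matters
print(high, low)
```

With γ = 0.9 the later rewards contribute most of the total; with γ = 0.1 the agent is effectively myopic.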

Grid World Example: See Bellman in Action

Goal: Navigate from start (top-left) to goal (bottom-right) avoiding obstacles.

Reward: +1 for reaching the goal, -0.01 for each step (encourages efficiency).


What's Happening:

1. The agent uses value iteration (based on Bellman equations) to calculate the value of each state.

2. Brighter colors indicate higher values (better states to be in).

3. The agent always moves to the neighboring state with the highest value. The orange arrow in each cell shows the current greedy action according to the values.

Value Iteration: Learning the Optimal Path

Value Iteration is an algorithm that computes the optimal value of every state in the environment. It repeatedly applies the Bellman equation to every state until the values converge, at which point the best action from each state can be read directly off the values.

How Value Iteration Works

Initialize: Start with random or zero values for all states (except goal state = 1).

Update: For each state, apply the Bellman equation to update its value.

Repeat: Keep updating values until they stop changing significantly.

Result: The final values tell us which actions are best in each state.
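The four steps above can be sketched as follows for a small deterministic grid like the one in this visualizer. The grid size, step reward, and convergence tolerance are assumptions for illustration, not the visualizer's exact settings:

```python
import itertools

N = 4                                   # 4x4 grid, goal at bottom-right
GOAL = (N - 1, N - 1)
GAMMA = 0.9
STEP_REWARD = -0.01
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def value_iteration(tol=1e-6):
    # Initialize: zero everywhere, goal fixed at 1.
    V = {s: 0.0 for s in itertools.product(range(N), repeat=2)}
    V[GOAL] = 1.0
    while True:
        delta = 0.0
        for s in V:
            if s == GOAL:
                continue                # goal value stays fixed
            best = float("-inf")
            for dr, dc in ACTIONS:
                # moves off the grid leave the agent in place
                nr = min(N - 1, max(0, s[0] + dr))
                nc = min(N - 1, max(0, s[1] + dc))
                best = max(best, STEP_REWARD + GAMMA * V[(nr, nc)])
            delta = max(delta, abs(best - V[s]))
            V[s] = best                 # Bellman update
        if delta < tol:                 # values stopped changing: converged
            return V

V = value_iteration()
# states nearer the goal end up with higher values
```

A cell adjacent to the goal converges to −0.01 + 0.9 × 1 = 0.89, and values fall off geometrically with distance, which is exactly the bright-to-dark gradient the grid visualization shows.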

Why It Works:

Value iteration exploits the fact that optimal decisions depend on the values of future states. By repeatedly updating state values based on neighboring states, we eventually find the optimal solution.

The Bellman Optimality Equation

Unlike the regular Bellman equation, which evaluates a fixed way of behaving, the optimality equation takes the maximum value across all possible actions. This is what lets us find the best possible policy rather than merely evaluate a given one.
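In standard notation (with P(s′ | s, a) the transition probability, which collapses to a single next state in this deterministic grid world), the optimality equation is:

```latex
V^{*}(s) \;=\; \max_{a} \left[ R(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^{*}(s') \right]
```

The max over actions is the only difference from the policy-evaluation form, but it is what makes the solution optimal.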

Watch the Algorithm in Action:

The grid below shows how state values update with each iteration. Colors indicate value magnitude: brighter colors mean higher values.

As you press "Step" or "Run Algorithm", watch how the values spread from the goal (bottom-right) throughout the grid.


The Convergence Graph:

This chart shows how quickly the values stabilize. When the line approaches zero, it means our solution has converged and we've found the optimal policy.

Q-Learning: Learning from Experience

Q-Learning is a powerful reinforcement learning algorithm that learns directly from experience, without needing a model of the environment. It's called "model-free" learning.

What Makes Q-Learning Special?

Key Differences:

  • No Model Required: Unlike value iteration, Q-learning doesn't need to know transition probabilities.
  • Learns from Experience: The agent learns by trying actions and observing outcomes.
  • Action-Values: Instead of state values, we learn Q-values for state-action pairs.

The Q-Learning Update Equation

Q(s,a)
The value of taking action 'a' in state 's'
Learning Rate (α)
How quickly new information overrides old information (0-1)
Exploration Rate (ε)
How often the agent tries random actions to discover new strategies (0-1)
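Using the symbols above (with r the observed reward and s′ the observed next state), the Q-learning update is commonly written as:

```latex
Q(s,a) \;\leftarrow\; Q(s,a) + \alpha \left[\, r + \gamma \max_{a'} Q(s', a') - Q(s,a) \,\right]
```

The bracketed term is the temporal-difference error: the gap between the new estimate (reward plus discounted best future Q-value) and the old one. The learning rate α controls how far each update moves toward that new estimate.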

How Q-Learning Works:

Choose Action: Select an action using an exploration strategy (like ε-greedy).

Take Action: Perform the selected action and observe reward and next state.

Update Q-Value: Adjust the Q-value based on the reward and estimated future value.

Repeat: Continue this process until the Q-values converge.

Learning Rate (α): 0.1
Higher values learn faster but may become unstable. Lower values learn more slowly but more reliably.
Exploration Rate (ε): 0.1
Higher values explore more random actions. Lower values exploit known good actions.

Q-Learning in Action

In this grid world, our agent has no prior knowledge. It learns by exploring the environment and updating Q-values based on rewards and observed transitions.

What to Watch For:

The agent initially makes random moves (exploration). As it learns, it starts making better decisions. The chart below shows how total reward increases as the agent learns.