Bellman Equation Visualizer

An interactive guide to understanding reinforcement learning's fundamental equation

What is the Bellman Equation?

The Bellman equation forms the mathematical foundation of reinforcement learning. In simple terms, it states:

The value of your current state = immediate reward + discounted value of future states
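Written with the symbols introduced later in this guide, and dropping the expectation for a deterministic world, that statement becomes:

```latex
V(s) \;=\; R \;+\; \gamma \, V(s')
```

where V(s) is the value of the current state, R the immediate reward, γ the discount factor, and s' the next state.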

Think of it like planning a route through a city. At each intersection (state), you decide which way to go (action) based on how good the immediate path looks (reward) and where it leads you in the long run (future value).

How to Use This Visualizer

  1. Learn the Basics: Start with the Bellman Equation section to understand the core formula.
  2. Adjust Parameters: Use the sliders to see how changing values (like discount factor) affects learning.
  3. Run Simulations: Step through algorithms to see how agents learn optimal policies.
  4. Observe Results: Watch the charts to understand convergence and reward patterns.

The Bellman Equations: Simplified

Think of the Bellman equation as a decision-making recipe. It helps an agent figure out the best actions to take in different situations.

Start with a State (Where am I?)
Consider all possible Actions (What can I do?)
Calculate Immediate Rewards (What do I get now?)
Estimate Future Value (What might I get later?)
Choose the Best Action (What maximizes my total value?)

Breaking Down the Bellman Equation

State (s)
The current situation of the agent (like a position on a grid).
Action (a)
Something the agent can do (like moving up, down, left, or right).
Reward (R)
Immediate feedback (positive or negative) after taking an action.
Next State (s')
Where the agent ends up after taking an action.
Discount Factor (γ)
How much future rewards are valued compared to immediate ones (0-1).
Value Function
The expected total reward starting from a state.
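Putting these pieces together, the value function can be written as the expected discounted sum of all future rewards (a standard formulation; the expectation disappears in a deterministic grid world):

```latex
V(s) \;=\; \mathbb{E}\!\left[\, R_0 + \gamma R_1 + \gamma^2 R_2 + \cdots \;\middle|\; s_0 = s \,\right]
```

The discount factor γ makes this sum finite: rewards k steps in the future are weighted by γᵏ.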

Key Insight:

The value of any state can be calculated by considering the immediate reward plus the discounted value of the next state.

How the Equation Works:

For each possible action from our current state, calculate the immediate reward.

For each possible next state, calculate its value.

Discount future values (multiply by the discount factor).

Add immediate reward to the discounted future value.

Choose the action that gives the highest total value.
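The five steps above can be sketched as a single "Bellman backup". This is an illustrative sketch, not the visualizer's actual code; the tiny 1-D corridor at the bottom, its rewards, and the helper names are all assumptions:

```python
GAMMA = 0.9  # discount factor

def bellman_backup(state, actions, reward, next_state, values):
    """One Bellman backup: pick the action maximizing
    immediate reward + discounted value of the next state."""
    best_action, best_value = None, float("-inf")
    for a in actions:
        total = reward(state, a) + GAMMA * values[next_state(state, a)]
        if total > best_value:
            best_action, best_value = a, total
    return best_action, best_value

# Tiny 1-D example: states 0..3, goal at 3 (value 1.0).
values = {0: 0.0, 1: 0.0, 2: 0.0, 3: 1.0}
step = lambda s, a: min(3, max(0, s + a))            # move left/right, clamped
reward = lambda s, a: 1.0 if step(s, a) == 3 else -0.01

action, value = bellman_backup(2, [-1, +1], reward, step, values)
# from state 2, moving right (+1) reaches the goal: 1.0 + 0.9 * values[3]
```

From state 2 the backup prefers the action that reaches the goal, because its immediate reward plus discounted future value beats stepping away.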

Discount Factor (γ): 0.9
Higher values (closer to 1) make future rewards more important. Lower values focus on immediate rewards.
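To see concretely what the slider changes, compare the present value of the same reward stream under a high and a low discount factor (the numbers here are illustrative, not tied to the visualizer):

```python
def discounted_return(rewards, gamma):
    """Present value of a reward sequence: r0 + γ·r1 + γ²·r2 + ..."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

rewards = [1.0] * 10                     # reward of 1 at each of 10 steps
high = discounted_return(rewards, 0.9)   # future steps still count heavily
low = discounted_return(rewards, 0.1)    # almost only the first reward matters
print(high, low)
```

With γ = 0.9 the later rewards contribute most of the total; with γ = 0.1 the agent is effectively myopic.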

Grid World Example: See Bellman in Action

Goal: Navigate from start (top-left) to goal (bottom-right) avoiding obstacles.

Reward: +1 for reaching the goal, -0.01 for each step (encourages efficiency).


What's Happening:

1. The agent uses value iteration (based on Bellman equations) to calculate the value of each state.

2. Brighter colors indicate higher values (better states to be in).

3. The agent always moves to the neighboring state with the highest value. The orange arrow in each cell shows the current greedy action according to the values.

Value Iteration: Learning the Optimal Path

Value Iteration is an algorithm that computes the optimal value of every state in the environment. It repeatedly applies the Bellman equation to every state until the values converge, at which point the best action from each state can be read directly off the values.

How Value Iteration Works

Initialize: Start with random or zero values for all states (except goal state = 1).

Update: For each state, apply the Bellman equation to update its value.

Repeat: Keep updating values until they stop changing significantly.

Result: The final values tell us which actions are best in each state.
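The four steps above can be sketched as follows for a small deterministic grid like the one in this visualizer. The grid size, step reward, and convergence tolerance are assumptions for illustration, not the visualizer's exact settings:

```python
import itertools

N = 4                                   # 4x4 grid, goal at bottom-right
GOAL = (N - 1, N - 1)
GAMMA = 0.9
STEP_REWARD = -0.01
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def value_iteration(tol=1e-6):
    # Initialize: zero everywhere, goal fixed at 1.
    V = {s: 0.0 for s in itertools.product(range(N), repeat=2)}
    V[GOAL] = 1.0
    while True:
        delta = 0.0
        for s in V:
            if s == GOAL:
                continue                # goal value stays fixed
            best = float("-inf")
            for dr, dc in ACTIONS:
                # moves off the grid leave the agent in place
                nr = min(N - 1, max(0, s[0] + dr))
                nc = min(N - 1, max(0, s[1] + dc))
                best = max(best, STEP_REWARD + GAMMA * V[(nr, nc)])
            delta = max(delta, abs(best - V[s]))
            V[s] = best                 # Bellman update
        if delta < tol:                 # values stopped changing: converged
            return V

V = value_iteration()
# states nearer the goal end up with higher values
```

A cell adjacent to the goal converges to −0.01 + 0.9 × 1 = 0.89, and values fall off geometrically with distance, which is exactly the bright-to-dark gradient the grid visualization shows.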

Why It Works:

Value iteration exploits the fact that optimal decisions depend on the values of future states. By repeatedly updating state values based on neighboring states, we eventually find the optimal solution.

The Bellman Optimality Equation

Unlike the regular Bellman equation, which evaluates a fixed way of behaving, the optimality equation takes the maximum value across all possible actions. This is what lets us find the best possible policy rather than merely evaluate a given one.
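In standard notation (with P(s′ | s, a) the transition probability, which collapses to a single next state in this deterministic grid world), the optimality equation is:

```latex
V^{*}(s) \;=\; \max_{a} \left[ R(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^{*}(s') \right]
```

The max over actions is the only difference from the policy-evaluation form, but it is what makes the solution optimal.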

Watch the Algorithm in Action:

The grid below shows how state values update with each iteration. Colors indicate value magnitude: brighter colors mean higher values.

As you press "Step" or "Run Algorithm", watch how the values spread from the goal (bottom-right) throughout the grid.


The Convergence Graph:

This chart shows how quickly the values stabilize. When the line approaches zero, it means our solution has converged and we've found the optimal policy.

Q-Learning: Learning from Experience

Q-Learning is a powerful reinforcement learning algorithm that learns directly from experience, without needing a model of the environment. It's called "model-free" learning.

What Makes Q-Learning Special?

Key Differences:

  • No Model Required: Unlike value iteration, Q-learning doesn't need to know transition probabilities.
  • Learns from Experience: The agent learns by trying actions and observing outcomes.
  • Action-Values: Instead of state values, we learn Q-values for state-action pairs.

The Q-Learning Update Equation

Q(s,a)
The value of taking action 'a' in state 's'
Learning Rate (α)
How quickly new information overrides old information (0-1)
Exploration Rate (ε)
How often the agent tries random actions to discover new strategies (0-1)
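Using the symbols above (with r the observed reward and s′ the observed next state), the Q-learning update is commonly written as:

```latex
Q(s,a) \;\leftarrow\; Q(s,a) + \alpha \left[\, r + \gamma \max_{a'} Q(s', a') - Q(s,a) \,\right]
```

The bracketed term is the temporal-difference error: the gap between the new estimate (reward plus discounted best future Q-value) and the old one. The learning rate α controls how far each update moves toward that new estimate.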

How Q-Learning Works:

Choose Action: Select an action using an exploration strategy (like ε-greedy).

Take Action: Perform the selected action and observe reward and next state.

Update Q-Value: Adjust the Q-value based on the reward and estimated future value.

Repeat: Continue this process until the Q-values converge.

Learning Rate (α): 0.1
Higher values learn faster but may become unstable. Lower values learn more slowly but more reliably.
Exploration Rate (ε): 0.1
Higher values explore more random actions. Lower values exploit known good actions.

Q-Learning in Action

In this grid world, our agent has no prior knowledge. It learns by exploring the environment and updating Q-values based on rewards and observed transitions.

What to Watch For:

The agent initially makes random moves (exploration). As it learns, it starts making better decisions. The chart below shows how total reward increases as the agent learns.