The Core Math of Reinforcement Learning: Returns, Values, and the Bellman Equation
Setting the Scene: Why Do Agents Look Ahead?
Imagine a child playing Mario. If they only cared about this second, they’d grab the nearest coin and maybe fall into a pit right after. But if they cared about the long-term, they’d plan: jump carefully, avoid enemies, and maybe win the level.
That’s what Reinforcement Learning (RL) formalizes: the balance between short-term gratification and long-term planning.
The math of RL revolves around three intertwined ideas:
Returns: adding up rewards into the future.
Value Functions: predicting how good states or actions are.
The Bellman Equation: a recursion that forces agents to think forward.
Let’s go step by step.
Step 1. The Return: Summing the Future
Formally, the return from time step t is:

G_t = R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + … = Σ_{k=0}^{∞} γ^k R_{t+k+1}

Where:

R_{t+k+1} is the reward received k steps into the future.
γ ∈ [0, 1] is the discount factor, controlling how much the agent cares about the future.
Interpretation:
γ = 0 → live in the moment (myopic).
γ = 1 → care equally about all future rewards.
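If it helps to see the formula as code, here is a minimal Python sketch of the finite version of this sum (the helper name discounted_return is just for illustration, not from any RL library):

```python
def discounted_return(rewards, gamma):
    """G_t = sum over k of gamma**k * rewards[k], for a finite list of rewards."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

print(discounted_return([1, 1, 1], gamma=0.0))  # 1.0 -> myopic: only the first reward counts
print(discounted_return([1, 1, 1], gamma=1.0))  # 3.0 -> all future rewards count equally
```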
Numerical Example 1: A Shortcut vs Safe Path
Suppose an agent (a delivery robot) has two paths:
Shortcut: Rewards = [+5 now, then +0, but with 20% chance of -20 in the next step].
Safe Path: Rewards = [+2, +2, +2, guaranteed].
Let γ = 0.9.
Safe Path Return:

G_safe = 2 + γ · 2 + γ^2 · 2 = 2 + 1.8 + 1.62 = 5.42

Shortcut Expected Return:

E[G_shortcut] = 5 + γ · (0.8 · 0 + 0.2 · (−20)) = 5 − 3.6 = 1.4

Interpretation: Although the shortcut looks tempting (big +5 upfront), the discounted expectation makes it worse than the safe path (1.4 < 5.42).
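Here is the same comparison in Python, a small sketch that treats the shortcut’s only uncertainty as the 20% chance of the −20 penalty on the second step:

```python
gamma = 0.9

# Safe path: +2 on each of the first three steps.
safe_return = 2 + gamma * 2 + gamma**2 * 2                 # ~ 5.42

# Shortcut: +5 now, then an expected reward of 0.8*0 + 0.2*(-20) = -4 on the next step.
shortcut_return = 5 + gamma * (0.8 * 0 + 0.2 * (-20))      # ~ 1.4

print(round(safe_return, 2), round(shortcut_return, 2))    # 5.42 1.4
```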
Step 2. Value Functions: Predicting the Future
Instead of recalculating full returns each time, the agent builds functions:
State value:

V^π(s) = E_π[ G_t | S_t = s ]

Action value:

Q^π(s, a) = E_π[ G_t | S_t = s, A_t = a ]
Intuition:
V^π(s): “How good is it to be in state s if I keep following my policy?”
Q^π(s, a): “How good is it to take action a in state s, then follow my policy?”
These compress the entire future tree into single numbers, powerful abstractions for planning.
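To make the definitions a bit more tangible, here is a minimal sketch, not part of the example above, that estimates V(s) in the spirit of the definition: average the returns G_t observed from each state while following a fixed policy (the sample data below is made up):

```python
from collections import defaultdict

def estimate_state_values(state_return_pairs):
    """Estimate V(s) as the average of the returns G_t observed from each state s."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for state, g in state_return_pairs:
        totals[state] += g
        counts[state] += 1
    return {state: totals[state] / counts[state] for state in totals}

# Hypothetical (state, return) samples collected while following a fixed policy.
samples = [("A", 5.0), ("A", 7.0), ("B", 1.0)]
print(estimate_state_values(samples))  # {'A': 6.0, 'B': 1.0}
```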
Numerical Example 2: Gridworld Step
Suppose an agent in state s has two possible actions:
Action a_1: Reward = +1, moves to a terminal state.
Action a_2: Reward = 0, but leads to another state worth V(s') = 5.
With γ = 0.9:

Q(s, a_1) = 1 + 0.9 × 0 = 1
Q(s, a_2) = 0 + 0.9 × 5 = 4.5

So:

Q(s, a_2) > Q(s, a_1)
Interpretation: Even though action a_1 gives immediate satisfaction (+1), the agent learns to prefer a_2, which unlocks a much better future (worth 4.5).
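In code, the decision is just two one-step lookaheads; a small sketch with V(s') = 5 hard-coded as in the example:

```python
gamma = 0.9

# One-step lookahead: immediate reward + discounted value of what comes next.
q_a1 = 1 + gamma * 0   # a_1 ends the episode, so the future is worth 0
q_a2 = 0 + gamma * 5   # a_2 leads to a state worth V(s') = 5

print(q_a1, q_a2)                                      # 1.0 4.5
print("prefer a_1" if q_a1 > q_a2 else "prefer a_2")   # prefer a_2
```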
Step 3. Bellman Recursion: The Self-Referential Genius
The Bellman equation expresses this recursive reasoning:
For value functions:

V^π(s) = Σ_a π(a|s) Σ_{s', r} p(s', r | s, a) [ r + γ V^π(s') ]

For optimal value functions:

V*(s) = max_a Σ_{s', r} p(s', r | s, a) [ r + γ V*(s') ]
Interpretation:
The value of today depends on reward now + discounted value of tomorrow.
Like a recursive definition of “goodness”.
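The recursion translates almost line for line into code. Here is a minimal sketch, assuming the dynamics are given as a hypothetical transitions[(state, action)] list of (probability, reward, next_state) tuples and V is a dict of current value estimates:

```python
def bellman_backup(state, action, transitions, V, gamma):
    """One-step backup: sum of p * (r + gamma * V(s')) over the outcomes of (state, action)."""
    return sum(p * (r + gamma * V[s_next])
               for p, r, s_next in transitions[(state, action)])

def optimal_backup(state, actions, transitions, V, gamma):
    """Bellman optimality: take the best one-step backup over all available actions."""
    return max(bellman_backup(state, a, transitions, V, gamma) for a in actions)
```

The next example plugs concrete numbers into exactly this kind of backup.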
Numerical Example 3: Bellman Update
Consider a simple MDP with:
State s.
Two possible next states:
s_1: reward = 2, probability = 0.5, V(s_1) = 4.
s_2: reward = 0, probability = 0.5, V(s_2) = 6.
Discount factor: γ = 0.9.
Policy: deterministic (always takes action a).
Using the Bellman equation:

V(s) = 0.5 × (2 + 0.9 × 4) + 0.5 × (0 + 0.9 × 6) = 0.5 × 5.6 + 0.5 × 5.4 = 2.8 + 2.7 = 5.5
Interpretation: The agent assigns state s a value of 5.5, because it averages over uncertain futures, weighting rewards and next-state values.
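The same number falls out of a few lines of Python; a small sketch with V(s_1) and V(s_2) hard-coded:

```python
gamma = 0.9
V = {"s1": 4.0, "s2": 6.0}

# Outcomes of the deterministic action a: (probability, reward, next_state).
outcomes = [(0.5, 2.0, "s1"), (0.5, 0.0, "s2")]

V_s = sum(p * (r + gamma * V[s_next]) for p, r, s_next in outcomes)
print(round(V_s, 2))  # 5.5
```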
Why It Matters: From Chess to Driving
In chess, the “value” of a board position is not just the current material but also the outcomes promised by good play.
In self-driving cars, the “value” of being at an intersection is not only safety right now but also the probability of eventually reaching the destination.
This recursive planning is the heart of RL.
Special: Learn More in Learning to Learn: Reinforcement Learning Explained for Humans
If today’s math feels both beautiful and slightly overwhelming, that’s exactly why I wrote my book:
Learning to Learn: Reinforcement Learning Explained for Humans
It takes you step by step, from everyday analogies (like the shortcut problem above) to rigorous mathematics, all while keeping the storytelling alive.
Grab it here:
Closing
We started with returns: how agents add up futures.
We built value functions: compressing expectations into numbers.
We introduced the Bellman recursion: the mathematical law that forces agents to think about tomorrow today, with a worked example that shows exactly how an agent averages over uncertain futures.
Next time: we dive into the eternal dilemma, Exploration vs Exploitation.
Follow and Share
You can follow me on Medium to read more: https://medium.com/@satyamcser
#ReinforcementLearning #MachineLearning #BellmanEquation #ValueFunctions #Discounting #Gridworld #AIExplained #MLForHumans #RLMath #satmis