Q-Learning and Temporal Difference Learning: How Agents Actually Learn
The Story: Learning by Doing
Imagine a child learning to ride a bicycle. No one gives them the full physics equations of balance and motion. Instead, they try, wobble, sometimes fall, and adjust based on experience.
This is exactly how RL agents learn: not by knowing the full world model, but by interacting with it.
Dynamic Programming assumes we know the transition probabilities P(s′ ∣ s, a). In the real world, that model is rarely available.
Temporal Difference (TD) Learning and Q-Learning are the methods that let agents learn directly from sampled experience.
They are the bridge between theory (Bellman equations) and practice (learning online from data).
Step 1. Bootstrapping: Learning from Partial Futures
The central trick is bootstrapping: instead of waiting until the very end of an episode to compute the full return G_t, TD methods update value estimates using the one-step target r + γ V(s′) in place of the full return.
That is, the value of the current state is nudged toward the immediate reward plus the discounted estimate of the next state’s value.
This makes TD updates online and incremental.
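To make the contrast concrete, here is a minimal Python sketch, with made-up rewards and a made-up value estimate (illustrative only, not numbers from this post), comparing the full Monte Carlo return with the bootstrapped one-step target:

```python
# Illustrative sketch: full Monte Carlo return vs. bootstrapped TD target.
# The rewards and the value estimate below are made-up numbers.

gamma = 0.9

# Monte Carlo: must wait for the whole episode's rewards before updating.
episode_rewards = [1.0, 0.0, 2.0]      # r_{t+1}, r_{t+2}, r_{t+3}
G_t = sum(gamma**k * r for k, r in enumerate(episode_rewards))

# TD: update immediately, using the first reward plus a value *estimate*.
r_next = episode_rewards[0]
V_next_estimate = 2.5                  # current estimate of V(s_{t+1})
td_target = r_next + gamma * V_next_estimate

print(f"Monte Carlo return G_t = {G_t:.2f}")        # 2.62
print(f"Bootstrapped TD target = {td_target:.2f}")  # 3.25
```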
Step 2. TD(0) Update Rule
The simplest TD method is TD(0):
V(s) ← V(s) + α [ r + γ V(s′) − V(s) ]
α: learning rate (0 < α ≤ 1). The bracketed term is the TD error:
δ = r + γ V(s′) − V(s)
Interpretation: Adjust the old value estimate by a fraction α of the surprise (prediction error).
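As a concrete reference, here is a minimal Python sketch of this update; the function name td0_update and the dictionary-based value table are my own illustrative choices, not something prescribed by the post:

```python
# Minimal sketch of a tabular TD(0) update.
# V is a dict mapping states to value estimates; all names are illustrative.

def td0_update(V, s, r, s_next, alpha=0.5, gamma=0.9):
    """Nudge V[s] toward the one-step target r + gamma * V[s_next]."""
    td_target = r + gamma * V.get(s_next, 0.0)
    td_error = td_target - V.get(s, 0.0)     # the "surprise"
    V[s] = V.get(s, 0.0) + alpha * td_error
    return V[s], td_error
```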
Numerical Example 1: TD State Update
Suppose:
Current estimate: V(s) = 2.0
Next state’s value: V(s′) = 3.0
Reward: r = 1
Parameters: γ = 0.9, α = 0.5
Step 1: Compute target
target = r + γ V(s′) = 1 + 0.9 × 3.0 = 3.7
Step 2: Compute TD error
δ = target − V(s) = 3.7 − 2.0 = 1.7
Step 3: Update
V(s) ← V(s) + α · δ = 2.0 + 0.5 × 1.7 = 2.85
Interpretation: The agent thought the state was worth 2.0, but experience suggests it’s closer to 3.7. After the update, it moves halfway there (2.85). Over many updates, values converge.
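Plugging these numbers into the td0_update sketch above reproduces the result:

```python
# Reproduce Numerical Example 1 with the td0_update sketch above.
V = {"s": 2.0, "s_next": 3.0}
new_value, td_error = td0_update(V, "s", r=1.0, s_next="s_next", alpha=0.5, gamma=0.9)
print(new_value, td_error)   # ≈ 2.85 and ≈ 1.7 (up to float rounding)
```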
Step 3. From Values to Q-Values
Instead of just evaluating states, agents often need to evaluate state-action pairs.
The optimal Q-function satisfies the Bellman optimality equation:
Q*(s, a) = E[ r + γ max_{a′} Q*(s′, a′) ]
TD learning turns this into an update rule driven by sampled transitions:
Q(s, a) ← Q(s, a) + α [ r + γ max_{a′} Q(s′, a′) − Q(s, a) ]
This is Q-learning (a minimal code sketch follows the list below):
Uses real experience.
Bootstraps with the max operator.
Converges to Q* under standard conditions (every state–action pair is visited infinitely often and the learning rate decays appropriately).
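Here is a minimal Python sketch of the Q-learning update for a tabular Q-function; the dictionary layout, function name, and explicit action list are illustrative assumptions rather than anything prescribed by the post:

```python
# Minimal sketch of a tabular Q-learning update.
# Q maps (state, action) pairs to value estimates; all names are illustrative.

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    """Move Q[(s, a)] toward r + gamma * max over a' of Q[(s_next, a')]."""
    best_next = max(Q.get((s_next, a_next), 0.0) for a_next in actions)
    td_target = r + gamma * best_next
    td_error = td_target - Q.get((s, a), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error
    return Q[(s, a)], td_error
```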
Numerical Example 2: Q-Learning Update
Suppose:
Current state: s, with action a chosen.
Reward received: r = 2
Discount: γ = 0.9, learning rate: α = 0.5
Current Q-value: Q(s, a) = 1.0
Next state’s best action value: max_{a′} Q(s′, a′) = 4.0
Step 1: Compute target
target = r + γ max_{a′} Q(s′, a′) = 2 + 0.9 × 4.0 = 5.6
Step 2: Compute TD error
δ = target − Q(s, a) = 5.6 − 1.0 = 4.6
Step 3: Update
Q(s, a) ← Q(s, a) + α · δ = 1.0 + 0.5 × 4.6 = 3.3
Interpretation: The agent drastically increases its estimate of action a’s value, from 1.0 → 3.3, because the observed return was far higher than expected. With more updates, Q-values converge toward the optimal policy.
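The same numbers can be checked against the q_learning_update sketch above:

```python
# Reproduce Numerical Example 2 with the q_learning_update sketch above.
Q = {("s", "a"): 1.0, ("s_next", "a1"): 4.0, ("s_next", "a2"): 0.0}
new_q, td_error = q_learning_update(
    Q, "s", "a", r=2.0, s_next="s_next", actions=["a1", "a2"], alpha=0.5, gamma=0.9
)
print(new_q, td_error)   # ≈ 3.3 and ≈ 4.6 (up to float rounding)
```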
Step 4. Why TD and Q-Learning Matter
No model needed: Learn directly from raw experience.
Efficient: Update online instead of waiting for full episodes.
Bootstrapping: Use estimates to improve themselves.
Foundation: Modern Deep RL (DQN, Actor–Critic) is built on these principles.
Analogy: You don’t need to play every possible chess game. If you see one line leads to disaster, you immediately adjust your value of that move.
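To show how these pieces fit together end to end, here is a small, self-contained sketch of tabular Q-learning with ε-greedy exploration on a toy 1-D corridor; the environment, the exploration scheme, and every hyperparameter are illustrative assumptions, not something specified in this post:

```python
import random

# Toy 1-D corridor: states 0..4, start at 0, reward 1.0 for reaching state 4.
# The environment and all hyperparameters below are illustrative assumptions.
N_STATES = 5
ACTIONS = ["left", "right"]

def step(s, a):
    """Move one cell; reaching the last cell gives reward 1.0 and ends the episode."""
    s_next = max(0, s - 1) if a == "left" else min(N_STATES - 1, s + 1)
    reward = 1.0 if s_next == N_STATES - 1 else 0.0
    return s_next, reward, s_next == N_STATES - 1

def greedy(Q, s):
    """Greedy action with random tie-breaking."""
    best = max(Q[(s, a)] for a in ACTIONS)
    return random.choice([a for a in ACTIONS if Q[(s, a)] == best])

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, epsilon = 0.5, 0.9, 0.1

for episode in range(200):
    s = 0
    for t in range(100):                       # cap episode length
        # epsilon-greedy action selection
        a = random.choice(ACTIONS) if random.random() < epsilon else greedy(Q, s)
        s_next, r, done = step(s, a)
        # Q-learning update: bootstrap with the max over next actions
        best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s_next
        if done:
            break

# After training, the greedy policy should prefer "right" in every non-terminal state.
print({s: greedy(Q, s) for s in range(N_STATES - 1)})
```

With the settings above, the learned greedy policy should come to prefer “right” in every non-terminal state, which is the behavior the tabular convergence theory predicts for this toy problem.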
Special: Learn More in Learning to Learn: Reinforcement Learning Explained for Humans
If this post clicked and you want to master TD learning step by step, with more examples, derivations, and even code, check out my book:
Learning to Learn: Reinforcement Learning Explained for Humans
It covers:
TD learning explained with analogies and math.
Worked problems like the ones we solved above.
Code snippets to implement TD(0) and Q-learning.
Bridges from classical tabular RL to deep RL.
Grab it here:
Closing
We introduced TD learning: incremental updates from experience.
We extended to Q-learning, the most widely used value-based method.
We worked through two detailed numerical examples:
TD state update (2.0 → 2.85).
Q-learning update (1.0 → 3.3).
We saw why TD methods are the engine that powers practical RL.
Next post → Policy Gradients & Actor–Critic, where we move from value-based to policy-based learning.
Follow and Share
You can follow me on Medium to read more: https://medium.com/@satyamcser
#ReinforcementLearning #TemporalDifference #QLearning #MachineLearning #BellmanEquation #TDLearning #Bootstrapping #RLMath #AIExplained #DeepRL #satmis