Q-Learning: From Theory to Application
In the previous section, we covered the core ideas behind Q-Learning: states, actions, and the Q-value update formula that teaches an agent to make better decisions through trial and error.
Now we move from theory to practice by implementing Q-Learning in code, using a simple Grid World environment.
You’ll see how the algorithm interacts with an environment, learns from rewards, and gradually discovers the optimal path to a goal.
Watch the Introduction
Here are some short explainer videos to help you visualize Q-Learning before we dive into the code.
You can also refer to the Q-Learning section of this book for an in-depth treatment of reinforcement learning concepts.
Section 2: Q-Learning in Action
In this section, we will work through an example of Q-Learning.
Step 1: Understanding the Grid World
Imagine a 4x4 grid world environment.
- The agent starts at the top-left corner `(0,0)`
- The goal is to reach the bottom-right corner `(3,3)`
- Some cells are obstacles that the agent cannot move into
- The agent can move in 4 directions: up, down, left, right
- Rewards:
  - +10 for reaching the goal
  - -1 for each step (to encourage faster paths)
  - -10 for hitting an obstacle (optional)
You can visualize the grid as:
S . . .
. # . .
. . # .
. . . G
S = Start, G = Goal, # = Obstacle
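The layout above can also be generated from coordinates, which makes it easier to check that the obstacle positions match the picture. This is a minimal sketch: `GRID_SIZE`, `GOAL_STATE` and `OBSTACLES` mirror the names defined in Step 2, while `START_STATE` and `render_grid` are hypothetical names introduced here for illustration.

```python
GRID_SIZE = 4
START_STATE = (0, 0)           # assumed name; the tutorial hard-codes (0, 0)
GOAL_STATE = (3, 3)
OBSTACLES = [(1, 1), (2, 2)]


def render_grid():
    """Return the grid as text: S = start, G = goal, # = obstacle."""
    rows = []
    for r in range(GRID_SIZE):
        cells = []
        for c in range(GRID_SIZE):
            if (r, c) == START_STATE:
                cells.append("S")
            elif (r, c) == GOAL_STATE:
                cells.append("G")
            elif (r, c) in OBSTACLES:
                cells.append("#")
            else:
                cells.append(".")
        rows.append(" ".join(cells))
    return "\n".join(rows)


print(render_grid())
```

Changing `OBSTACLES` or `GOAL_STATE` and printing again is a quick way to confirm a new layout before training on it.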
Step 2: Define Environment Parameters
import numpy as np
import random

# Grid dimensions
GRID_SIZE = 4

# Actions
ACTIONS = ['up', 'down', 'left', 'right']
ACTION_COUNT = len(ACTIONS)

# Special cells
GOAL_STATE = (3, 3)
OBSTACLES = [(1, 1), (2, 2)]
Explanation:
- We define a 4x4 grid, for a total of 16 states (14 of which the agent can occupy, since 2 are obstacles).
- Each state is represented as `(row, col)`.
- The agent can move in 4 directions.
- Obstacles block movement.
- The goal gives a positive reward.
Step 3: Helper Functions
def is_valid_state(state):
    """Check if the state is inside the grid and not an obstacle."""
    r, c = state
    if r < 0 or r >= GRID_SIZE or c < 0 or c >= GRID_SIZE:
        return False
    if state in OBSTACLES:
        return False
    return True


def get_next_state(state, action):
    """Return the next state given the current state and action."""
    r, c = state
    if action == 'up':
        r -= 1
    elif action == 'down':
        r += 1
    elif action == 'left':
        c -= 1
    elif action == 'right':
        c += 1
    new_state = (r, c)
    if not is_valid_state(new_state):
        return state  # Stay in same place if move invalid
    return new_state


def get_reward(state):
    """Reward function based on the agent’s current state."""
    if state == GOAL_STATE:
        return 10
    elif state in OBSTACLES:
        # Not reached as written: get_next_state() never returns an
        # obstacle cell, so a blocked move just costs the step penalty.
        return -10
    else:
        return -1
Explanation:
- `is_valid_state()` ensures the agent doesn’t move outside the grid or into obstacles.
- `get_next_state()` defines how actions change the agent’s position.
- `get_reward()` returns the appropriate feedback after each move.
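To see how these helpers behave at the edges of the grid, here is a short sketch that calls them on a few hand-picked moves. It restates the three functions compactly so that it runs on its own. The `MOVES` lookup table is a hypothetical shortcut introduced here, not part of the tutorial code, but the behavior is intended to be the same.

```python
GRID_SIZE = 4
GOAL_STATE = (3, 3)
OBSTACLES = [(1, 1), (2, 2)]
# Hypothetical lookup table: (row change, column change) for each action
MOVES = {'up': (-1, 0), 'down': (1, 0), 'left': (0, -1), 'right': (0, 1)}


def is_valid_state(state):
    r, c = state
    return 0 <= r < GRID_SIZE and 0 <= c < GRID_SIZE and state not in OBSTACLES


def get_next_state(state, action):
    dr, dc = MOVES[action]
    new_state = (state[0] + dr, state[1] + dc)
    return new_state if is_valid_state(new_state) else state


def get_reward(state):
    if state == GOAL_STATE:
        return 10
    if state in OBSTACLES:
        return -10
    return -1


print(get_next_state((0, 0), 'up'))      # off the grid, so the agent stays at (0, 0)
print(get_next_state((0, 1), 'down'))    # (1, 1) is an obstacle, so it stays at (0, 1)
print(get_next_state((0, 0), 'right'))   # free cell, so it moves to (0, 1)
print(get_reward(get_next_state((3, 2), 'right')))  # reaches the goal: 10
print(get_reward(get_next_state((0, 1), 'down')))   # blocked move: -1, not -10
```

The last call shows a consequence of this design: because a blocked move leaves the agent where it was, the reward is computed on a normal cell and the -10 obstacle penalty is never applied. If you want bumping into an obstacle to be penalized, the reward would need to depend on the attempted move rather than on the resulting state.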
Step 4: Initialize the Q-table
Q = {}
for row in range(GRID_SIZE):
    for col in range(GRID_SIZE):
        Q[(row, col)] = {a: 0 for a in ACTIONS}
Explanation:
- Each key in `Q` is a state `(row, col)`.
- Each value is a dictionary mapping each action to its Q-value.
- Initially, all values are `0`.
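One detail worth seeing in action is how the greedy lookup behaves while every value is still `0`. The sketch below builds the same table with a dictionary comprehension and wraps the lookup in a helper. `greedy_action` is a hypothetical name introduced here for illustration.

```python
GRID_SIZE = 4
ACTIONS = ['up', 'down', 'left', 'right']

# Same structure as the Q-table above: state -> {action: Q-value}
Q = {(r, c): {a: 0 for a in ACTIONS}
     for r in range(GRID_SIZE) for c in range(GRID_SIZE)}


def greedy_action(q_table, state):
    """Return the action with the highest Q-value for this state."""
    return max(q_table[state], key=q_table[state].get)


print(len(Q))                    # 16 entries, obstacle cells included
print(greedy_action(Q, (0, 0)))  # all values tie at 0, so the first action wins: 'up'

Q[(0, 0)]['right'] = 0.5
print(greedy_action(Q, (0, 0)))  # 'right' now has the highest value
```

When several actions tie, `max` returns the first one in the dictionary's insertion order, so an untrained agent that exploits always tries `'up'` first. Early progress therefore comes from updates pushing tried actions below `0` and from random exploration.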
Step 5: Training the Agent (Q-learning Algorithm)
# Hyperparameters
alpha = 0.8      # Learning rate
gamma = 0.95     # Discount factor
epsilon = 0.2    # Exploration rate
episodes = 500   # Number of training episodes

for ep in range(episodes):
    state = (0, 0)  # Start state
    done = False

    while not done:
        # 1. Choose action (ε-greedy policy)
        if random.uniform(0, 1) < epsilon:
            action = random.choice(ACTIONS)  # Explore
        else:
            action = max(Q[state], key=Q[state].get)  # Exploit

        # 2. Take action and observe outcome
        next_state = get_next_state(state, action)
        reward = get_reward(next_state)

        # 3. Apply Q-learning formula
        old_value = Q[state][action]
        next_max = max(Q[next_state].values())
        new_value = old_value + alpha * (reward + gamma * next_max - old_value)
        Q[state][action] = new_value

        # 4. Move to next state
        state = next_state

        # 5. End if goal reached
        if state == GOAL_STATE:
            done = True
Explanation:
| Step | Code Part | Purpose |
|---|---|---|
| 1 | ε-greedy policy | Balances exploration and exploitation |
| 2 | Take action & observe reward | Simulates environment response |
| 3 | Q-update formula | Learns from experience |
| 4 | Move to next state | Progress through episode |
| 5 | Goal check | Ends the episode |
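To make step 3 concrete, the following sketch applies the update rule by hand to three situations, using the same `alpha = 0.8` and `gamma = 0.95` as the training loop. `q_update` is a hypothetical helper name. It wraps the formula only and is not part of the tutorial code.

```python
def q_update(old_value, reward, next_max, alpha=0.8, gamma=0.95):
    """One Q-learning update: move old_value toward reward + gamma * next_max."""
    return old_value + alpha * (reward + gamma * next_max - old_value)


# An ordinary step early in training: everything is still 0, reward is -1.
print(q_update(0.0, -1, 0.0))            # -0.8

# The step that enters the goal: reward is +10 and the goal's values stay 0.
print(q_update(0.0, 10, 0.0))            # 8.0

# One cell earlier, once the goal-entering action is worth 8.0:
# 0.8 * (-1 + 0.95 * 8.0) = 5.28
print(round(q_update(0.0, -1, 8.0), 2))  # 5.28
```

The third line shows how the goal reward propagates backwards: each state on the way to the goal inherits a discounted share of the value of the state after it, minus the step penalty.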
Step 6: Testing the Learned Policy
After training, let’s test what the agent has learned.
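state = (0, 0)
path = [state]
max_steps = GRID_SIZE * GRID_SIZE  # Safety cap in case the greedy policy loops

while state != GOAL_STATE and len(path) <= max_steps:
    action = max(Q[state], key=Q[state].get)
    state = get_next_state(state, action)
    path.append(state)

print("Learned Path to Goal:")
print(path)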
Explanation:
- The agent now follows the best-known action from each state, with no exploration.
- If training has converged, the resulting path is a shortest route around the obstacles (6 moves in this grid).
Tip: Try changing the obstacle positions, the goal position, the hyperparameters, and the number of episodes to see how the agent reacts to the new environment. Plot the total reward per episode to track the agent's learning curve.
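One way to track learning, sketched below, is to record the total reward of each episode during training. This is a compact restatement of the training loop under a few assumptions that are not in the original code: a seeded random generator for repeatable runs, a `max_steps` cap per episode, and the hypothetical helper names `step`, `train` and `block_average`. Blocked moves cost the normal step penalty, as in the helpers above.

```python
import random

GRID_SIZE = 4
ACTIONS = ['up', 'down', 'left', 'right']
GOAL_STATE = (3, 3)
OBSTACLES = [(1, 1), (2, 2)]
MOVES = {'up': (-1, 0), 'down': (1, 0), 'left': (0, -1), 'right': (0, 1)}


def step(state, action):
    """Return (next_state, reward); blocked moves leave the agent in place."""
    r, c = state[0] + MOVES[action][0], state[1] + MOVES[action][1]
    inside = 0 <= r < GRID_SIZE and 0 <= c < GRID_SIZE
    next_state = (r, c) if inside and (r, c) not in OBSTACLES else state
    return next_state, (10 if next_state == GOAL_STATE else -1)


def train(episodes=500, alpha=0.8, gamma=0.95, epsilon=0.2,
          max_steps=1000, seed=0):
    """Run Q-learning and return (Q, total reward of each episode)."""
    rng = random.Random(seed)  # assumption: seeded so runs are repeatable
    Q = {(r, c): {a: 0.0 for a in ACTIONS}
         for r in range(GRID_SIZE) for c in range(GRID_SIZE)}
    episode_rewards = []
    for episode in range(episodes):
        state, total = (0, 0), 0
        for _ in range(max_steps):  # assumption: cap so an episode always ends
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)               # explore
            else:
                action = max(Q[state], key=Q[state].get)   # exploit
            next_state, reward = step(state, action)
            best_next = max(Q[next_state].values())
            Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])
            state, total = next_state, total + reward
            if state == GOAL_STATE:
                break
        episode_rewards.append(total)
    return Q, episode_rewards


def block_average(values, window=50):
    """Average of each consecutive, non-overlapping block of `window` values."""
    return [sum(values[i:i + window]) / window
            for i in range(0, len(values) - window + 1, window)]


Q, episode_rewards = train()
print("First episode reward:", episode_rewards[0])
print("Average reward per block of 50 episodes:")
print([round(v, 2) for v in block_average(episode_rewards)])
```

The best possible episode takes 6 moves, for a total reward of 5 (five steps at -1 plus +10 at the goal), so the block averages should rise toward that value without reaching it while `epsilon` stays at 0.2. To draw an actual curve, `episode_rewards` can be passed to any plotting library you have installed.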
Step 7: Exercise
Try these:
- Add another obstacle and retrain the agent.
- Change the step penalty to `-2` and see how behavior changes.
- Plot total reward per episode to visualize learning progress.
- Reduce ε (exploration rate) gradually each episode.
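For the last exercise, one common option is to shrink ε by a constant factor each episode and stop at a small floor, so the agent never stops exploring entirely. The sketch below shows only the schedule. The function name `decayed_epsilon` and the values `minimum=0.01` and `decay=0.99` are illustrative assumptions, not settings from this tutorial.

```python
def decayed_epsilon(episode, start=0.2, minimum=0.01, decay=0.99):
    """Exponential decay from `start`, never dropping below `minimum`."""
    return max(minimum, start * decay ** episode)


# Inspect the schedule at a few points of a 500-episode run.
for ep in (0, 100, 250, 499):
    print(ep, round(decayed_epsilon(ep), 4))
```

In the training loop, this would mean computing `epsilon = decayed_epsilon(ep)` at the start of each episode instead of using a fixed value. A slower decay keeps exploration going for longer, which may matter more on larger grids.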
Bonus Challenge:
Expand the grid to 6x6 and observe how learning time increases.