Q-Learning: From Theory to Application

In the previous section, we learned the core ideas behind Q-Learning: what states and actions are, and how the Q-value update formula teaches an agent to make better decisions through trial and error.

Now we’ll move beyond the theory and see how Q-Learning works in practice by implementing it in code on a simple Grid World environment.
You’ll see how the algorithm interacts with an environment, learns from rewards, and gradually discovers the optimal path to a goal.

Watch the Introduction

Here are some short explainer videos to help you visualize Q-Learning before we dive into the code.

Video 1: Q-Learning

Video 2: A Detailed Explanation of Q-Learning

You can also refer to this book for in-depth reinforcement learning concepts; see its Q-Learning section.

Section 2: Q-Learning in Action

In this section, we will work through an example of Q-Learning.

Step 1: Understanding the Grid World

Imagine a 4x4 grid world environment.

The agent starts at the top-left corner (0,0)
The goal is to reach the bottom-right corner (3,3)
Some cells are obstacles that the agent cannot move into
The agent can move in 4 directions: up, down, left, right
Rewards:
- +10 for reaching the goal
- -1 for each step (to encourage faster paths)
- -10 for hitting an obstacle (optional)

You can visualize the grid as:

S  .  .  .
.  #  .  .
.  .  #  .
.  .  .  G

S = Start, G = Goal, # = Obstacle
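
As a quick check of the layout, the grid above can be printed from the start, goal, and obstacle coordinates. This is a minimal sketch: GRID_SIZE, GOAL_STATE, and OBSTACLES match the values defined in Step 2, while START_STATE and render_grid() are hypothetical names introduced only for this illustration.

```python
# Values assumed from the description above; START_STATE is a hypothetical name.
GRID_SIZE = 4
START_STATE = (0, 0)
GOAL_STATE = (3, 3)
OBSTACLES = [(1, 1), (2, 2)]


def render_grid():
    """Return the grid as text: S = start, G = goal, # = obstacle, . = free cell."""
    lines = []
    for r in range(GRID_SIZE):
        cells = []
        for c in range(GRID_SIZE):
            if (r, c) == START_STATE:
                cells.append('S')
            elif (r, c) == GOAL_STATE:
                cells.append('G')
            elif (r, c) in OBSTACLES:
                cells.append('#')
            else:
                cells.append('.')
        lines.append('  '.join(cells))
    return '\n'.join(lines)


print(render_grid())
```

Changing OBSTACLES or GOAL_STATE and re-running the snippet is an easy way to confirm a new layout before training on it.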

Step 2: Define Environment Parameters

import numpy as np
import random

# Grid dimensions
GRID_SIZE = 4

# Actions
ACTIONS = ['up', 'down', 'left', 'right']
ACTION_COUNT = len(ACTIONS)

# Goal and obstacle cells
GOAL_STATE = (3, 3)
OBSTACLES = [(1, 1), (2, 2)]

Explanation:

We define a 4x4 grid, giving 16 cells in total (2 of them are obstacles the agent can never occupy).
Each state is represented as (row, col).
The agent can move in 4 directions.
Obstacles block movement.
The goal gives a positive reward.

Step 3: Helper Functions

def is_valid_state(state):
    """Check if the state is inside the grid and not an obstacle."""
    r, c = state
    if r < 0 or r >= GRID_SIZE or c < 0 or c >= GRID_SIZE:
        return False
    if state in OBSTACLES:
        return False
    return True


def get_next_state(state, action):
    """Return the next state given the current state and action."""
    r, c = state
    if action == 'up':
        r -= 1
    elif action == 'down':
        r += 1
    elif action == 'left':
        c -= 1
    elif action == 'right':
        c += 1

    new_state = (r, c)
    if not is_valid_state(new_state):
        return state  # Stay in same place if move invalid
    return new_state


def get_reward(state):
    """Reward function based on the agent’s current state."""
    if state == GOAL_STATE:
        return 10
    elif state in OBSTACLES:
        return -10
    else:
        return -1

Explanation:

is_valid_state() ensures the agent doesn’t move outside the grid or into obstacles.
get_next_state() defines how actions change the agent’s position.
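get_reward() returns the feedback for the state the agent lands in. Because get_next_state() never returns an obstacle cell, the -10 branch is not triggered by these helpers; a blocked move simply costs the usual -1.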
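
To see the helpers in action, here is a small self-contained sketch that calls them on a few moves. The functions are rewritten compactly so the snippet runs on its own; the MOVES dictionary is a hypothetical alternative to the if/elif chain, not part of the original code, and is intended to behave the same way.

```python
GRID_SIZE = 4
GOAL_STATE = (3, 3)
OBSTACLES = [(1, 1), (2, 2)]
ACTIONS = ['up', 'down', 'left', 'right']

# Hypothetical lookup table: (row offset, column offset) for each action.
MOVES = {'up': (-1, 0), 'down': (1, 0), 'left': (0, -1), 'right': (0, 1)}


def is_valid_state(state):
    r, c = state
    return 0 <= r < GRID_SIZE and 0 <= c < GRID_SIZE and state not in OBSTACLES


def get_next_state(state, action):
    dr, dc = MOVES[action]
    new_state = (state[0] + dr, state[1] + dc)
    return new_state if is_valid_state(new_state) else state


def get_reward(state):
    if state == GOAL_STATE:
        return 10
    if state in OBSTACLES:
        return -10
    return -1


print(get_next_state((0, 0), 'up'))     # off-grid move: agent stays at (0, 0)
print(get_next_state((0, 1), 'down'))   # (1, 1) is an obstacle: agent stays at (0, 1)
print(get_next_state((0, 0), 'right'))  # legal move: agent reaches (0, 1)
print(get_reward((3, 2)))               # ordinary cell: -1
print(get_reward((3, 3)))               # goal cell: 10

# Check every transition from every free cell: none of them ends in an obstacle.
free_cells = [(r, c) for r in range(GRID_SIZE) for c in range(GRID_SIZE)
              if (r, c) not in OBSTACLES]
lands_in_obstacle = any(get_next_state(s, a) in OBSTACLES
                        for s in free_cells for a in ACTIONS)
print(lands_in_obstacle)                # False
```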

Step 4: Initialize the Q-table

Q = {}
for row in range(GRID_SIZE):
    for col in range(GRID_SIZE):
        Q[(row, col)] = {a: 0 for a in ACTIONS}

Explanation:

Each key in Q is a state (row, col).
Each value is a dictionary mapping actions → Q-values.
Initially, all values are 0.
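
One detail worth knowing about this layout: when several actions share the highest Q-value (as they all do at the start), max(Q[state], key=Q[state].get) returns the first one in insertion order, which is 'up'. The sketch below shows this and one possible way to break ties at random instead. greedy_action() is a hypothetical helper, not part of the original code.

```python
import random

ACTIONS = ['up', 'down', 'left', 'right']
Q = {(r, c): {a: 0 for a in ACTIONS} for r in range(4) for c in range(4)}

# All four values are 0, so max() returns the first key in insertion order.
first_choice = max(Q[(1, 0)], key=Q[(1, 0)].get)
print(first_choice)  # up


def greedy_action(q_values, rng=random):
    """Hypothetical helper: pick a best action, breaking ties at random."""
    best = max(q_values.values())
    candidates = [a for a, v in q_values.items() if v == best]
    return rng.choice(candidates)


# Once one action is clearly best, both approaches agree.
Q[(0, 0)]['right'] = 1.5
print(greedy_action(Q[(0, 0)]))  # right
```

Whether random tie-breaking helps depends on the environment. Here the step penalty quickly makes tried actions look worse than untried ones, so the plain max() lookup still explores.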

Step 5: Training the Agent (Q-learning Algorithm)

# Hyperparameters
alpha = 0.8     # Learning rate
gamma = 0.95    # Discount factor
epsilon = 0.2   # Exploration rate
episodes = 500  # Number of training episodes

for ep in range(episodes):
    state = (0, 0)  # Start state
    done = False

    while not done:
        # 1. Choose action (ε-greedy policy)
        if random.uniform(0, 1) < epsilon:
            action = random.choice(ACTIONS)   # Explore
        else:
            action = max(Q[state], key=Q[state].get)  # Exploit

        # 2. Take action and observe outcome
        next_state = get_next_state(state, action)
        reward = get_reward(next_state)

        # 3. Apply Q-learning formula
        old_value = Q[state][action]
        next_max = max(Q[next_state].values())

        new_value = old_value + alpha * (reward + gamma * next_max - old_value)
        Q[state][action] = new_value

        # 4. Move to next state
        state = next_state

        # 5. End if goal reached
        if state == GOAL_STATE:
            done = True

Explanation:

Step	Code Part	Purpose
1	ε-greedy policy	Balances exploration and exploitation
2	Take action & observe reward	Simulates environment response
3	Q-update formula	Learns from experience
4	Move to next state	Progress through episode
5	Goal check	Ends the episode
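
To make the update in step 3 concrete, here is a worked example using the same hyperparameters (alpha = 0.8, gamma = 0.95). The function q_update() is a hypothetical wrapper around the update line from the training loop, and the two situations below are illustrative cases rather than logged training output.

```python
def q_update(old_value, reward, next_max, alpha=0.8, gamma=0.95):
    """One Q-learning update, written exactly as in step 3 of the training loop."""
    return old_value + alpha * (reward + gamma * next_max - old_value)


# Case 1: the agent steps into the goal for the first time.
# The reward is +10 and every Q-value of the goal state is still 0.
q_into_goal = q_update(old_value=0.0, reward=10, next_max=0.0)
print(q_into_goal)  # 8.0

# Case 2: later, the agent steps into the cell next to the goal.
# The reward is -1, but the best next value is now the 8.0 learned above.
q_one_step_before = q_update(old_value=0.0, reward=-1, next_max=q_into_goal)
print(round(q_one_step_before, 2))  # 5.28
```

This is how the goal reward propagates backward through the grid: each update pulls a state's value toward the reward plus the discounted best value of the state that follows it.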

Step 6: Testing the Learned Policy

After training, let’s test what the agent has learned.

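state = (0, 0)
path = [state]
max_steps = GRID_SIZE * GRID_SIZE  # Safety cap so a poorly trained policy cannot loop forever

while state != GOAL_STATE and len(path) <= max_steps:
    action = max(Q[state], key=Q[state].get)  # Always take the best-known action
    state = get_next_state(state, action)
    path.append(state)

print("Learned Path to Goal:")
print(path)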

Explanation:

The agent now follows the best-known action from each state.
After enough training, the resulting path should be a shortest route that avoids the obstacles (6 moves on this grid).
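
One way to check the "shortest route" claim independently of Q-Learning is a breadth-first search over the same grid. This is a verification aid under the same grid assumptions, not part of the learning algorithm; shortest_path_length() is a hypothetical helper.

```python
from collections import deque

GRID_SIZE = 4
GOAL_STATE = (3, 3)
OBSTACLES = [(1, 1), (2, 2)]
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right


def shortest_path_length(start, goal):
    """Breadth-first search: minimum number of moves, or None if unreachable."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        (r, c), dist = queue.popleft()
        if (r, c) == goal:
            return dist
        for dr, dc in MOVES:
            nxt = (r + dr, c + dc)
            inside = 0 <= nxt[0] < GRID_SIZE and 0 <= nxt[1] < GRID_SIZE
            if inside and nxt not in OBSTACLES and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


print(shortest_path_length((0, 0), GOAL_STATE))  # 6
```

A learned path is optimal when len(path) - 1 equals this number. Several different routes can share that length, so the exact cells may vary between training runs.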

Tip: Try changing the obstacle positions, the goal position, the hyperparameters, and the number of episodes to see how the agent reacts to the new environment. Plot the total reward per episode to track the agent’s learning curve.

Step 7: Exercise

Try These:

Add another obstacle and retrain the agent.
Change the step penalty to -2 and see how behavior changes.
Plot total reward per episode to visualize learning progress.
Reduce ε (exploration rate) gradually each episode.
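
As a starting point for the last two exercises, the sketch below records the total reward of each episode and shrinks ε after every episode. It is one possible approach under stated assumptions: the multiplicative decay schedule and the names step(), train(), eps_start, eps_min, eps_decay, and seed are all hypothetical choices, and the snippet re-declares the environment so that it runs on its own.

```python
import random

GRID_SIZE = 4
ACTIONS = ['up', 'down', 'left', 'right']
MOVES = {'up': (-1, 0), 'down': (1, 0), 'left': (0, -1), 'right': (0, 1)}
GOAL_STATE = (3, 3)
OBSTACLES = [(1, 1), (2, 2)]


def step(state, action):
    """Return (next_state, reward) using the same rules as the helper functions."""
    dr, dc = MOVES[action]
    nxt = (state[0] + dr, state[1] + dc)
    inside = 0 <= nxt[0] < GRID_SIZE and 0 <= nxt[1] < GRID_SIZE
    if not inside or nxt in OBSTACLES:
        nxt = state  # blocked move: stay in place
    return nxt, (10 if nxt == GOAL_STATE else -1)


def train(episodes=500, alpha=0.8, gamma=0.95,
          eps_start=0.2, eps_min=0.01, eps_decay=0.99, seed=0):
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    Q = {(r, c): {a: 0.0 for a in ACTIONS}
         for r in range(GRID_SIZE) for c in range(GRID_SIZE)}
    episode_rewards = []
    epsilon = eps_start
    for _ in range(episodes):
        state, total = (0, 0), 0
        while state != GOAL_STATE:
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)              # explore
            else:
                action = max(Q[state], key=Q[state].get)  # exploit
            nxt, reward = step(state, action)
            target = reward + gamma * max(Q[nxt].values())
            Q[state][action] += alpha * (target - Q[state][action])
            state, total = nxt, total + reward
        episode_rewards.append(total)
        epsilon = max(eps_min, epsilon * eps_decay)  # decay once per episode
    return Q, episode_rewards


Q, episode_rewards = train()
print("Mean reward, first 20 episodes:", sum(episode_rewards[:20]) / 20)
print("Mean reward, last 20 episodes: ", sum(episode_rewards[-20:]) / 20)
```

The episode_rewards list can be passed to any plotting library to draw the learning curve. On this grid the best possible episode total is 5 (five steps at -1, then +10 for the goal), which gives a reference line for the plot.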

Bonus Challenge:
Expand the grid to 6x6 and observe how learning time increases.
