
What is Reinforcement Learning? Explained with Python Examples

DodaTech 2 min read

In this tutorial, you'll learn what reinforcement learning is and how it works, with Python examples. We cover the key concepts, a worked Q-learning example, and guidance on when reinforcement learning is a good fit.

What You'll Learn

Understand reinforcement learning fundamentals — agents, environments, rewards, policies — and build a Q-learning agent that learns to navigate a grid.

Why It Matters

Reinforcement learning is behind AlphaGo and is applied in robotics, game AI, autonomous driving research, and automated trading systems.

Real-World Use

Training a robot to walk, optimizing data center cooling (DeepMind reported cutting the energy used to cool Google's data centers by up to 40%), and teaching game AIs to beat human champions.

What is Reinforcement Learning?

Reinforcement learning (RL) is a type of machine learning in which an agent learns by taking actions and receiving rewards, much like training a dog with treats.

Agent → Takes action → Environment → Returns reward + new state
Agent ← Learns from reward ← Environment

The agent's goal: maximize total reward over time.
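To make "total reward over time" concrete, here is a minimal sketch that sums a sequence of rewards while discounting later ones. The function name `discounted_return` and the sample rewards are illustrative assumptions; the discount of 0.95 is the value the Q-learning code in this tutorial uses.

```python
def discounted_return(rewards, discount=0.95):
    """Sum rewards, weighting the reward at step t by discount**t."""
    total = 0.0
    # Work backwards so each step is: reward + discount * (return of the rest)
    for reward in reversed(rewards):
        total = reward + discount * total
    return total


# Hypothetical episode: three small step penalties, then the goal reward
episode_rewards = [-0.01, -0.01, -0.01, 1.0]
print(discounted_return(episode_rewards))
```

A discount below 1 makes a reward that arrives sooner worth more than the same reward later, which is why a trained agent tends to prefer shorter paths to the goal.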

Key Concepts

| Concept | Definition | Example |
|---|---|---|
| Agent | The learner/decision-maker | A game player |
| Environment | The world the agent interacts with | The game board |
| Action | What the agent can do | Move left, right, up, down |
| State | Current situation | Player position |
| Reward | Feedback signal | +1 for reaching goal, -1 for falling |
| Policy | Strategy for choosing actions | "Always go toward the goal" |
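The concepts above map directly onto code. The following is a rough sketch using a made-up one-dimensional world; the names `LineWorld`, `random_policy`, and `run_episode` are hypothetical and chosen only for illustration.

```python
import random


class LineWorld:
    """Hypothetical environment: positions 0..4 on a line, goal at position 4."""

    def __init__(self):
        self.state = 0  # State: the agent's current position

    def step(self, action):
        """Apply an action (-1 = left, +1 = right); return (new_state, reward, done)."""
        # Clamp the position so the agent cannot leave the line
        self.state = min(4, max(0, self.state + action))
        done = self.state == 4
        reward = 1.0 if done else -0.01  # Reward: feedback signal
        return self.state, reward, done


def random_policy(state):
    """Policy: maps a state to an action (this one ignores the state)."""
    return random.choice([-1, 1])


def run_episode(env, policy, max_steps=1000):
    """Agent loop: act, observe reward and new state, until done or out of steps."""
    state, total_reward = env.state, 0.0
    for _ in range(max_steps):
        action = policy(state)
        state, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward


random.seed(0)
print(run_episode(LineWorld(), random_policy))
```

Swapping `random_policy` for a policy that always returns `+1` reaches the goal in four steps. Learning algorithms such as Q-learning aim to find that kind of policy from rewards alone.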

Q-Learning from Scratch

Let's build an agent that learns to navigate a 5x5 grid to reach a goal.
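The core of Q-learning is one update rule: nudge the current estimate toward the observed reward plus the discounted best value of the next state. Here is a one-step sketch with made-up numbers; the helper name `q_update` is hypothetical, and the default learning rate and discount are the ones used in the program that follows.

```python
def q_update(q_old, reward, best_next, learning_rate=0.1, discount=0.95):
    """Return the new Q-value after a single Q-learning update."""
    # Target: immediate reward plus discounted value of the best next action
    target = reward + discount * best_next
    # Move the old estimate a fraction (learning_rate) of the way to the target
    return q_old + learning_rate * (target - q_old)


# Made-up numbers: current estimate 0.0, step penalty -0.01, best next value 0.5
print(q_update(q_old=0.0, reward=-0.01, best_next=0.5))
```

The full program applies this same update to one Q-table entry after each valid move.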

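import numpy as np

# Grid: 0=empty, 1=obstacle, 2=goal
grid = np.array([
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 1, 0, 1, 0],
    [0, 0, 0, 0, 2]
])

# Q-table: one value per (row, col, action); actions are up, down, left, right
q_table = np.zeros((5, 5, 4))

# Action index -> (row change, column change)
actions = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}
learning_rate = 0.1
discount = 0.95
epsilon = 0.1     # chance of a random action, so the agent keeps exploring
episodes = 1000
max_steps = 500   # cap on episode length so training always terminates

rng = np.random.default_rng(0)  # seeded for reproducible runs

for _ in range(episodes):
    state = (0, 0)
    steps = 0
    while grid[state] != 2 and steps < max_steps:
        steps += 1
        # Epsilon-greedy: usually take the best known action, sometimes explore
        if rng.random() < epsilon:
            action = int(rng.integers(4))
        else:
            action = int(np.argmax(q_table[state[0], state[1]]))
        dr, dc = actions[action]
        new_state = (state[0] + dr, state[1] + dc)

        # Check bounds and obstacles
        if (0 <= new_state[0] < 5 and 0 <= new_state[1] < 5
                and grid[new_state] != 1):
            reward = 1 if grid[new_state] == 2 else -0.01
            # Q-learning update: move the estimate toward
            # reward + discount * (best value in the next state)
            best_next = np.max(q_table[new_state[0], new_state[1]])
            q_table[state[0], state[1], action] += learning_rate * (
                reward + discount * best_next -
                q_table[state[0], state[1], action]
            )
            state = new_state
        else:
            # Penalize invalid moves; the agent stays where it is
            q_table[state[0], state[1], action] -= 0.1

print("Training complete!")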
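Once training finishes, the learned behavior can be read out of the Q-table by following the highest-valued action from each cell. The sketch below is one possible way to do that; `greedy_path` is a hypothetical helper, and the tiny corridor with hand-filled Q-values is there only to keep the example self-contained.

```python
import numpy as np

# Same action encoding as the training code: up, down, left, right
ACTIONS = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}


def greedy_path(q_table, grid, start=(0, 0), max_steps=50):
    """Follow the highest-valued action from each cell; return the visited cells."""
    path = [start]
    state = start
    for _ in range(max_steps):
        if grid[state] == 2:  # reached the goal
            break
        action = int(np.argmax(q_table[state[0], state[1]]))
        dr, dc = ACTIONS[action]
        nxt = (state[0] + dr, state[1] + dc)
        # Stop if the greedy action leaves the grid or hits an obstacle
        in_bounds = 0 <= nxt[0] < grid.shape[0] and 0 <= nxt[1] < grid.shape[1]
        if not in_bounds or grid[nxt] == 1:
            break
        state = nxt
        path.append(state)
    return path


# Hand-filled example (hypothetical values): a 1x3 corridor, goal on the right
demo_grid = np.array([[0, 0, 2]])
demo_q = np.zeros((1, 3, 4))
demo_q[0, 0, 3] = 0.9  # "right" looks best in cell (0, 0)
demo_q[0, 1, 3] = 1.0  # "right" looks best in cell (0, 1)
print(greedy_path(demo_q, demo_grid))
```

With the tutorial's own variables, the call would be `greedy_path(q_table, grid)`. If the returned path does not end at the goal cell, the agent likely needs more training episodes.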

When to Use RL

| Good fit | Poor fit |
|---|---|
| Sequential decision-making | One-shot predictions |
| Environment is a simulator | Real-world with slow feedback |
| Exploration is safe | Mistakes are expensive |
