
Reinforcement Learning: Q-Learning, Deep RL and Practical Applications

DodaTech Updated 2026-06-22 7 min read

In this tutorial, you'll learn about reinforcement learning, from Q-learning to deep RL. We cover key concepts, practical examples, and best practices to help you understand and apply this topic effectively.

Reinforcement learning trains agents to make sequential decisions by interacting with an environment, learning optimal behavior through trial and error guided by reward signals.

What You'll Learn

In this tutorial, you'll learn reinforcement learning fundamentals from tabular Q-learning to deep Q-networks and policy gradients, and explore practical applications in game playing, robotics, and recommendation systems using Python.

Why It Matters

Reinforcement learning is behind AlphaGo and is used in robotics control, autonomous-driving research, and personalized recommendation engines. Unlike supervised learning, which learns from labeled examples, RL learns from experience. That makes it well suited to problems where the optimal action depends on future consequences and cannot be determined from a single input.

Real-World Use

Recommendation systems use reinforcement learning to optimize long-term user engagement. Instead of recommending the most likely click (greedy), the RL agent balances showing familiar content for immediate engagement versus exploring new content that might improve future recommendations. Doda Browser uses similar principles to suggest frequently visited pages based on browsing patterns.

Markov Decision Processes

An MDP formalizes the RL problem as a tuple (S, A, P, R, gamma) where S is the set of states, A is the set of actions, P(s' | s, a) is the transition probability, R(s, a) is the reward function, and gamma is the discount factor. The discount factor gamma determines how much the agent values future rewards. A gamma close to 1 makes the agent far-sighted, while gamma close to 0 makes it myopic. The agent's goal is to find a policy pi(s) that maps states to actions and maximizes the expected cumulative discounted reward.
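To make the role of gamma concrete, here is a minimal sketch that computes the discounted return of one episode for a far-sighted and a myopic agent. The reward sequence is made up for illustration and the helper name `discounted_return` is hypothetical, not part of any library.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one episode's reward sequence."""
    total = 0.0
    # Work backwards so each step is one multiply-add: G_t = r_t + gamma * G_{t+1}
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

# Hypothetical episode: three small step penalties, then a large delayed reward.
rewards = [-1.0, -1.0, -1.0, 10.0]

far_sighted = discounted_return(rewards, gamma=0.99)
myopic = discounted_return(rewards, gamma=0.1)

print(f"gamma=0.99 -> return {far_sighted:.3f}")  # 6.733
print(f"gamma=0.10 -> return {myopic:.3f}")       # -1.100
```

With gamma at 0.99 the delayed reward dominates and the episode looks worthwhile. With gamma at 0.1 the same episode looks like a net loss, because the agent barely counts the reward that arrives three steps later.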

flowchart TD
  A[Agent] -->|Action a| B[Environment]
  B -->|Reward r, Next State s'| A
  B --> C[State s]
  C --> A
  style A fill:#4a90d9,color:#fff
  style B fill:#e67e22,color:#fff

Tabular Q-Learning

Q-learning learns the optimal action-value function Q(s, a) without requiring a model of the environment. It is model-free: the agent does not need to know transition probabilities. The Q-value represents the expected total future reward for taking action a in state s and then following the optimal policy. The update rule moves the current Q-value toward the observed reward plus the discounted maximum Q-value of the next state. In symbols: Q(s, a) <- Q(s, a) + alpha * [r + gamma * max over a' of Q(s', a') - Q(s, a)]. The learning rate alpha controls how much new information overrides old estimates.

import numpy as np

class QLearningAgent:
    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.99, epsilon=0.1):
        self.q_table = np.zeros((n_states, n_actions))
        self.alpha = alpha
        self.gamma = gamma
        self.epsilon = epsilon
        self.n_actions = n_actions

    def select_action(self, state):
        if np.random.random() < self.epsilon:
            return np.random.randint(self.n_actions)
        return np.argmax(self.q_table[state])

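    def update(self, state, action, reward, next_state):
        # TD target: observed reward plus the discounted value of the best next action
        best_next = np.max(self.q_table[next_state])
        td_target = reward + self.gamma * best_next
        # Move Q(s, a) a fraction alpha of the way toward the target
        td_error = td_target - self.q_table[state][action]
        self.q_table[state][action] += self.alpha * td_error
        return td_error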

env_states = 6
env_actions = 2
agent = QLearningAgent(env_states, env_actions, alpha=0.1, gamma=0.95, epsilon=0.2)

np.random.seed(42)
for episode in range(500):
    state = np.random.randint(0, env_states)
    total_reward = 0
    for step in range(100):
        action = agent.select_action(state)
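        # Toy dynamics: the next state is uniform random and ignores the action,
        # so this loop exercises the update rule rather than learning a useful policy.
        next_state = np.random.randint(0, env_states)
        # +1.0 for reaching the last state (which ends the episode), -0.1 for any other step
        reward = 1.0 if next_state == env_states - 1 else -0.1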
        td_error = agent.update(state, action, reward, next_state)
        total_reward += reward
        state = next_state
        if state == env_states - 1:
            break

print(f"Q-table shape: {agent.q_table.shape}")
print(f"Q-table (first 3 states):\n{agent.q_table[:3]}")
print(f"Greedy policy: {np.argmax(agent.q_table, axis=1)}")

Example output (illustrative; your exact values may differ):

Q-table shape: (6, 2)
Q-table (first 3 states):
[[ 1.864  1.864]
 [ 1.940  1.940]
 [ 2.042  2.042]]
Greedy policy: [0 0 0 0 0 0]
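Because the toy loop above draws the next state at random, the action has no effect on the outcome and the learned policy carries no meaning. The sketch below applies the same update rule to a hypothetical five-state chain where the action does matter: moving right eventually reaches a goal state. The environment, the `step` and `train` helpers, and all hyperparameters are assumptions made for illustration. It also skips the bootstrap term on terminal transitions, which is a common refinement.

```python
import random

N_STATES, GOAL = 5, 4      # states 0..4, state 4 is terminal
LEFT, RIGHT = 0, 1

def step(state, action):
    """Deterministic chain: LEFT moves toward 0, RIGHT moves toward the goal."""
    next_state = max(0, state - 1) if action == LEFT else state + 1
    done = next_state == GOAL
    reward = 1.0 if done else -0.1
    return next_state, reward, done

def train(episodes=500, alpha=0.1, gamma=0.95, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # step cap per episode
            if rng.random() < epsilon:
                action = rng.randrange(2)  # explore
            else:
                # Exploit, breaking ties at random so early episodes do not stall
                best = max(q[state])
                action = rng.choice([a for a in (LEFT, RIGHT) if q[state][a] == best])
            next_state, reward, done = step(state, action)
            # No bootstrapping from a terminal state: its future value is zero
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if done:
                break
    return q

q = train()
# Greedy action for each non-terminal state
policy = [RIGHT if row[RIGHT] > row[LEFT] else LEFT for row in q[:GOAL]]
print(f"Greedy policy (0=left, 1=right): {policy}")
```

With these settings the greedy policy should choose RIGHT in every non-terminal state, since each step to the left only adds another -0.1 penalty before the goal is reached.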

Deep Q-Networks (DQN)

Deep Q-Networks replace the Q-table with a neural network, enabling RL in high-dimensional state spaces like images. DQN relies on two key techniques: experience replay (storing past experiences and sampling them randomly to break correlation) and a target network (a separate frozen network for computing TD targets, updated periodically to stabilize training). The agent stores transitions (s, a, r, s') in a replay buffer and samples mini-batches for training.

import collections
import random

class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.buffer = collections.deque(maxlen=capacity)

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

buffer = ReplayBuffer(capacity=1000)
for i in range(100):
    buffer.push((f"s{i}", f"a{i}", 0.0, f"s{i+1}"))

if len(buffer) >= 32:
    batch = buffer.sample(32)
    states, actions, rewards, next_states = zip(*batch)
    print(f"Batch size: {len(states)}")
    print(f"Sample state: {states[0]}")
    print(f"Sample action: {actions[0]}")
    print(f"Buffer size: {len(buffer)}")

Example output (the sampled state and action vary between runs because no random seed is set):

Batch size: 32
Sample state: s42
Sample action: a42
Buffer size: 100
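The target-network idea described above can be sketched without a deep learning framework. The example below uses a tiny linear function as a stand-in for the Q-network, which is an assumption made to keep the code short; a real DQN would use a neural network. The weights, the sync interval, and the extra `done` flag in each transition are also illustrative choices, not values from this tutorial.

```python
import copy

def q_values(weights, state):
    """Linear stand-in for a Q-network: one weight vector per action."""
    return [sum(w * x for w, x in zip(row, state)) for row in weights]

def td_targets(batch, target_weights, gamma):
    """r + gamma * max_a' Q_target(s', a'), with no bootstrap on terminal steps."""
    targets = []
    for _state, _action, reward, next_state, done in batch:
        bootstrap = 0.0 if done else gamma * max(q_values(target_weights, next_state))
        targets.append(reward + bootstrap)
    return targets

# Hypothetical weights: 2 actions x 2 state features
online_weights = [[0.5, -0.2], [0.1, 0.3]]
target_weights = copy.deepcopy(online_weights)  # frozen copy used only for targets

# Transitions as (state, action, reward, next_state, done)
batch = [
    ([1.0, 0.0], 0, 1.0, [0.0, 1.0], False),
    ([0.0, 1.0], 1, -1.0, [1.0, 1.0], True),
]

targets_before = td_targets(batch, target_weights, gamma=0.9)

# Pretend a gradient step changed the online network
online_weights[1][1] = 2.0
targets_frozen = td_targets(batch, target_weights, gamma=0.9)

# Every SYNC_EVERY training steps, copy the online weights into the target network
SYNC_EVERY = 100
for training_step in range(1, 101):
    if training_step % SYNC_EVERY == 0:
        target_weights = copy.deepcopy(online_weights)
targets_synced = td_targets(batch, target_weights, gamma=0.9)

print(targets_before, targets_frozen, targets_synced)
```

The targets stay the same after the online weights change and only move once the copy happens. That delay is what keeps the regression target from shifting on every gradient step.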

Policy Gradient Methods

Policy gradient methods optimize the policy directly rather than deriving it from a learned value function. The policy is parameterized by a neural network that outputs action probabilities. The REINFORCE algorithm uses Monte Carlo returns: it runs an entire episode, computes the discounted return for each step, and updates the policy to increase the probability of actions that led to higher returns. Policy gradients naturally handle continuous action spaces and stochastic policies, unlike Q-learning, which requires an argmax over discrete actions.

import tensorflow as tf
from tensorflow import keras

class PolicyNetwork(keras.Model):
    def __init__(self, n_actions):
        super().__init__()
        self.dense1 = keras.layers.Dense(24, activation='relu')
        self.dense2 = keras.layers.Dense(24, activation='relu')
        self.logits = keras.layers.Dense(n_actions)

    def call(self, state):
        x = self.dense1(state)
        x = self.dense2(x)
        return tf.nn.softmax(self.logits(x))

n_actions = 4
policy_net = PolicyNetwork(n_actions)

dummy_state = tf.constant([[0.1, 0.2, 0.3, 0.4, 0.5]])
action_probs = policy_net(dummy_state)
action = tf.random.categorical(tf.math.log(action_probs), 1)

print(f"Action probabilities: {action_probs.numpy().round(3)}")
print(f"Sampled action: {action.numpy()[0, 0]}")
print(f"Sum of probs: {action_probs.numpy().sum():.3f}")

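Example output (probabilities and the sampled action vary between runs because the network weights are randomly initialized):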

Action probabilities: [[0.253 0.247 0.251 0.249]]
Sampled action: 2
Sum of probs: 1.000
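The network above only produces action probabilities. The REINFORCE step described earlier also needs the per-step discounted returns and a loss that weights each action's log-probability by its return. The sketch below shows that arithmetic in plain Python under simplifying assumptions: the rewards and log-probabilities are made-up numbers, and the helper names are hypothetical. In practice the log-probabilities would come from the policy network and the gradient from automatic differentiation.

```python
import math

def discounted_returns(rewards, gamma):
    """Return-to-go G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def normalize(values, eps=1e-8):
    """Shift to mean 0 and scale to std 1 to keep gradient magnitudes comparable."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / (std + eps) for v in values]

def reinforce_loss(log_probs, returns):
    """Negative of sum_t log pi(a_t | s_t) * G_t; minimizing it favors high-return actions."""
    return -sum(lp * g for lp, g in zip(log_probs, returns))

# Hypothetical three-step episode with reward 1.0 at every step
rewards = [1.0, 1.0, 1.0]
returns = discounted_returns(rewards, gamma=0.5)   # [1.75, 1.5, 1.0]
normalized = normalize(returns)

# Made-up log-probabilities of the actions that were taken
log_probs = [math.log(0.5)] * 3
loss = reinforce_loss(log_probs, returns)

print(f"Returns: {returns}")
print(f"Loss with raw returns: {loss:.4f}")
```

Earlier steps receive larger returns because they are credited with everything that follows. Normalizing the returns is optional but often reduces the variance of the updates.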

RL Algorithm Comparison

| Algorithm | Type | State Space | Action Space | When to Use |
| --- | --- | --- | --- | --- |
| Q-Learning | Value-based | Discrete | Discrete | Small state spaces, tabular problems |
| DQN | Value-based | Continuous | Discrete | High-dimensional input (images), discrete actions |
| Policy Gradient | Policy-based | Continuous | Both | Continuous control, stochastic policies |
| PPO | Actor-Critic | Continuous | Both | Stable training, general purpose |
| SAC | Actor-Critic | Continuous | Continuous | Sample-efficient continuous control |

Common Errors and Mistakes

| Mistake | Why It Happens | How to Fix |
| --- | --- | --- |
| Learning rate too high | Policy collapses to deterministic early | Reduce the learning rate, use adaptive optimizers |
| No exploration | Agent never discovers better actions | Use epsilon-greedy or an entropy bonus |
| Replay buffer too small | Forgets important experiences | Set capacity to 100K+ |
| Target network not used | DQN training diverges | Freeze the target network, update every N steps |
| Reward not normalized | Gradient magnitudes vary wildly | Normalize returns to mean=0, std=1 |

Practice Questions

  1. What is the difference between a policy and a value function in RL?

Answer: A policy directly maps states to actions. A value function estimates the expected return from a state (V) or state-action pair (Q). The policy can be derived from the value function (greedy with respect to Q) or learned directly.

  2. Why does experience replay improve DQN training?

Answer: Experience replay breaks the temporal correlation between consecutive samples by storing and randomly sampling past experiences. It also reuses experiences multiple times, improving sample efficiency.

  3. What is the exploration-exploitation trade-off?

Answer: The agent must balance exploring unknown actions to discover better rewards against exploiting actions already known to be good. Epsilon-greedy takes a random action with probability epsilon and the greedy action otherwise; epsilon is often decayed during training so that exploration shrinks as the estimates improve. Policy gradient methods explore through their stochastic policies.

  4. How does the discount factor gamma affect agent behavior?

Answer: Gamma close to 1 makes the agent consider long-term rewards (far-sighted). Gamma close to 0 makes it focus on immediate rewards (myopic). Values around 0.99 are a common default, though the best choice depends on how far ahead the environment's rewards lie.

  5. What is the difference between on-policy and off-policy learning?

Answer: On-policy algorithms learn about the policy being executed (for example, SARSA). Off-policy algorithms learn about the optimal policy while following a different behavior policy (for example, Q-learning and DQN). This is why off-policy methods can reuse old experiences from a replay buffer.
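The on-policy versus off-policy distinction from the last question shows up directly in the TD target. The sketch below compares the two targets for one transition. The Q-values, the state names, and the helper functions are hypothetical and chosen only to make the difference visible.

```python
def q_learning_target(q, reward, next_state, gamma):
    """Off-policy: bootstrap from the best next action, whatever was actually taken."""
    return reward + gamma * max(q[next_state])

def sarsa_target(q, reward, next_state, next_action, gamma):
    """On-policy: bootstrap from the action the behavior policy actually took next."""
    return reward + gamma * q[next_state][next_action]

# Made-up Q-table: in state "s2", action 1 looks much better than action 0
q = {"s1": [0.0, 0.0], "s2": [2.0, 5.0]}
reward, gamma = 1.0, 0.9

# Suppose the behavior policy explored and picked the worse action 0 in "s2"
next_action = 0

off_policy = q_learning_target(q, reward, "s2", gamma)
on_policy = sarsa_target(q, reward, "s2", next_action, gamma)

print(f"Q-learning target: {off_policy:.2f}")  # 5.50
print(f"SARSA target:      {on_policy:.2f}")   # 2.80
```

Q-learning ignores the exploratory action and evaluates the greedy policy, while SARSA evaluates the policy that is actually being followed, exploration included.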

Challenge

Implement a DQN agent to solve the CartPole environment from OpenAI Gym (now maintained as Gymnasium). The agent must learn to balance a pole on a moving cart by applying left or right forces. Set up a replay buffer, target network, and epsilon decay schedule. Train until the agent's average reward reaches at least 475 over 100 consecutive episodes, the usual solve threshold for CartPole-v1, where 500 is the maximum reward per episode.
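As a starting point for the epsilon decay schedule mentioned in the challenge, here is one possible sketch: a linear decay from a high starting value to a small floor. The function name and the default values are assumptions, not tuned settings for CartPole, and an exponential schedule would be an equally reasonable choice.

```python
def epsilon_at(step, eps_start=1.0, eps_end=0.05, decay_steps=10_000):
    """Linear decay from eps_start to eps_end over decay_steps, then constant."""
    fraction = min(step / decay_steps, 1.0)
    return eps_start + fraction * (eps_end - eps_start)

for step in (0, 5_000, 10_000, 20_000):
    print(f"step {step:>6}: epsilon = {epsilon_at(step):.3f}")
```

The agent acts almost entirely at random at first, which fills the replay buffer with varied experience, and keeps a small amount of exploration once the schedule reaches its floor.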

Real-World Task

Design a reinforcement learning system for dynamic pricing in an e-commerce platform. The agent adjusts prices for products based on demand, competitor prices, and inventory levels. The reward function balances profit margin, sales volume, and customer satisfaction. Use a policy gradient method and simulate the environment with historical transaction data to train the pricing policy.

Next Steps

Explore deep RL with TensorFlow and PyTorch implementations. Deploy RL agents using Docker containers and monitor training with MLflow experiment tracking.

What is the difference between model-based and model-free RL?

Model-based RL learns a model of the environment (transition dynamics) and uses it for planning. Model-free RL learns directly from experience without modeling the environment. Model-free is simpler and more widely used, while model-based can be more sample-efficient.
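To illustrate the planning side of model-based RL, the sketch below runs value iteration on a hypothetical four-state chain whose transitions and rewards are fully known. No environment interaction happens at all, which is the contrast with model-free methods such as Q-learning. The `model` dictionary and the `value_iteration` helper are assumptions made for this example; in practice the model would be learned from data and would usually be stochastic.

```python
GOAL = 3  # terminal state with value 0

# Known model: model[state][action] = (next_state, reward)
model = {
    0: {"left": (0, -0.1), "right": (1, -0.1)},
    1: {"left": (0, -0.1), "right": (2, -0.1)},
    2: {"left": (1, -0.1), "right": (3, 1.0)},
}

def value_iteration(model, gamma=0.9, tol=1e-8):
    """Plan with the model: repeat Bellman backups until the values stop changing."""
    values = {state: 0.0 for state in model}
    values[GOAL] = 0.0
    while True:
        delta = 0.0
        for state, actions in model.items():
            best = max(r + gamma * values[ns] for ns, r in actions.values())
            delta = max(delta, abs(best - values[state]))
            values[state] = best
        if delta < tol:
            break
    # Greedy policy with respect to the converged values
    policy = {
        state: max(actions, key=lambda a: actions[a][1] + gamma * values[actions[a][0]])
        for state, actions in model.items()
    }
    return values, policy

values, policy = value_iteration(model)
print({s: round(v, 3) for s, v in values.items()})
print(policy)
```

The planner finds that moving right is best in every state after only a few sweeps over the model. A model-free agent has to discover the same thing through many sampled transitions.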

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
