Reinforcement Learning: Q-Learning, Deep RL and Practical Applications
In this tutorial, you'll learn about reinforcement learning, from tabular Q-learning to deep RL. We cover key concepts, practical examples, and best practices to help you understand and apply the topic effectively.
Reinforcement learning trains agents to make sequential decisions by interacting with an environment, learning optimal behavior through trial and error guided by reward signals.
What You'll Learn
In this tutorial, you'll learn reinforcement learning fundamentals from tabular Q-learning to deep Q-networks and policy gradients, and explore practical applications in game playing, robotics, and recommendation systems using Python.
Why It Matters
Reinforcement learning is behind systems such as AlphaGo and is applied in robotics control, autonomous-driving research, and personalized recommendation engines. Unlike supervised learning, which learns from labeled examples, RL learns from experience. That makes it well suited to problems where the best action depends on future consequences and cannot be determined from a single input.
Real-World Use
Recommendation systems use reinforcement learning to optimize long-term user engagement. Instead of recommending the most likely click (greedy), the RL agent balances showing familiar content for immediate engagement versus exploring new content that might improve future recommendations. Doda Browser uses similar principles to suggest frequently visited pages based on browsing patterns.
Markov Decision Processes
An MDP formalizes the RL problem as a tuple (S, A, P, R, gamma) where S is the set of states, A is the set of actions, P(s' | s, a) is the transition probability, R(s, a) is the reward function, and gamma is the discount factor. The discount factor gamma determines how much the agent values future rewards. A gamma close to 1 makes the agent far-sighted, while gamma close to 0 makes it myopic. The agent's goal is to find a policy pi(s) that maps states to actions and maximizes the expected cumulative discounted reward.
```mermaid
flowchart TD
    A[Agent] -->|Action a| B[Environment]
    B -->|Reward r, Next State s'| A
    B --> C[State s]
    C --> A
    style A fill:#4a90d9,color:#fff
    style B fill:#e67e22,color:#fff
```
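To make the effect of the discount factor concrete, here is a minimal sketch that computes the discounted return of one reward sequence under two values of gamma. The helper name `discounted_return` and the reward sequence are illustrative assumptions, not part of any library.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite reward sequence."""
    total = 0.0
    # Work backwards so each step is one multiply and one add.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical episode: three small step costs, then a large delayed reward.
rewards = [-1.0, -1.0, -1.0, 10.0]

far_sighted = discounted_return(rewards, gamma=0.99)
myopic = discounted_return(rewards, gamma=0.1)

print(f"gamma=0.99 -> {far_sighted:.3f}")  # about 6.733
print(f"gamma=0.10 -> {myopic:.3f}")       # about -1.100
```

With gamma close to 1 the delayed reward dominates and the episode looks worthwhile. With gamma close to 0 the same episode looks like a net loss, because the agent mostly sees the immediate step costs.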
Tabular Q-Learning
Q-learning learns the optimal action-value function Q(s, a) without a model of the environment. It is model-free: the agent does not need to know the transition probabilities. The Q-value is the expected total discounted future reward for taking action a in state s and then following the optimal policy. The update rule moves the current Q-value toward the TD target, which is the observed reward plus the discounted maximum Q-value of the next state. The learning rate alpha controls how much new information overrides old estimates.
```python
import numpy as np


class QLearningAgent:
    """Tabular Q-learning agent with an epsilon-greedy behavior policy."""

    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.99, epsilon=0.1):
        self.q_table = np.zeros((n_states, n_actions))
        self.alpha = alpha      # learning rate
        self.gamma = gamma      # discount factor
        self.epsilon = epsilon  # exploration probability
        self.n_actions = n_actions

    def select_action(self, state):
        # Explore with probability epsilon, otherwise act greedily.
        if np.random.random() < self.epsilon:
            return np.random.randint(self.n_actions)
        return np.argmax(self.q_table[state])

    def update(self, state, action, reward, next_state):
        # Move Q(s, a) toward the TD target r + gamma * max_a' Q(s', a').
        best_next = np.max(self.q_table[next_state])
        td_target = reward + self.gamma * best_next
        td_error = td_target - self.q_table[state][action]
        self.q_table[state][action] += self.alpha * td_error
        return td_error


env_states = 6
env_actions = 2
agent = QLearningAgent(env_states, env_actions, alpha=0.1, gamma=0.95, epsilon=0.2)

np.random.seed(42)
for episode in range(500):
    state = np.random.randint(0, env_states)
    total_reward = 0
    for step in range(100):
        action = agent.select_action(state)
        # Toy environment: the next state is drawn uniformly at random
        # and does not depend on the action taken.
        next_state = np.random.randint(0, env_states)
        reward = 1.0 if next_state == env_states - 1 else -0.1
        td_error = agent.update(state, action, reward, next_state)
        total_reward += reward
        state = next_state
        if state == env_states - 1:  # reaching the last state ends the episode
            break

print(f"Q-table shape: {agent.q_table.shape}")
print(f"Q-table (first 3 states):\n{agent.q_table[:3]}")
print(f"Greedy policy: {np.argmax(agent.q_table, axis=1)}")
```
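Expected output (illustrative; exact values depend on the random draws. Because this toy environment ignores the action, the two values in each row should be close but will generally not match digit for digit, so the greedy policy is essentially arbitrary):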
Q-table shape: (6, 2)
Q-table (first 3 states):
[[ 1.864 1.864]
[ 1.940 1.940]
[ 2.042 2.042]]
Greedy policy: [0 0 0 0 0 0]
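Because the toy environment above draws the next state independently of the action, it cannot show the agent learning a meaningful policy. The following sketch uses a small deterministic chain where actions do matter. The chain layout, the `step` function, and all constants are assumptions made up for illustration. It also skips bootstrapping on terminal transitions, which is the usual convention for episodic tasks.

```python
import numpy as np

N_STATES = 5          # states 0..4; state 4 is the terminal goal
LEFT, RIGHT = 0, 1


def step(state, action):
    """Hypothetical deterministic chain: move left or right, reward 1 at the goal."""
    next_state = max(state - 1, 0) if action == LEFT else state + 1
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


rng = np.random.default_rng(0)
q_table = np.zeros((N_STATES, 2))
alpha, gamma = 0.5, 0.9

for episode in range(500):
    state = 0
    for _ in range(200):
        # Behavior policy: uniformly random. Q-learning is off-policy,
        # so it can still learn the greedy values from this data.
        action = int(rng.integers(2))
        next_state, reward, done = step(state, action)
        # Do not bootstrap from a terminal state: its future value is zero.
        best_next = 0.0 if done else np.max(q_table[next_state])
        td_target = reward + gamma * best_next
        q_table[state, action] += alpha * (td_target - q_table[state, action])
        state = next_state
        if done:
            break

greedy_policy = np.argmax(q_table[:-1], axis=1)  # terminal row excluded
print("Greedy policy for states 0-3:", greedy_policy)
print("Q(3, RIGHT):", round(float(q_table[3, RIGHT]), 3))
```

After training, the greedy policy should pick RIGHT in every non-terminal state, and Q(3, RIGHT) should approach 1.0, the reward for stepping into the goal.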
Deep Q-Networks (DQN)
Deep Q-Networks replace the Q-table with a neural network, enabling RL in high-dimensional state spaces like images. DQN introduced two key innovations: experience replay (storing past experiences and sampling them randomly to break correlation) and a target network (a separate frozen network for computing TD targets, updated periodically to stabilize training). The agent stores transitions (s, a, r, s') in a replay buffer and samples mini-batches for training.
```python
import collections
import random


class ReplayBuffer:
    """Fixed-size buffer of transitions; the oldest entries are evicted first."""

    def __init__(self, capacity=10000):
        self.buffer = collections.deque(maxlen=capacity)

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks temporal correlation.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)


buffer = ReplayBuffer(capacity=1000)
for i in range(100):
    # Placeholder transitions (state, action, reward, next_state).
    buffer.push((f"s{i}", f"a{i}", 0.0, f"s{i+1}"))

if len(buffer) >= 32:
    batch = buffer.sample(32)
    # Regroup the list of transitions into per-field tuples.
    states, actions, rewards, next_states = zip(*batch)
    print(f"Batch size: {len(states)}")
    print(f"Sample state: {states[0]}")
    print(f"Sample action: {actions[0]}")

print(f"Buffer size: {len(buffer)}")
```
Expected output (the sampled state and action vary between runs because no random seed is set):
Batch size: 32
Sample state: s42
Sample action: a42
Buffer size: 100
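The replay buffer covers only the first of the two DQN ideas. The second, the target network, can be sketched without a deep learning framework. In the example below a linear Q-function stands in for the neural network, and the names `td_targets`, `online_w`, `target_w`, and `SYNC_EVERY` are hypothetical. The line that nudges `online_w` is a placeholder for a real gradient step.

```python
import numpy as np


def td_targets(rewards, next_q_values, dones, gamma=0.99):
    """r + gamma * max_a' Q_target(s', a'), with no bootstrap on terminal steps."""
    rewards = np.asarray(rewards, dtype=np.float64)
    dones = np.asarray(dones, dtype=np.float64)
    best_next = np.max(next_q_values, axis=1)
    return rewards + gamma * (1.0 - dones) * best_next


rng = np.random.default_rng(0)

# Stand-in for a network: Q(s, .) = s @ W with 4 state features and 2 actions.
online_w = rng.normal(size=(4, 2))
target_w = online_w.copy()  # frozen copy used only for computing targets

# A made-up mini-batch of three transitions.
next_states = rng.normal(size=(3, 4))
rewards = [1.0, 0.0, -1.0]
dones = [False, False, True]
targets = td_targets(rewards, next_states @ target_w, dones, gamma=0.9)
print("TD targets:", targets)

SYNC_EVERY = 100  # assumed sync interval, tuned per task in practice
for step in range(1, 301):
    online_w += 0.01  # placeholder for one gradient update of the online network
    if step % SYNC_EVERY == 0:
        target_w = online_w.copy()  # periodic hard update of the target network
```

Between syncs the targets are computed from weights that do not move, so the regression target stays fixed while the online network is trained toward it.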
Policy Gradient Methods
Policy gradient methods directly optimize the policy without learning a value function. The policy is parameterized by a neural network that outputs action probabilities. The REINFORCE algorithm uses Monte Carlo returns: it runs an entire episode, computes the discounted return for each step, and updates the policy to increase the probability of actions that led to higher returns. Policy gradients naturally handle continuous action spaces and stochastic policies, unlike Q-learning, which requires an argmax over discrete actions.
```python
import tensorflow as tf
from tensorflow import keras


class PolicyNetwork(keras.Model):
    """Small MLP that maps a state to a probability distribution over actions."""

    def __init__(self, n_actions):
        super().__init__()
        self.dense1 = keras.layers.Dense(24, activation='relu')
        self.dense2 = keras.layers.Dense(24, activation='relu')
        self.logits = keras.layers.Dense(n_actions)

    def call(self, state):
        x = self.dense1(state)
        x = self.dense2(x)
        # Softmax turns the raw logits into action probabilities.
        return tf.nn.softmax(self.logits(x))


n_actions = 4
policy_net = PolicyNetwork(n_actions)

# A single dummy state with 5 features (batch size 1).
dummy_state = tf.constant([[0.1, 0.2, 0.3, 0.4, 0.5]])
action_probs = policy_net(dummy_state)

# tf.random.categorical expects log-probabilities (unnormalized logits).
action = tf.random.categorical(tf.math.log(action_probs), 1)

print(f"Action probabilities: {action_probs.numpy().round(3)}")
print(f"Sampled action: {action.numpy()[0, 0]}")
print(f"Sum of probs: {action_probs.numpy().sum():.3f}")
```
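Expected output (the probabilities and the sampled action vary from run to run because the network weights are randomly initialized; the probabilities always sum to 1):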
Action probabilities: [[0.253 0.247 0.251 0.249]]
Sampled action: 2
Sum of probs: 1.000
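The network above only produces action probabilities; it does not yet learn. As a rough sketch of the REINFORCE update itself, the example below uses NumPy and a stateless softmax policy instead of a neural network, so the gradient can be written by hand. The function names and the three-step episode are assumptions for illustration only.

```python
import numpy as np


def returns_to_go(rewards, gamma):
    """Discounted return G_t from each time step to the end of the episode."""
    out = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out


def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()


def reinforce_update(theta, actions, returns, lr):
    """One gradient-ascent step on sum_t G_t * log pi(a_t) for softmax logits theta."""
    probs = softmax(theta)
    grad = np.zeros_like(theta)
    for action, g in zip(actions, returns):
        # Gradient of log softmax with respect to the logits: onehot(a) - probs.
        grad_log_prob = -probs.copy()
        grad_log_prob[action] += 1.0
        grad += g * grad_log_prob
    return theta + lr * grad


# Made-up episode: action 1 was rewarded twice, action 0 earned nothing.
actions = [1, 1, 0]
rewards = [1.0, 1.0, 0.0]

theta = np.zeros(2)  # uniform policy to start
returns = returns_to_go(rewards, gamma=0.5)
new_theta = reinforce_update(theta, actions, returns, lr=0.1)

print("Returns-to-go:", returns)                  # [1.5 1.  0. ]
print("Updated policy:", softmax(new_theta).round(3))
```

Actions followed by larger returns receive a larger push, so the updated policy assigns more probability to action 1 than the initial uniform policy did.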
RL Algorithm Comparison
| Algorithm | Type | State Space | Action Space | When to Use |
|---|---|---|---|---|
| Q-Learning | Value-based | Discrete | Discrete | Small state spaces, tabular problems |
| DQN | Value-based | Continuous | Discrete | High-dim input (images), discrete actions |
| Policy Gradient | Policy-based | Continuous | Both | Continuous control, stochastic policies |
| PPO | Actor-Critic | Continuous | Both | Stable training, general purpose |
| SAC | Actor-Critic | Continuous | Continuous | Sample-efficient continuous control |
Common Errors and Mistakes
| Mistake | What Happens | How to Fix |
|---|---|---|
| Learning rate too high | Policy collapses to deterministic early | Reduce lr, use adaptive optimizers |
| No exploration | Agent never discovers better actions | Use epsilon-greedy or entropy bonus |
| Replay buffer too small | Forgets important experiences | Set capacity to 100K+ |
| Target network not used | DQN training diverges | Freeze target network, update every N steps |
| Reward not normalized | Gradient magnitudes vary wildly | Normalize returns to mean=0, std=1 |
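Two of the fixes in the table are short enough to sketch directly: an exploration schedule and return normalization. The helper names, the linear shape of the schedule, and the default values below are illustrative assumptions rather than recommended settings.

```python
import numpy as np


def linear_epsilon(step, start=1.0, end=0.05, decay_steps=10_000):
    """Linearly anneal epsilon from start to end, then hold it at end."""
    fraction = min(step / decay_steps, 1.0)
    return start + fraction * (end - start)


def normalize_returns(returns, eps=1e-8):
    """Shift and scale returns to roughly zero mean and unit standard deviation."""
    returns = np.asarray(returns, dtype=np.float64)
    return (returns - returns.mean()) / (returns.std() + eps)


for step in (0, 5_000, 10_000, 50_000):
    print(f"step {step:>6}: epsilon = {linear_epsilon(step):.3f}")

normalized = normalize_returns([10.0, 200.0, 35.0, 400.0])
print("Normalized returns:", normalized.round(3))
```

The small `eps` term keeps the division safe when every return in a batch is identical.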
Practice Questions
- What is the difference between a policy and a value function in RL?
Answer: A policy directly maps states to actions. A value function estimates the expected return from a state (V) or state-action pair (Q). The policy can be derived from the value function (greedy with respect to Q) or learned directly.
- Why does experience replay improve DQN training?
Answer: Experience replay breaks the temporal correlation between consecutive samples by storing and randomly sampling past experiences. It also reuses experiences multiple times, improving sample efficiency.
- What is the exploration-exploitation trade-off?
Answer: The agent must balance exploring unknown actions to discover better rewards against exploiting actions already known to be good. Epsilon-greedy explores with probability epsilon, and epsilon is often decayed over time. Policy gradient methods explore naturally through their stochastic policies.
- How does the discount factor gamma affect agent behavior?
Answer: Gamma close to 1 makes the agent consider long-term rewards (far-sighted). Gamma close to 0 makes it focus on immediate rewards (myopic). Values around 0.99 are a common default.
- What is the difference between on-policy and off-policy learning?
Answer: On-policy algorithms learn about the policy being executed (SARSA). Off-policy algorithms learn about the optimal policy while following a different behavior policy (Q-learning, DQN). Off-policy can reuse old experiences.
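The on-policy versus off-policy distinction in the last answer comes down to which next-state value is used in the TD target. A minimal sketch, with a made-up two-state Q-table and hypothetical helper names:

```python
import numpy as np


def q_learning_target(q_table, reward, next_state, gamma):
    """Off-policy: bootstrap from the best action in the next state."""
    return reward + gamma * np.max(q_table[next_state])


def sarsa_target(q_table, reward, next_state, next_action, gamma):
    """On-policy: bootstrap from the action the agent actually takes next."""
    return reward + gamma * q_table[next_state, next_action]


# Made-up values: in state 1, action 1 looks much better than action 0.
q_table = np.array([[0.0, 0.0],
                    [2.0, 5.0]])
reward, next_state, gamma = 1.0, 1, 0.9
next_action = 0  # suppose exploration picked the worse action

off_policy = q_learning_target(q_table, reward, next_state, gamma)
on_policy = sarsa_target(q_table, reward, next_state, next_action, gamma)

print(f"Q-learning target: {off_policy:.2f}")  # 1 + 0.9 * 5 = 5.50
print(f"SARSA target:      {on_policy:.2f}")   # 1 + 0.9 * 2 = 2.80
```

When the next action happens to be the greedy one, the two targets coincide. They differ only on exploratory steps, which is why SARSA's estimates reflect the cost of exploration and Q-learning's do not.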
Challenge
Implement a DQN agent to solve the CartPole environment (CartPole-v1) from OpenAI Gym, now maintained as Gymnasium. The agent must learn to balance a pole on a moving cart by applying left or right forces. Set up a replay buffer, target network, and epsilon decay schedule. Train until the agent reaches the maximum episode return of 500 in 100 consecutive episodes.
Real-World Task
Design a reinforcement learning system for dynamic pricing in an e-commerce platform. The agent adjusts prices for products based on demand, competitor prices, and inventory levels. The reward function balances profit margin, sales volume, and customer satisfaction. Use a policy gradient method and simulate the environment with historical transaction data to train the pricing policy.
Next Steps
Explore deep RL with TensorFlow and PyTorch implementations. Deploy RL agents using Docker containers and monitor training with MLflow experiment tracking.
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.