Decision Trees and Random Forests Explained — Complete Guide
This tutorial explains how decision trees and random forests work. It covers the key concepts, practical examples, and best practices you need to understand and apply both models effectively.
Decision trees split data into branches based on feature values, creating a tree-like structure in which each internal node tests a feature and each leaf holds a prediction. Random forests combine many such trees (often hundreds) into a model that is usually more accurate and more stable than any single tree.
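To make the idea of "splitting on a feature value" concrete, here is a minimal sketch of how one split can be chosen. It assumes Gini impurity as the quality measure (the default criterion of scikit-learn's DecisionTreeClassifier) and uses a small made-up sample; the helper names `gini` and `best_threshold` are hypothetical, and a real implementation is considerably more optimized.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted impurity) for the best split `value <= threshold`."""
    distinct = sorted(set(values))
    best = (None, gini(labels))  # baseline: impurity of the unsplit node
    # Candidate thresholds are midpoints between consecutive distinct values.
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best[1]:
            best = (threshold, weighted)
    return best


# Hypothetical toy data: petal lengths (cm) and species labels.
petal_length = [1.4, 1.3, 1.5, 4.7, 4.5, 5.1]
species = ["setosa", "setosa", "setosa", "versicolor", "versicolor", "virginica"]

threshold, impurity = best_threshold(petal_length, species)
print(f"Parent impurity: {gini(species):.3f}")
print(f"Best split: petal length <= {threshold:.2f} (weighted impurity {impurity:.3f})")
# Parent impurity: 0.611
# Best split: petal length <= 3.00 (weighted impurity 0.222)
```

A tree repeats this search over every feature at every node, then recurses into the two resulting subsets until a stopping rule such as a maximum depth is reached.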
What You'll Learn
You'll learn how decision trees work, what makes them prone to overfitting, how random forests address that weakness, and how to build both models in scikit-learn on a real dataset.
Why It Matters
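Random forests are among the most widely used ML algorithms in production because they work well with both numeric and categorical data, require minimal preprocessing (no feature scaling, for example), and provide feature importance rankings out of the box. In scikit-learn, categorical features still need to be encoded as numbers before training.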
Real-World Use
Durga Antivirus Pro uses a random forest classifier as one of its detection layers. The model analyzes file metadata, entropy, and structural patterns to flag suspicious files, processing millions of files daily.
Decision Tree vs Random Forest
flowchart TD
    subgraph Single Tree
        A1[Root Node] --> B1[Split 1]
        B1 --> C1[Leaf 1]
        B1 --> C2[Leaf 2]
        A1 --> D1[Split 2]
        D1 --> E1[Leaf 3]
        D1 --> E2[Leaf 4]
    end
    subgraph Random Forest
        F1[Tree 1] --> G1[Vote]
        F2[Tree 2] --> G1
        F3[Tree 3] --> G1
        F4[Tree n] --> G1
        G1 --> H1[Final Prediction]
    end
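The voting step in the diagram can be sketched in a few lines. This is a simplified illustration with made-up predictions, not scikit-learn's implementation: its RandomForestClassifier averages the trees' predicted class probabilities rather than counting hard votes. The function name `majority_vote` is hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the class predicted by the most trees (ties go to the first class seen)."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical predictions from five trees for a single sample.
tree_predictions = ["versicolor", "virginica", "versicolor", "versicolor", "virginica"]
print(majority_vote(tree_predictions))  # versicolor
```

Because each tree makes somewhat different mistakes, the combined vote tends to be more reliable than any individual tree's prediction.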
Building a Decision Tree
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the iris dataset and hold out 20% of it for testing.
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# Limit the depth to 3 so the tree stays small and readable.
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)
test_acc = tree.score(X_test, y_test)

# Convert NumPy floats to plain Python floats so the dict prints cleanly.
tree_importances = {
    name: round(float(value), 3)
    for name, value in zip(iris.feature_names, tree.feature_importances_)
}

print(f"Training accuracy: {train_acc:.3f}")
print(f"Test accuracy: {test_acc:.3f}")
print(f"Tree depth: {tree.get_depth()}")
print(f"Number of leaves: {tree.get_n_leaves()}")
print(f"Feature importances: {tree_importances}")
Expected output:
Training accuracy: 0.983
Test accuracy: 1.000
Tree depth: 3
Number of leaves: 8
Feature importances: {'sepal length (cm)': 0.0, 'sepal width (cm)': 0.0, 'petal length (cm)': 0.555, 'petal width (cm)': 0.445}
Petal length and petal width are the most important features. The tree ignores sepal measurements entirely at depth 3.
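Importances summarize which features the tree used, but not how. One way to inspect the actual decision rules is scikit-learn's `export_text`, which prints the fitted tree as nested if/else conditions. The sketch below rebuilds the same depth-3 tree so that it runs on its own; the exact thresholds it prints depend on the fitted model.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)
tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)

# Render the fitted tree as indented text, one line per split or leaf.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Each indented line is a condition on one feature, and each `class:` line is a leaf. Reading the rules from top to bottom shows which measurements drive each prediction.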
Understanding Overfitting in Decision Trees
# No depth limit: the tree keeps splitting until every leaf is pure.
tree_deep = DecisionTreeClassifier(max_depth=None, random_state=42)
tree_deep.fit(X_train, y_train)
train_acc_deep = tree_deep.score(X_train, y_train)
test_acc_deep = tree_deep.score(X_test, y_test)

# Depth limited to 2: a much simpler model.
tree_shallow = DecisionTreeClassifier(max_depth=2, random_state=42)
tree_shallow.fit(X_train, y_train)
train_acc_shallow = tree_shallow.score(X_train, y_train)
test_acc_shallow = tree_shallow.score(X_test, y_test)

print(f"Deep tree (depth=inf): train={train_acc_deep:.3f}, test={test_acc_deep:.3f}")
print(f"Shallow tree (depth=2): train={train_acc_shallow:.3f}, test={test_acc_shallow:.3f}")
Expected output:
Deep tree (depth=inf): train=1.000, test=0.967
Shallow tree (depth=2): train=0.950, test=0.967
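The deep tree memorized the training data (100% training accuracy), but the extra depth did not help on unseen data: its test accuracy of 0.967 is no better than the depth-2 tree's and lower than the 1.000 scored by the depth-3 tree above. With only 30 test samples, that difference amounts to a single misclassified flower, so treat it as a hint rather than proof. A training accuracy that keeps rising while test accuracy stalls or drops is the telltale sign of overfitting.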
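A single small test split is a noisy yardstick for comparing depths. As a hedged sketch of a more robust comparison, the snippet below scores several depths with 5-fold cross-validation on the full iris dataset. The choice of depths and of `cv=5` is an assumption made for illustration, not something the original experiment used.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

results = {}
for depth in [1, 2, 3, 4, 5, None]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    # Each of the 5 folds serves once as the held-out set.
    scores = cross_val_score(model, iris.data, iris.target, cv=5)
    results[depth] = scores.mean()
    print(f"max_depth={depth}: mean CV accuracy = {scores.mean():.3f} (+/- {scores.std():.3f})")
```

A depth-1 tree can only separate two of the three species, so it scores far below the others. Beyond a moderate depth, the mean accuracy usually stops improving, which is a reasonable signal for where to cap `max_depth`.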
Random Forest
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Train 100 trees, each limited in depth and in how small a node may be split.
rf = RandomForestClassifier(
    n_estimators=100,
    max_depth=5,
    min_samples_split=5,
    random_state=42
)
rf.fit(X_train, y_train)

train_acc_rf = rf.score(X_train, y_train)
test_acc_rf = rf.score(X_test, y_test)
print(f"Random Forest training accuracy: {train_acc_rf:.3f}")
print(f"Random Forest test accuracy: {test_acc_rf:.3f}")

# Rank the features by their importance averaged over all trees.
importances = pd.DataFrame({
    'feature': iris.feature_names,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)
print("\nFeature Importances:")
print(importances)
Expected output:
Random Forest training accuracy: 0.983
Random Forest test accuracy: 1.000
Feature Importances:
             feature  importance
2  petal length (cm)       0.479
3   petal width (cm)       0.412
1   sepal width (cm)       0.078
0  sepal length (cm)       0.031
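On this split, the random forest generalizes at least as well as the single tree, matching the depth-3 tree's test accuracy. It also provides more stable feature importance estimates by averaging across trees trained on different bootstrap samples. Unlike the single tree, it assigns some importance to the sepal measurements.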
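A bootstrap sample is a random draw of rows with replacement, the same size as the original training set. The sketch below uses only the standard library and a hypothetical dataset of 1,000 row indices to show what that means in practice. It illustrates the sampling idea and is not scikit-learn's internal code.

```python
import random

random.seed(42)
n_rows = 1000
row_indices = list(range(n_rows))

# Draw n_rows indices with replacement: some rows repeat, others never appear.
bootstrap = random.choices(row_indices, k=n_rows)

unique_fraction = len(set(bootstrap)) / n_rows
print(f"Unique rows in the bootstrap sample: {unique_fraction:.1%}")
print(f"Rows left out (out-of-bag): {1 - unique_fraction:.1%}")
```

For large datasets, roughly 63% of the rows end up in each sample (the limit is 1 - 1/e), so every tree sees a different subset. The rows a tree never saw can serve as a built-in validation set, which scikit-learn exposes through the `oob_score=True` option of RandomForestClassifier.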
Key Hyperparameters
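- n_estimators: The number of trees. More trees generally improve accuracy and stability at the cost of training and prediction time, with diminishing returns after roughly 200-500 trees
- max_depth: The maximum depth of each tree. Deeper trees capture more patterns but are more likely to overfit
- min_samples_split: The minimum number of samples a node needs before it can be split. Higher values prevent splits on too few samples
- max_features: The number of features considered at each split. Lower values add randomness and reduce correlation between trees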
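One common way to choose these values is a cross-validated grid search. The sketch below uses scikit-learn's `GridSearchCV` with a deliberately small grid. The candidate values are assumptions picked for illustration, and sensible ranges depend on the dataset.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# Illustrative grid only: 2 x 2 x 2 x 2 = 16 combinations.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, None],
    "min_samples_split": [2, 5],
    "max_features": ["sqrt", None],
}

# Every combination is scored with 5-fold cross-validation on the training set.
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.3f}")
print(f"Test accuracy: {search.score(X_test, y_test):.3f}")
```

The test set is touched only once, after the search has finished, so the final score remains an honest estimate. On a dataset as small and easy as iris, many combinations will score almost identically.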
Practice Questions
- Why does a random forest usually outperform a single decision tree?
- What causes overfitting in decision trees and how do random forests mitigate it?
- How do you interpret feature importance from a random forest?
Related Topics
- Python — required to run examples
- scikit-learn Guide — provides both models
- What is Machine Learning — foundational concepts
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.