Outlier Detection and Treatment in Python
This tutorial covers outlier detection and treatment in Python: the key concepts, worked examples for each method, and guidance on when to keep or remove the points you find.
What You'll Learn
Detect and handle outliers in data using statistical methods, machine learning, and visualization: IQR, Z-score, Isolation Forest, and DBSCAN.
Why It Matters
Outliers can skew statistical analyses and hurt model performance. Detecting them isn't about removal — it's about understanding whether they're noise or signal.
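As a rough illustration of that skew, the sketch below appends one extreme value to a small made-up sample and compares how far the mean and the median move. The numbers are invented for the example.

```python
import numpy as np

# Hypothetical sample: nine ordinary readings
clean = np.array([48, 49, 50, 50, 51, 51, 52, 53, 55], dtype=float)
# Same sample with one extreme value appended
with_outlier = np.append(clean, 500.0)

print("mean  :", clean.mean(), "->", with_outlier.mean())
print("median:", np.median(clean), "->", np.median(with_outlier))
```

The mean jumps from 51.0 to 95.9 while the median stays at 51.0, which is why the robust methods later in this tutorial are built on the median.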
Real-World Use
Detecting fraudulent transactions (outliers = fraud), removing sensor noise from IoT data, or finding data entry errors in a customer database.
What Are Outliers?
Outliers are data points that differ significantly from other observations. They can be:
- Errors — sensor malfunction, data entry mistake
- Interesting — fraud, rare event, discovery
- Natural — legitimate extreme values (e.g., Bill Gates in a salary dataset)
Visualizing Outliers
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Generate data with outliers
np.random.seed(42)
data = np.random.randn(100) * 10 + 50
data = np.append(data, [120, 130, 5, 3]) # Add outliers
df = pd.DataFrame({"value": data})
# Box plot
plt.figure(figsize=(10, 4))
sns.boxplot(x=df["value"])
plt.title("Box Plot — Outliers Beyond Whiskers")
plt.show()
# Histogram
plt.figure(figsize=(10, 5))
sns.histplot(df["value"], bins=20)
plt.axvline(df["value"].mean(), color="red", linestyle="--", label="Mean")
plt.axvline(df["value"].median(), color="green", linestyle=":", label="Median")
plt.legend()
plt.show()
# Scatter
plt.figure(figsize=(12, 4))
plt.scatter(range(len(df)), df["value"], alpha=0.6)
plt.axhline(df["value"].mean() + 2 * df["value"].std(), color="r", linestyle="--", label="+2 Std")
plt.axhline(df["value"].mean() - 2 * df["value"].std(), color="r", linestyle="--", label="-2 Std")
plt.legend()
plt.show()
Method 1: IQR (Interquartile Range)
def detect_outliers_iqr(df, column):
    """Detect outliers using the IQR method."""
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    outliers = df[(df[column] < lower) | (df[column] > upper)]
    return outliers, lower, upper
outliers, lower, upper = detect_outliers_iqr(df, "value")
print(f"IQR bounds: [{lower:.2f}, {upper:.2f}]")
print(f"Outliers: {len(outliers)} points")
print(outliers)
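To make the arithmetic behind the fences concrete, here is a small worked example on a made-up nine-value sample. It assumes pandas' default linear interpolation for quantiles.

```python
import pandas as pd

# Hypothetical sample: eight ordinary values and one extreme value
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

q1 = s.quantile(0.25)   # 3.0 with the default linear interpolation
q3 = s.quantile(0.75)   # 7.0
iqr = q3 - q1           # 4.0
lower = q1 - 1.5 * iqr  # 3 - 6 = -3.0
upper = q3 + 1.5 * iqr  # 7 + 6 = 13.0

flagged = s[(s < lower) | (s > upper)]
print(f"fences: [{lower}, {upper}]")
print(flagged.tolist())  # [100]
```

Because the quartiles ignore how far the extreme value lies from the rest, the fences stay tight around the bulk of the data.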
Method 2: Z-Score (Standard Deviation)
from scipy import stats
def detect_outliers_zscore(df, column, threshold=3):
    """Detect outliers using Z-score method."""
    z_scores = np.abs(stats.zscore(df[column]))
    outliers = df[z_scores > threshold]
    return outliers
outliers = detect_outliers_zscore(df, "value", threshold=3)
print(f"Z-score outliers: {len(outliers)} points")
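One caveat with the Z-score is worth seeing in numbers: the outlier inflates the very mean and standard deviation it is measured against, so in a small sample an obvious outlier may never reach the threshold. The sketch below uses made-up data and NumPy's population standard deviation (ddof=0), which is also the default in scipy.stats.zscore.

```python
import numpy as np

# Hypothetical sample: eight ordinary values and one extreme value
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100], dtype=float)

z = (x - x.mean()) / x.std()  # np.std uses ddof=0 by default
print(np.round(z, 2))
print("flagged at |z| > 3:", x[np.abs(z) > 3])  # nothing is flagged
```

The value 100 scores only about 2.82. With ddof=0 the largest possible |z| in a sample of n points is sqrt(n - 1), so a threshold of 3 cannot flag anything when n is 10 or fewer.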
Method 3: Modified Z-Score (Robust)
def detect_outliers_modified_zscore(df, column, threshold=3.5):
    """Detect outliers using median absolute deviation (MAD)."""
    median = df[column].median()
    mad = np.median(np.abs(df[column] - median))
    # 0.6745 rescales the MAD so the score is comparable to a Z-score on normal data
    modified_z = 0.6745 * (df[column] - median) / mad
    outliers = df[np.abs(modified_z) > threshold]
    return outliers
outliers = detect_outliers_modified_zscore(df, "value")
print(f"Modified Z-score outliers: {len(outliers)} points")
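The function above divides by the MAD, which is zero whenever more than half of the values equal the median. The sketch below shows one way to handle that case, using a hypothetical helper named `modified_zscores` that returns None instead of dividing by zero. It also shows the score catching an outlier on a small made-up sample.

```python
import numpy as np

def modified_zscores(values):
    """Return modified Z-scores, or None when the MAD is zero."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    if mad == 0:
        return None  # score undefined: more than half the values equal the median
    return 0.6745 * (values - median) / mad

x = [1, 2, 3, 4, 5, 6, 7, 8, 100]
scores = modified_zscores(x)
print(np.round(scores, 2))
print([v for v, s in zip(x, scores) if abs(s) > 3.5])  # [100]

print(modified_zscores([5, 5, 5, 5, 9]))  # None, because the MAD is 0
```

Returning None is only one choice. Raising an error or falling back to another method would be equally reasonable, depending on the pipeline.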
Method 4: Isolation Forest
Good for high-dimensional data:
from sklearn.ensemble import IsolationForest
def detect_outliers_isolation_forest(df, columns, contamination=0.05):
    """Detect outliers using Isolation Forest."""
    model = IsolationForest(
        contamination=contamination,
        random_state=42,
    )
    # fit_predict returns -1 for outliers and 1 for inliers;
    # note that this adds the column to the caller's DataFrame
    df["outlier_score"] = model.fit_predict(df[columns])
    outliers = df[df["outlier_score"] == -1]
    return outliers
# Example call with a single column; pass several column names for multivariate data
outliers = detect_outliers_isolation_forest(
    df, ["value"], contamination=0.1
)
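Since Isolation Forest is aimed at multi-column data, a two-feature example may be more representative. The sketch below uses synthetic data with three planted anomalies. The column names `x` and `y` and the contamination value are assumptions made for illustration. It also reads the continuous anomaly score from `decision_function`, where lower values mean more anomalous.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Hypothetical two-feature data: 200 ordinary points around the origin
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
# Three planted anomalies far from the bulk
planted = np.array([[8.0, 8.0], [-9.0, 7.0], [10.0, -10.0]])
df2 = pd.DataFrame(np.vstack([normal, planted]), columns=["x", "y"])

model = IsolationForest(contamination=0.05, random_state=42)
labels = model.fit_predict(df2[["x", "y"]])        # -1 = outlier, 1 = inlier
scores = model.decision_function(df2[["x", "y"]])  # lower = more anomalous

result = df2.assign(label=labels, score=scores)
print(result[result["label"] == -1].sort_values("score").head())
```

Note that `contamination` fixes the share of points that get flagged, so with 0.05 roughly ten points are labeled -1 here even though only three were planted. Sorting by score helps separate the clear anomalies from the borderline ones.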
Method 5: DBSCAN
Good for spatial outliers:
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
def detect_outliers_dbscan(df, columns, eps=0.5, min_samples=5):
    """Detect outliers using DBSCAN clustering."""
    # Scale first so eps means the same distance in every column
    scaled = StandardScaler().fit_transform(df[columns])
    clustering = DBSCAN(eps=eps, min_samples=min_samples).fit(scaled)
    # DBSCAN labels points that belong to no cluster as -1 (noise)
    df["cluster"] = clustering.labels_
    outliers = df[df["cluster"] == -1]
    return outliers
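The DBSCAN function above has no usage example, so here is a minimal self-contained sketch on synthetic two-dimensional data: two tight clusters plus three isolated points. The data and the `eps` and `min_samples` values are assumptions chosen so the result is easy to read. Real data usually needs `eps` tuned.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical data: two tight clusters plus three isolated points
cluster_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(100, 2))
cluster_b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(100, 2))
isolated = np.array([[2.5, 10.0], [10.0, 0.0], [-5.0, 5.0]])
X = np.vstack([cluster_a, cluster_b, isolated])

# Standardize so eps is measured in the same units on both axes
X_scaled = StandardScaler().fit_transform(X)
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_scaled)

clusters = sorted(set(labels) - {-1})
print("clusters found:", clusters)
print("noise points:")
print(X[labels == -1])
```

The isolated points have no neighbors within `eps`, so they belong to no cluster and receive the noise label -1.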
Treating Outliers
# Option 1: Remove
df_clean = df[(df["value"] >= lower) & (df["value"] <= upper)]
# Option 2: Cap (Winsorize)
df["value_capped"] = df["value"].clip(lower, upper)
# Option 3: Replace with median
median = df["value"].median()
df["value_filled"] = df["value"].copy()
mask = (df["value"] < lower) | (df["value"] > upper)
df.loc[mask, "value_filled"] = median
# Option 4: Winsorization with scipy
from scipy.stats.mstats import winsorize
df["value_winsorized"] = winsorize(df["value"], limits=[0.05, 0.05])
# Compare results
print("Original mean:", df["value"].mean())
print("After removal:", df_clean["value"].mean())
print("After capping:", df["value_capped"].mean())
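To see how the first three options differ in effect, the sketch below applies each one to a small made-up sample through a hypothetical helper, `treat_outliers`, and compares the resulting size and mean. The fences are the IQR fences for this particular sample.

```python
import pandas as pd

def treat_outliers(series, lower, upper, method="cap"):
    """Apply one treatment to values outside [lower, upper] (illustrative helper)."""
    mask = (series < lower) | (series > upper)
    if method == "remove":
        return series[~mask]
    if method == "cap":
        return series.clip(lower, upper)
    if method == "median":
        # Replace flagged values with the median of the original series
        return series.mask(mask, series.median())
    raise ValueError(f"unknown method: {method}")

s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100], dtype=float)
lower, upper = -3.0, 13.0  # IQR fences for this sample (Q1=3, Q3=7, IQR=4)

for method in ("remove", "cap", "median"):
    treated = treat_outliers(s, lower, upper, method)
    print(f"{method:>6}: n={len(treated)}, mean={treated.mean():.2f}")
```

Removal shrinks the sample, while capping and median replacement keep every row. That matters when other columns in the same row are still needed.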
Choosing a Method
| Method | Best For | Notes |
|---|---|---|
| IQR | Quick, intuitive | Works for symmetric distributions |
| Z-score | Normally distributed data | Assumes Gaussian |
| Modified Z-score | Robust to extreme outliers | Uses median |
| Isolation Forest | High-dimensional, many outliers | ML-based |
| DBSCAN | Clustered data with outliers | Labels points outside every dense cluster as noise |
| Visualization | Exploratory | Always do this first |
When to Keep Outliers
✅ Keep if they represent real phenomena (fraud, peaks)
✅ Keep if they're the target (anomaly detection)
✅ Keep if domain says they're valid
✅ Keep if model is robust (tree-based)
❌ Remove if they're data entry errors
❌ Remove if they're sensor failures
❌ Remove if they break statistical assumptions
❌ Remove if they're from a different population
Complete Pipeline
def detect_and_report_outliers(df, numeric_cols=None):
    """Detect outliers in each numeric column with the IQR rule and build a report."""
    if numeric_cols is None:
        numeric_cols = df.select_dtypes(include=[np.number]).columns
    report = []
    for col in numeric_cols:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        lower = Q1 - 1.5 * IQR
        upper = Q3 + 1.5 * IQR
        n_outliers = ((df[col] < lower) | (df[col] > upper)).sum()
        if n_outliers > 0:
            report.append({
                "column": col,
                "lower": lower,
                "upper": upper,
                "n_outliers": n_outliers,
                "pct_outliers": n_outliers / len(df) * 100,
            })
    if not report:
        # No column has outliers: return an empty report instead of failing in sort_values
        return pd.DataFrame(columns=["column", "lower", "upper", "n_outliers", "pct_outliers"])
    return pd.DataFrame(report).sort_values("n_outliers", ascending=False)
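The report above relies on the IQR rule alone. One possible extension, sketched below, counts outliers per column with both the IQR rule and the modified Z-score so the two can be cross-checked. The function name `compare_methods_report`, the output column names, and the demo data are all hypothetical.

```python
import numpy as np
import pandas as pd

def compare_methods_report(df, z_threshold=3.5):
    """Count outliers per numeric column with the IQR rule and the modified Z-score."""
    rows = []
    for col in df.select_dtypes(include=[np.number]).columns:
        s = df[col].dropna()
        q1, q3 = s.quantile(0.25), s.quantile(0.75)
        iqr = q3 - q1
        n_iqr = int(((s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)).sum())

        median = s.median()
        mad = (s - median).abs().median()
        if mad > 0:
            n_mad = int((0.6745 * (s - median).abs() / mad > z_threshold).sum())
        else:
            n_mad = None  # modified Z-score is undefined when the MAD is zero
        rows.append({"column": col, "n_iqr": n_iqr, "n_modified_z": n_mad})
    return pd.DataFrame(rows, columns=["column", "n_iqr", "n_modified_z"])

# Hypothetical data for illustration; the text column is skipped automatically
demo = pd.DataFrame({
    "amount": [1, 2, 3, 4, 5, 6, 7, 8, 100],
    "age": [30, 31, 32, 33, 34, 35, 36, 37, 38],
    "name": list("abcdefghi"),
})
report = compare_methods_report(demo)
print(report)
```

Columns where the two counts disagree sharply are good candidates for a closer look with the plots from the visualization section.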