Outlier Detection and Treatment in Python

DodaTech 4 min read

This tutorial shows how to detect and treat outliers in Python, with a runnable example for each method and guidance on when to keep or remove the points it flags.

What You'll Learn

Detect and handle outliers using visualization, statistical rules (IQR, Z-score, modified Z-score), and machine learning (Isolation Forest, DBSCAN).

Why It Matters

Outliers can skew statistical analyses and hurt model performance. Detecting them isn't about removal — it's about understanding whether they're noise or signal.

Real-World Use

Detecting fraudulent transactions (outliers = fraud), removing sensor noise from IoT data, or finding data entry errors in a customer database.

What Are Outliers?

Outliers are data points that differ significantly from other observations. They can be:

Errors — sensor malfunction, data entry mistake
Interesting — fraud, rare event, discovery
Natural — legitimate extreme values (e.g., Bill Gates in a salary dataset)

Visualizing Outliers

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Generate data with outliers
np.random.seed(42)
data = np.random.randn(100) * 10 + 50
data = np.append(data, [120, 130, 5, 3])  # Add outliers

df = pd.DataFrame({"value": data})

# Box plot
plt.figure(figsize=(10, 4))
sns.boxplot(x=df["value"])
plt.title("Box Plot — Outliers Beyond Whiskers")
plt.show()

# Histogram
plt.figure(figsize=(10, 5))
sns.histplot(df["value"], bins=20)
plt.axvline(df["value"].mean(), color="red", linestyle="--", label="Mean")
plt.axvline(df["value"].median(), color="green", linestyle=":", label="Median")
plt.legend()
plt.show()

# Scatter
plt.figure(figsize=(12, 4))
plt.scatter(range(len(df)), df["value"], alpha=0.6)
plt.axhline(df["value"].mean() + 2 * df["value"].std(), color="r", linestyle="--", label="+2 Std")
plt.axhline(df["value"].mean() - 2 * df["value"].std(), color="r", linestyle="--", label="-2 Std")
plt.legend()
plt.show()
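The plots show the effect visually; the same point can be checked numerically. The sketch below rebuilds the synthetic data from the example above (the names `base` and `with_outliers` are introduced here for illustration) and compares summary statistics with and without the four injected values. It is a minimal illustration, not a general test.

```python
import numpy as np

# Rebuild the synthetic sample used above: 100 roughly normal points ...
rng = np.random.RandomState(42)
base = rng.randn(100) * 10 + 50
# ... plus the same four injected extremes (two high, two low)
with_outliers = np.append(base, [120, 130, 5, 3])

for name, values in [("without outliers", base), ("with outliers", with_outliers)]:
    print(
        f"{name:>17}: mean={values.mean():.2f}  "
        f"median={np.median(values):.2f}  std={values.std(ddof=1):.2f}"
    )
```

Because two extremes were added on each side, the median does not move at all, while the mean shifts and the standard deviation grows. That is why the robust methods later in this tutorial are built on the median and quartiles.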

Method 1: IQR (Interquartile Range)

def detect_outliers_iqr(df, column):
    """Detect outliers using the IQR method."""
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR

    outliers = df[(df[column] < lower) | (df[column] > upper)]
    return outliers, lower, upper

outliers, lower, upper = detect_outliers_iqr(df, "value")
print(f"IQR bounds: [{lower:.2f}, {upper:.2f}]")
print(f"Outliers: {len(outliers)} points")
print(outliers)
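To make the arithmetic behind the IQR bounds concrete, here is a small worked sketch on nine hypothetical readings. The helper name `iqr_fences` and its multiplier parameter `k` are assumptions made for this illustration; 1.5 is the conventional multiplier, and a larger value such as 3.0 flags only more extreme points.

```python
import pandas as pd

def iqr_fences(series, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical readings with one suspicious value
values = pd.Series([10, 12, 12, 13, 14, 15, 16, 18, 40])

# Q1 = 12 and Q3 = 16, so IQR = 4 and the fences are 12 - 6 = 6 and 16 + 6 = 22
lower, upper = iqr_fences(values)
flagged = values[(values < lower) | (values > upper)]
print(f"fences: [{lower}, {upper}], flagged: {flagged.tolist()}")

# A wider multiplier moves the fences out to [0, 28]; 40 is still outside
wide_lower, wide_upper = iqr_fences(values, k=3.0)
```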

Method 2: Z-Score (Standard Deviation)

from scipy import stats

def detect_outliers_zscore(df, column, threshold=3):
    """Detect outliers using Z-score method."""
    z_scores = np.abs(stats.zscore(df[column]))
    outliers = df[z_scores > threshold]
    return outliers

outliers = detect_outliers_zscore(df, "value", threshold=3)
print(f"Z-score outliers: {len(outliers)} points")
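One caveat with the Z-score is that an extreme value inflates the standard deviation used to judge it. With the population standard deviation (the default in `scipy.stats.zscore`), no point in a sample of n values can have an absolute Z-score above sqrt(n - 1), so a threshold of 3 cannot flag anything when n is 10 or less. The sketch below illustrates this with hypothetical numbers, using plain NumPy.

```python
import numpy as np

# Ten hypothetical readings; the last one is an obvious extreme
values = np.array([50, 51, 49, 52, 48, 50, 51, 49, 50, 1000.0])

# Population standard deviation (ddof=0), matching scipy.stats.zscore's default
z = (values - values.mean()) / values.std()

max_abs_z = np.abs(z).max()
bound = np.sqrt(len(values) - 1)  # largest |z| any point can reach for this n

print(f"largest |z| = {max_abs_z:.4f}, theoretical bound = {bound:.4f}")
print("flagged at threshold 3:", int((np.abs(z) > 3).sum()))
```

For small samples, a lower threshold or the modified Z-score in the next section is usually the safer choice.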

Method 3: Modified Z-Score (Robust)

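def detect_outliers_modified_zscore(df, column, threshold=3.5):
    """Detect outliers using median absolute deviation."""
    median = df[column].median()
    mad = np.median(np.abs(df[column] - median))
    if mad == 0:
        # At least half the values equal the median, so the score is undefined;
        # return an empty frame instead of dividing by zero
        return df.iloc[0:0]
    # 0.6745 rescales the MAD so the score is comparable to a standard Z-score
    modified_z = 0.6745 * (df[column] - median) / mad
    outliers = df[np.abs(modified_z) > threshold]
    return outliers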

outliers = detect_outliers_modified_zscore(df, "value")
print(f"Modified Z-score outliers: {len(outliers)} points")
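As an alternative sketch, SciPy's `median_abs_deviation` can compute the MAD directly. With `scale="normal"` it divides the MAD by roughly 0.6745, which is equivalent to the constant used above. The helper name `modified_zscore` and the choice to return NaN when the MAD is zero are assumptions made for this example.

```python
import numpy as np
import pandas as pd
from scipy.stats import median_abs_deviation

def modified_zscore(series):
    """Modified Z-score; returns NaN everywhere if the MAD is zero."""
    median = series.median()
    # scale="normal" rescales the MAD to be comparable to a standard deviation
    mad = median_abs_deviation(series.to_numpy(), scale="normal")
    if mad == 0:
        return pd.Series(np.nan, index=series.index)
    return (series - median) / mad

# Hypothetical readings: median is 14 and the raw MAD is 2
values = pd.Series([10, 12, 12, 13, 14, 15, 16, 18, 40])
scores = modified_zscore(values)
flagged = values[scores.abs() > 3.5]
print(flagged.tolist())

# A constant column has MAD = 0, so nothing can be scored or flagged
constant_scores = modified_zscore(pd.Series([5, 5, 5, 5]))
```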

Method 4: Isolation Forest

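Isolation Forest isolates points with random splits; anomalies need fewer splits to isolate, which makes the method a good fit for high-dimensional data: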

from sklearn.ensemble import IsolationForest

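def detect_outliers_isolation_forest(df, columns, contamination=0.05):
    """Detect outliers using Isolation Forest."""
    model = IsolationForest(
        contamination=contamination,
        random_state=42,
    )
    # fit_predict returns labels, not scores: -1 for outliers, 1 for inliers
    labels = model.fit_predict(df[columns])
    # assign() returns a copy, so the caller's DataFrame is not modified
    result = df.assign(outlier_label=labels)
    return result[result["outlier_label"] == -1]

# Example with a single column; pass several column names for multivariate data
outliers = detect_outliers_isolation_forest(
    df, ["value"], contamination=0.1
)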
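The labels alone hide how anomalous each row is. As a hedged sketch, the example below fits the model on two hypothetical features (`amount` and `items`, invented for this illustration) and uses `decision_function` to get a continuous score, where negative values correspond to the rows labeled -1.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# 200 ordinary rows plus three hand-made odd rows (hypothetical data)
normal_rows = np.column_stack([rng.normal(50, 10, 200), rng.normal(5, 1, 200)])
odd_rows = np.array([[300.0, 5.0], [50.0, 20.0], [250.0, 18.0]])
features = pd.DataFrame(np.vstack([normal_rows, odd_rows]), columns=["amount", "items"])

model = IsolationForest(contamination=0.02, random_state=42)
model.fit(features)

# label: -1 outlier / 1 inlier; score: lower means more anomalous, below 0 means outlier
scored = features.assign(
    label=model.predict(features),
    score=model.decision_function(features),
)
print(scored.sort_values("score").head())
```

Sorting by score gives a ranked review list, which is often more useful than a hard cut-off because `contamination` is only a guess at the true outlier fraction.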

Method 5: DBSCAN

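DBSCAN groups points by density and labels points in low-density regions as noise (cluster label -1), which makes it useful for spatial outliers: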

from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

def detect_outliers_dbscan(df, columns, eps=0.5, min_samples=5):
    """Detect outliers using DBSCAN clustering."""
    # Scale first so eps means the same distance in every column
    scaled = StandardScaler().fit_transform(df[columns])
    clustering = DBSCAN(eps=eps, min_samples=min_samples).fit(scaled)
    # DBSCAN labels noise points as -1; assign() leaves the input df unchanged
    result = df.assign(cluster=clustering.labels_)
    return result[result["cluster"] == -1]
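DBSCAN is easiest to understand on two-dimensional data. The sketch below builds two tight hypothetical clusters plus two isolated points and shows that the isolated points receive the noise label -1. The data, column names, and the `eps` and `min_samples` values are assumptions chosen for this illustration; real data usually needs these parameters tuned.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two dense groups of 100 points each (hypothetical coordinates)
cluster_a = rng.normal([0, 0], 0.3, size=(100, 2))
cluster_b = rng.normal([5, 5], 0.3, size=(100, 2))
# Two points that belong to neither group
isolated = np.array([[2.5, 2.5], [10.0, -5.0]])

points = pd.DataFrame(np.vstack([cluster_a, cluster_b, isolated]), columns=["x", "y"])

# Scale, then cluster; fit_predict returns one label per row, -1 meaning noise
scaled = StandardScaler().fit_transform(points)
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(scaled)

noise = points[labels == -1]
print(f"clusters found: {len(set(labels) - {-1})}, noise points: {len(noise)}")
print(noise)
```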

Treating Outliers

# Option 1: Remove (lower and upper are the IQR bounds from Method 1)
df_clean = df[(df["value"] >= lower) & (df["value"] <= upper)]

# Option 2: Cap at the IQR bounds
df["value_capped"] = df["value"].clip(lower, upper)

# Option 3: Replace with median
median = df["value"].median()
df["value_filled"] = df["value"].copy()
mask = (df["value"] < lower) | (df["value"] > upper)
df.loc[mask, "value_filled"] = median

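# Option 4: Percentile-based winsorization with scipy
# limits=[0.05, 0.05] clips the lowest 5% and highest 5% of values to the nearest remaining value
from scipy.stats.mstats import winsorize
# winsorize returns a masked array; convert it to a plain array before storing it
df["value_winsorized"] = np.asarray(winsorize(df["value"].to_numpy(), limits=[0.05, 0.05]))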

# Compare results
print("Original mean:", df["value"].mean())
print("After removal:", df_clean["value"].mean())
print("After capping:", df["value_capped"].mean())
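To compare the treatments side by side without adding columns to the original DataFrame, the sketch below applies three of the options to a standalone Series and collects summary statistics in one table. It rebuilds the synthetic data from earlier; the names `s`, `treated`, and `summary` are introduced for this illustration.

```python
import numpy as np
import pandas as pd

# Same synthetic sample as earlier: 100 normal points plus four extremes
rng = np.random.RandomState(42)
s = pd.Series(np.append(rng.randn(100) * 10 + 50, [120, 130, 5, 3]), name="value")

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
inside = s.between(lower, upper)  # True for values within the IQR bounds

treated = {
    "original": s,
    "removed": s[inside],                      # drop flagged rows
    "capped": s.clip(lower, upper),            # pull flagged values to the bounds
    "median-filled": s.where(inside, s.median()),  # replace flagged values
}

# One row per treatment, one column per statistic
summary = pd.DataFrame(
    {name: values.agg(["count", "mean", "std", "min", "max"]) for name, values in treated.items()}
).T
print(summary.round(2))
```

Removal shrinks the sample, while capping and median-filling keep every row but change the spread, so the right choice depends on whether row count or distribution shape matters more downstream.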

Choosing a Method

Method	Best For	Notes
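IQR	Quick, intuitive screening	No distribution is fitted; the 1.5 × IQR bounds suit roughly symmetric data
Z-score	Normally distributed data	Assumes Gaussian; extreme values inflate the mean and standard deviation
Modified Z-score	Robust to extreme outliers	Uses median and MAD
Isolation Forest	High-dimensional, many outliers	ML-based; needs a contamination estimate
DBSCAN	Clustered data with outliers	Flags points in low-density regions as noise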
Visualization	Exploratory	Always do this first
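The symmetry and normality notes in the table matter in practice. As a rough sketch, the code below counts how many points each statistical rule flags on a right-skewed lognormal sample with no injected errors, then repeats the count after a log transform. The sample and the helper names are hypothetical, and the exact counts depend on the random seed.

```python
import numpy as np

rng = np.random.default_rng(0)
# Right-skewed sample with no injected errors (hypothetical data)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

def count_iqr(x, k=1.5):
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return int(((x < q1 - k * iqr) | (x > q3 + k * iqr)).sum())

def count_zscore(x, threshold=3.0):
    z = (x - x.mean()) / x.std()
    return int((np.abs(z) > threshold).sum())

def count_modified_z(x, threshold=3.5):
    median = np.median(x)
    mad = np.median(np.abs(x - median))
    return int((np.abs(0.6745 * (x - median) / mad) > threshold).sum())

for label, sample in [("raw", skewed), ("log-transformed", np.log(skewed))]:
    print(
        f"{label:>15}: IQR={count_iqr(sample)}, "
        f"Z={count_zscore(sample)}, modified Z={count_modified_z(sample)}"
    )
```

On the raw sample the rules flag a large share of the long right tail even though every value is legitimate; after the log transform the counts drop sharply. Transforming skewed data before applying these rules is often a better first step than deleting the flagged points.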

When to Keep Outliers

✅ Keep if they represent real phenomena (fraud, peaks)
✅ Keep if they're the target (anomaly detection)
✅ Keep if domain says they're valid
✅ Keep if model is robust (tree-based)

❌ Remove if they're data entry errors
❌ Remove if they're sensor failures
❌ Remove if they break statistical assumptions
❌ Remove if they're from a different population

Complete Pipeline

def detect_and_report_outliers(df, numeric_cols=None):
    """Detect outliers in each numeric column with the IQR rule and return a summary report."""
    if numeric_cols is None:
        numeric_cols = df.select_dtypes(include=[np.number]).columns

    report = []

    for col in numeric_cols:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        lower = Q1 - 1.5 * IQR
        upper = Q3 + 1.5 * IQR
        n_outliers = ((df[col] < lower) | (df[col] > upper)).sum()

        if n_outliers > 0:
            report.append({
                "column": col,
                "lower": lower,
                "upper": upper,
                "n_outliers": n_outliers,
                "pct_outliers": n_outliers / len(df) * 100,
            })

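    if not report:
        # No column had outliers: return an empty report with the expected columns
        return pd.DataFrame(columns=["column", "lower", "upper", "n_outliers", "pct_outliers"])

    return pd.DataFrame(report).sort_values("n_outliers", ascending=False)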
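The report summarizes counts per column. To see which rows are affected, a vectorized variant can compute the same IQR bounds for all numeric columns at once and return a boolean mask. This is a sketch under the same 1.5 × IQR rule; the helper name `iqr_outlier_mask` and the demo data are hypothetical.

```python
import numpy as np
import pandas as pd

def iqr_outlier_mask(df, k=1.5):
    """Boolean DataFrame: True where a numeric value lies outside its column's IQR bounds."""
    numeric = df.select_dtypes(include=[np.number])
    q1 = numeric.quantile(0.25)  # one quartile per column
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the DataFrame's columns
    return (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)

# Hypothetical customer table with one bad age and one bad income
demo = pd.DataFrame({
    "age": [25, 31, 28, 35, 29, 33, 27, 30, 120],
    "income": [40, 42, 39, 45, 41, 43, 40, 400, 44],
    "city": list("ABCABCABC"),  # non-numeric columns are ignored
})

mask = iqr_outlier_mask(demo)
per_column = mask.sum()                 # outlier count per numeric column
flagged_rows = demo[mask.any(axis=1)]   # rows with at least one flagged value
print(per_column)
print(flagged_rows)
```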
