Data Normalization and Standardization Techniques

DodaTech 3 min read

This tutorial covers data normalization and standardization: Min-Max scaling, Z-score standardization, and robust scaling, with scikit-learn examples and guidance on choosing between them.

What You'll Learn

Scale and normalize data for Machine Learning — Min-Max scaling, Standardization (Z-score), Robust scaling, and how to choose and apply each method.

Why It Matters

Many ML algorithms (SVM, neural networks, KNN, PCA) are sensitive to feature scale. When features sit on very different scales, distance-based models are dominated by the largest-valued feature and gradient-based models converge slowly.

Real-World Use

Preprocessing customer data before clustering, scaling pixel values for image classification, or normalizing sensor readings from different units.

When Scaling Matters

# KNN without scaling: distance is dominated by large-scale features
features = {
    "age": [25, 30, 35, 28, 22],         # Range: 13
    "income": [50000, 60000, 70000, 55000, 45000],  # Range: 25000
}

# Distance between person 1 and person 2
# Without scaling: income swamps age
# With scaling: both features contribute equally
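To make the comments above concrete, here is a minimal sketch that measures the Euclidean distance between the first two people before and after scaling. The helper `age_share` is a hypothetical name introduced for illustration, and StandardScaler is only one possible choice of scaler.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Same five people as above: columns are age and income
X = np.array([
    [25, 50000],
    [30, 60000],
    [35, 70000],
    [28, 55000],
    [22, 45000],
], dtype=float)


def age_share(a, b):
    """Fraction of the squared Euclidean distance contributed by age (column 0)."""
    squared_diff = (a - b) ** 2
    return float(squared_diff[0] / squared_diff.sum())


# Raw features: the income difference (10000) dwarfs the age difference (5)
raw_distance = float(np.linalg.norm(X[0] - X[1]))
raw_share = age_share(X[0], X[1])

# Standardized features: both columns are expressed in standard deviations
X_scaled = StandardScaler().fit_transform(X)
scaled_distance = float(np.linalg.norm(X_scaled[0] - X_scaled[1]))
scaled_share = age_share(X_scaled[0], X_scaled[1])

print(f"Raw distance:    {raw_distance:.2f} (age share: {raw_share:.2e})")
print(f"Scaled distance: {scaled_distance:.2f} (age share: {scaled_share:.2f})")
```

On the raw data the distance is almost exactly the income difference, so age has practically no influence on which neighbors KNN picks. After standardization the two features contribute in roughly equal proportion.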

Algorithms that need scaling: SVM, KNN, neural networks, PCA, logistic regression, K-means, linear regression (with regularization).

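Algorithms that don't: Decision trees, Random Forest, Gradient Boosting, Naive Bayes. Tree-based models split on per-feature thresholds, so rescaling a feature does not change which splits are chosen.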

Min-Max Scaling (Normalization)

Scales to a fixed range, typically [0, 1]:

import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({
    "age": [25, 30, 35, 28, 50],
    "income": [50000, 60000, 70000, 55000, 120000],
    "experience": [2, 5, 10, 3, 25],
})

scaler = MinMaxScaler()
scaled = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled, columns=df.columns)

print(scaled_df)
#     age    income  experience
# 0  0.00  0.00000      0.0000
# 1  0.20  0.14286      0.1304
# 2  0.40  0.28571      0.3478
# 3  0.12  0.07143      0.0435
# 4  1.00  1.00000      1.0000

# Inverse transform
original = scaler.inverse_transform(scaled)

Formula: X_scaled = (X - X_min) / (X_max - X_min)
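
As a sanity check on the formula, the following sketch applies it column by column with NumPy and compares the result with MinMaxScaler. The array repeats the values of the example DataFrame; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Same values as the DataFrame above: age, income, experience
X = np.array([
    [25, 50000, 2],
    [30, 60000, 5],
    [35, 70000, 10],
    [28, 55000, 3],
    [50, 120000, 25],
], dtype=float)

# Per-column minimum and maximum
x_min = X.min(axis=0)
x_max = X.max(axis=0)

# X_scaled = (X - X_min) / (X_max - X_min), applied to every column at once
manual = (X - x_min) / (x_max - x_min)

# Compare with scikit-learn's implementation
sklearn_scaled = MinMaxScaler().fit_transform(X)
matches = bool(np.allclose(manual, sklearn_scaled))

print("Matches MinMaxScaler:", matches)
print("Scaled age column:", manual[:, 0])
```

Note that the manual version divides by zero when a column is constant (X_max equals X_min), so real code would need to guard against that case.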

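Use when: You know the bounds, data is uniformly distributed, or you need features in [0, 1]. Avoid it when the data contains outliers, because a single extreme value compresses all other values into a narrow band.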

Standardization (Z-Score)

Centers around 0 with standard deviation 1:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled, columns=df.columns)

print(scaled_df.round(2))
#     age  income  experience
# 0 -0.97   -0.83       -0.83
# 1 -0.41   -0.43       -0.47
# 2  0.16   -0.04        0.12
# 3 -0.63   -0.63       -0.71
# 4  1.86    1.93        1.89

# Check: mean ≈ 0, std ≈ 1
# StandardScaler divides by the population std (ddof=0); pandas .std() defaults to ddof=1
print(scaled_df.mean().round(2))        # ~0
print(scaled_df.std(ddof=0).round(2))   # 1.0

Formula: X_scaled = (X - μ) / σ
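
The formula can be checked by hand with NumPy. This is a minimal sketch using the same values as the example DataFrame; it assumes σ is the population standard deviation (ddof=0), which is what StandardScaler uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Same values as the DataFrame above: age, income, experience
X = np.array([
    [25, 50000, 2],
    [30, 60000, 5],
    [35, 70000, 10],
    [28, 55000, 3],
    [50, 120000, 25],
], dtype=float)

mu = X.mean(axis=0)      # per-column mean
sigma = X.std(axis=0)    # per-column population std (NumPy default is ddof=0)

# X_scaled = (X - mu) / sigma
manual = (X - mu) / sigma

sklearn_scaled = StandardScaler().fit_transform(X)
matches = bool(np.allclose(manual, sklearn_scaled))

# Using the sample std (ddof=1) instead gives slightly smaller z-scores
sample_version = (X - mu) / X.std(axis=0, ddof=1)

print("Matches StandardScaler:", matches)
print("Column means:", manual.mean(axis=0).round(6))
print("Column stds: ", manual.std(axis=0).round(6))
```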

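Use when: Data is roughly bell-shaped, you want zero-centered features with unit variance, or the algorithm assumes Gaussian-like inputs. Standardization does not require normality, but it is sensitive to outliers because both the mean and the standard deviation are.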

Robust Scaling

Uses median and IQR — robust to outliers:

from sklearn.preprocessing import RobustScaler

# Data with an outlier
df_outlier = pd.DataFrame({
    "value": [10, 12, 11, 13, 12, 100]  # 100 is an outlier
})

standard = StandardScaler()
robust = RobustScaler()

standard_scaled = standard.fit_transform(df_outlier)
robust_scaled = robust.fit_transform(df_outlier)

print("Standard:", standard_scaled.flatten().round(2))
# [-0.5  -0.43 -0.47 -0.4  -0.43  2.24]
# The outlier inflates the mean and std, squeezing the normal values into a narrow band

print("Robust:", robust_scaled.flatten().round(2))
# [-1.33  0.   -0.67  0.67  0.   58.67]
# Normal values keep a usable spread; the outlier stays extreme but no longer distorts them

Use when: Data has outliers that shouldn't dominate scaling.
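
With its default settings, RobustScaler subtracts the median and divides by the interquartile range (75th minus 25th percentile). The sketch below reproduces that by hand on the outlier data above; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Same data as above, as a single-column 2D array
values = np.array([10, 12, 11, 13, 12, 100], dtype=float).reshape(-1, 1)

# Median and interquartile range per column
median = np.median(values, axis=0)
q1, q3 = np.percentile(values, [25, 75], axis=0)
iqr = q3 - q1

# X_scaled = (X - median) / IQR
manual = (values - median) / iqr

sklearn_scaled = RobustScaler().fit_transform(values)
matches = bool(np.allclose(manual, sklearn_scaled))

print("Median:", median[0], "IQR:", iqr[0])
print("Matches RobustScaler:", matches)
```

Because the median and the quartiles barely move when 100 is added or removed, the scaling of the ordinary values is nearly unaffected by the outlier.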

Which One to Choose?

Method	Output Range	Center	Robust to Outliers?	Best For
Min-Max	[0, 1]	Not centered	No	Neural networks, image pixels
Standardization	Unbounded	Mean at 0	No	Most ML algorithms
Robust	Unbounded	Median at 0	Yes	Data with outliers
MaxAbs	[-1, 1]	Not centered (zeros stay zero)	No	Sparse data
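
The table lists MaxAbs scaling for sparse data without an example. The following is a minimal sketch with a small, made-up SciPy sparse matrix, showing that MaxAbsScaler divides each column by its largest absolute value and leaves the zero entries untouched.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler

# Small illustrative sparse matrix (mostly zeros)
X_sparse = sparse.csr_matrix(np.array([
    [0.0, 4.0, 0.0],
    [2.0, 0.0, 0.0],
    [0.0, -8.0, 5.0],
    [-1.0, 0.0, 0.0],
]))

# Each column is divided by its maximum absolute value; no centering is done,
# so zeros stay zero and the result is still sparse
scaled = MaxAbsScaler().fit_transform(X_sparse)

print("Still sparse:", sparse.issparse(scaled))
print("Non-zero entries before/after:", X_sparse.nnz, scaled.nnz)
print(scaled.toarray())
```

Centering-based scalers such as StandardScaler would turn most zeros into non-zero values, which is why a scaler that only divides is the usual choice here.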

Applying to Train/Test Split

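from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Placeholder data so the snippet runs; substitute your own X (features) and y (target)
X, y = load_iris(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)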

# Fit ONLY on training data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Transform test data with training parameters
X_test_scaled = scaler.transform(X_test)
# ⚠️ Never fit on test data — that's data leakage
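
To see what "training parameters" means in practice, this sketch inspects the fitted scaler on randomly generated data. The data is synthetic and exists only for illustration; `mean_` and `scale_` are the attributes StandardScaler stores after fitting.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature data (illustration only): age-like and income-like columns
rng = np.random.default_rng(42)
X = rng.normal(loc=[30.0, 60000.0], scale=[8.0, 15000.0], size=(200, 2))
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit on the training split only
scaler = StandardScaler().fit(X_train)

# The stored statistics are exactly those of the training split
uses_train_mean = bool(np.allclose(scaler.mean_, X_train.mean(axis=0)))
uses_train_std = bool(np.allclose(scaler.scale_, X_train.std(axis=0)))

# The test split is transformed with those same statistics,
# so its mean is close to 0 but not exactly 0
X_test_scaled = scaler.transform(X_test)

print("Uses training mean:", uses_train_mean)
print("Uses training std: ", uses_train_std)
print("Test mean after scaling:", X_test_scaled.mean(axis=0).round(3))
```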

Pipeline Integration

from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("svm", SVC()),
])

pipeline.fit(X_train, y_train)
pipeline.score(X_test, y_test)
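
One benefit of the pipeline is that it can be passed straight to cross-validation, where the scaler is refit on the training portion of every fold. A minimal sketch, assuming the built-in iris dataset as stand-in data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in dataset for illustration; substitute your own X and y
X, y = load_iris(return_X_y=True)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("svm", SVC()),
])

# In each of the 5 folds the whole pipeline is fit on the training part only,
# so the held-out part never influences the scaling statistics
scores = cross_val_score(pipeline, X, y, cv=5)

print("Fold accuracies:", scores.round(3))
print("Mean accuracy:  ", scores.mean().round(3))
```

Scaling the full dataset first and cross-validating afterwards would leak information from the validation folds into the scaler, which the pipeline avoids.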

Visualization Comparison

import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 2, figsize=(12, 8))

# Original
axes[0, 0].hist(df["income"], bins=15, edgecolor="black")
axes[0, 0].set_title("Original")

# Min-Max
axes[0, 1].hist(MinMaxScaler().fit_transform(df[["income"]]), bins=15, edgecolor="black")
axes[0, 1].set_title("Min-Max [0, 1]")

# Standard
axes[1, 0].hist(StandardScaler().fit_transform(df[["income"]]), bins=15, edgecolor="black")
axes[1, 0].set_title("Standardization (μ=0, σ=1)")

# Robust
axes[1, 1].hist(RobustScaler().fit_transform(df[["income"]]), bins=15, edgecolor="black")
axes[1, 1].set_title("Robust (median-based)")

plt.tight_layout()
plt.show()

Quick Reference

from sklearn.preprocessing import (
    StandardScaler,     # Z-score: (x - μ) / σ
    MinMaxScaler,       # [0, 1]: (x - min) / (max - min)
    RobustScaler,       # Median/IQR based
    MaxAbsScaler,       # [-1, 1]: x / max(|x|)
    Normalizer,         # Unit norm per sample (row-wise: L1, L2, or max), not per feature
)

# Usage pattern (always the same):
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_new_scaled = scaler.transform(X_new)  # Uses training statistics
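
Normalizer is the odd one out in this list because it works on rows rather than columns. A small sketch with made-up values shows the difference:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Three samples (rows) with two features each; values are illustrative
X = np.array([
    [3.0, 4.0],
    [1.0, 0.0],
    [10.0, 10.0],
])

# Each row is divided by its own L2 norm, independently of the other rows
row_normalized = Normalizer(norm="l2").fit_transform(X)
row_norms = np.linalg.norm(row_normalized, axis=1)

print(row_normalized)
print("Row norms:", row_norms)
```

The first row becomes [0.6, 0.8] because its length is 5. Since no statistics are learned across samples, this is useful for direction-based comparisons such as cosine similarity rather than for putting features on a common scale.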
