Data Normalization and Standardization Techniques
This tutorial explains when feature scaling matters, how the main scikit-learn scalers work, and how to apply them without leaking information from the test set.
What You'll Learn
How to scale and normalize data for machine learning: Min-Max scaling, standardization (Z-score), robust scaling, and how to choose and apply each method.
Why It Matters
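Many ML algorithms (SVM, neural networks, KNN, PCA) are sensitive to feature scale because they rely on distances, dot products, variances, or gradient descent. Without scaling, features with large numeric ranges dominate the result, and gradient-based training converges more slowly.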
Real-World Use
Preprocessing customer data before clustering, scaling pixel values for image classification, or normalizing sensor readings from different units.
When Scaling Matters
# KNN without scaling: distance is dominated by large-scale features
features = {
    "age": [25, 30, 35, 28, 22],                    # Range: 13 (22 to 35)
    "income": [50000, 60000, 70000, 55000, 45000],  # Range: 25000 (45000 to 70000)
}
# Distance between person 1 and person 2
# Without scaling: the income gap (10000) swamps the age gap (5)
# With scaling: both features contribute on comparable terms
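To make the comments above concrete, here is a minimal sketch that measures the distance between person 1 and person 2 before and after scaling. The `euclidean` helper and the variable names are illustrative assumptions, and the scaling is done by hand with NumPy rather than with a scikit-learn scaler.

```python
import numpy as np

# Same toy data as above: one row per person, columns are (age, income).
X = np.array([
    [25, 50000],
    [30, 60000],
    [35, 70000],
    [28, 55000],
    [22, 45000],
], dtype=float)


def euclidean(a, b):
    """Plain Euclidean distance between two feature vectors (illustrative helper)."""
    return float(np.sqrt(np.sum((a - b) ** 2)))


# Raw distance between person 1 and person 2.
raw_dist = euclidean(X[0], X[1])

# Fraction of the squared distance that comes from income alone.
income_share = (X[0, 1] - X[1, 1]) ** 2 / raw_dist ** 2

# Min-max scale each column to [0, 1], then measure again.
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
scaled_dist = euclidean(X_scaled[0], X_scaled[1])

print(f"raw distance:    {raw_dist:.2f}")
print(f"income share:    {income_share:.8f}")
print(f"scaled distance: {scaled_dist:.4f}")
```

In the raw data, almost all of the squared distance comes from income, so a nearest-neighbor search would effectively ignore age. After scaling, the two per-feature gaps are of similar size.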
Algorithms that need scaling: SVM, KNN, neural networks, PCA, logistic regression, K-means, linear regression (with regularization).
Algorithms that don't: Decision trees, Random Forest, Gradient Boosting, Naive Bayes.
Min-Max Scaling (Normalization)
Scales to a fixed range, typically [0, 1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
df = pd.DataFrame({
    "age": [25, 30, 35, 28, 50],
    "income": [50000, 60000, 70000, 55000, 120000],
    "experience": [2, 5, 10, 3, 25],
})
scaler = MinMaxScaler()
scaled = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled, columns=df.columns)
print(scaled_df)
# Approximate output (rounded):
#     age   income  experience
# 0  0.00  0.00000      0.0000
# 1  0.20  0.14286      0.1304
# 2  0.40  0.28571      0.3478
# 3  0.12  0.07143      0.0435
# 4  1.00  1.00000      1.0000
# Inverse transform
original = scaler.inverse_transform(scaled)
Formula: X_scaled = (X - X_min) / (X_max - X_min)
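Use when: The feature has known, fixed bounds (such as pixel intensities), contains no extreme outliers, or the model expects inputs in [0, 1]. Keep in mind that a single outlier stretches the range and compresses every other value toward one end.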
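The formula is simple enough to apply by hand. The sketch below reproduces the scaled `age` column with plain NumPy and then applies the same fitted bounds to a new value; the `min_max` helper and the new value of 60 are hypothetical, chosen only for illustration.

```python
import numpy as np

age = np.array([25, 30, 35, 28, 50], dtype=float)

# "Fitting" a min-max scaler amounts to remembering these two numbers.
age_min, age_max = age.min(), age.max()


def min_max(x, lo, hi):
    """Apply X_scaled = (X - X_min) / (X_max - X_min) (illustrative helper)."""
    return (x - lo) / (hi - lo)


scaled_age = min_max(age, age_min, age_max)
print(scaled_age)  # same values as the age column in the output above

# A value outside the fitted bounds is not clipped by the formula,
# so unseen data can land outside [0, 1].
new_age = min_max(np.array([60.0]), age_min, age_max)
print(new_age)
```

This is one reason min-max scaling works best when the bounds are known in advance: values beyond the training minimum or maximum map outside the target range.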
Standardization (Z-Score)
Centers around 0 with standard deviation 1:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled, columns=df.columns)
print(scaled_df.round(2))
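#     age  income  experience
# 0 -0.97   -0.83       -0.83
# 1 -0.41   -0.43       -0.47
# 2  0.16   -0.04        0.12
# 3 -0.63   -0.63       -0.71
# 4  1.86    1.93        1.89
# Check: mean ≈ 0, std ≈ 1
# StandardScaler divides by the population std (ddof=0), while pandas .std()
# defaults to the sample std (ddof=1), so pass ddof=0 for the check
print(scaled_df.mean().round(2))       # ~0
print(scaled_df.std(ddof=0).round(2))  # ~1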
Formula: X_scaled = (X - μ) / σ
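Use when: You want zero-centered features with comparable variance, which suits most distance-based, linear, and gradient-trained models. The data does not have to be normally distributed, but heavy outliers distort the mean and standard deviation.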
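As a cross-check on the formula, the following sketch standardizes the `age` column by hand with NumPy. It also shows how the choice of standard deviation matters: dividing by the population value (`ddof=0`) gives a result whose population std is exactly 1, while the sample value (`ddof=1`) gives slightly smaller z-scores. Variable names are illustrative.

```python
import numpy as np

age = np.array([25, 30, 35, 28, 50], dtype=float)

mu = age.mean()
sigma_pop = age.std(ddof=0)     # population std: divide by n
sigma_sample = age.std(ddof=1)  # sample std: divide by n - 1

# Z-scores using each convention.
z_pop = (age - mu) / sigma_pop
z_sample = (age - mu) / sigma_sample

print("mean:", mu)
print("z (ddof=0):", z_pop.round(2))
print("z (ddof=1):", z_sample.round(2))

# Measuring the ddof=0 result with the ddof=1 convention gives a std
# larger than 1, which is easy to mistake for a scaling bug.
print("std of z_pop with ddof=0:", round(float(z_pop.std(ddof=0)), 3))
print("std of z_pop with ddof=1:", round(float(z_pop.std(ddof=1)), 3))
```

With only five rows the two conventions differ noticeably; on large datasets the gap becomes negligible.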
Robust Scaling
Subtracts the median and divides by the interquartile range (IQR), so extreme values have little influence on the scaling parameters:
from sklearn.preprocessing import RobustScaler
# Data with an outlier
df_outlier = pd.DataFrame({
    "value": [10, 12, 11, 13, 12, 100]  # 100 is an outlier
})
standard = StandardScaler()
robust = RobustScaler()
standard_scaled = standard.fit_transform(df_outlier)
robust_scaled = robust.fit_transform(df_outlier)
print("Standard:", standard_scaled.flatten().round(2))
# [-0.5  -0.43 -0.47 -0.4  -0.43  2.24]
# The outlier inflates the mean and std, squeezing the typical values into a narrow band
print("Robust:", robust_scaled.flatten().round(2))
# [-1.33  0.   -0.67  0.67  0.   58.67]
# Typical values keep a usable spread around 0; the outlier stays clearly separated
Use when: Data has outliers that shouldn't dominate scaling.
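The median/IQR computation can be sketched by hand with NumPy. This version assumes the 25th and 75th percentiles with NumPy's default linear interpolation; the variable names are illustrative.

```python
import numpy as np

values = np.array([10, 12, 11, 13, 12, 100], dtype=float)

# Center and spread estimates that barely react to the outlier.
median = np.median(values)
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

robust_manual = (values - median) / iqr

print("median:", median, "IQR:", iqr)
print("robust:", robust_manual.round(2))

# For comparison: the mean and std are pulled far from the typical values.
print("mean:", round(float(values.mean()), 2),
      "std:", round(float(values.std()), 2))
```

Note that robust scaling does not remove or shrink the outlier. It keeps the outlier from distorting how the remaining values are scaled.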
Which One to Choose?
| Method | Output Range | Centering | Robust to Outliers? | Best For |
|---|---|---|---|---|
| Min-Max | [0, 1] | Not centered | No | Bounded inputs such as image pixels; neural networks |
| Standardization | Unbounded | Mean at 0 | No | Most ML algorithms |
| Robust | Unbounded | Median at 0 | Yes | Data with outliers |
| MaxAbs | [-1, 1] | Not centered (zeros stay zero) | No | Sparse data |
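MaxAbs scaling is the only method in the table that has no example elsewhere in this tutorial, so here is a small sketch on a sparse matrix. The matrix values are made up for illustration. Because the scaler only divides each column by its largest absolute value, zero entries stay zero and the matrix stays sparse.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler

# Hypothetical sparse feature matrix (most entries are zero).
X_sparse = sparse.csr_matrix(np.array([
    [0.0, 4.0, 0.0],
    [2.0, 0.0, 0.0],
    [0.0, -8.0, 5.0],
]))

scaler = MaxAbsScaler()
X_scaled = scaler.fit_transform(X_sparse)

print(sparse.issparse(X_scaled))     # the result is still a sparse matrix
print(X_scaled.toarray())            # each column divided by its max |value|
print(X_scaled.nnz == X_sparse.nnz)  # no zero entry became non-zero
```

Subtracting a mean or median would turn most zeros into non-zero values, which is why centering scalers are usually a poor fit for sparse inputs.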
Applying to Train/Test Split
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# X (features) and y (target) are assumed to be defined already
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Fit ONLY on training data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
# Transform test data with training parameters
X_test_scaled = scaler.transform(X_test)
# ⚠️ Never fit on test data: that's data leakage
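The snippet above leaves `X` and `y` undefined, so here is a self-contained sketch of the same pattern on synthetic data. The random features and labels are placeholders invented for illustration. It checks that the scaler's statistics come from the training rows only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not from any real dataset).
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Learn the mean and std from the training rows only.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# The stored statistics match the training data, not the full dataset.
print(np.allclose(scaler.mean_, X_train.mean(axis=0)))

# Training columns have mean 0 by construction; test columns are only close to 0.
print(X_train_scaled.mean(axis=0).round(3))
print(X_test_scaled.mean(axis=0).round(3))
```

The test set means being slightly off zero is expected. It reflects that the test rows were treated as unseen data, which is the point of the split.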
Pipeline Integration
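from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
# The pipeline fits the scaler on the data passed to fit(),
# then reuses those statistics in score() and predict()
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("svm", SVC()),
])
pipeline.fit(X_train, y_train)
print(pipeline.score(X_test, y_test))  # accuracy on the test set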
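A pipeline is especially useful with cross-validation, because the scaler is re-fit on the training portion of every fold. The sketch below assumes a synthetic dataset from `make_classification` as a stand-in for real data; the sample count and fold count are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in dataset (an assumption for illustration).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("svm", SVC()),
])

# Each fold fits the scaler on that fold's training rows only,
# so the held-out rows never influence the scaling statistics.
scores = cross_val_score(pipeline, X, y, cv=5)

print(scores)
print("mean accuracy:", scores.mean())
```

Scaling the full dataset first and then cross-validating would let every fold's held-out rows influence the scaler, which is the same leakage problem described in the previous section.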
Visualization Comparison
import matplotlib.pyplot as plt
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
# Original
axes[0, 0].hist(df["income"], bins=15, edgecolor="black")
axes[0, 0].set_title("Original")
# Min-Max
axes[0, 1].hist(MinMaxScaler().fit_transform(df[["income"]]), bins=15, edgecolor="black")
axes[0, 1].set_title("Min-Max [0, 1]")
# Standard
axes[1, 0].hist(StandardScaler().fit_transform(df[["income"]]), bins=15, edgecolor="black")
axes[1, 0].set_title("Standardization (μ=0, σ=1)")
# Robust
axes[1, 1].hist(RobustScaler().fit_transform(df[["income"]]), bins=15, edgecolor="black")
axes[1, 1].set_title("Robust (median-based)")
plt.tight_layout()
plt.show()
Quick Reference
from sklearn.preprocessing import (
    StandardScaler,  # Z-score: (x - μ) / σ
    MinMaxScaler,    # [0, 1]: (x - min) / (max - min)
    RobustScaler,    # (x - median) / IQR
    MaxAbsScaler,    # [-1, 1]: x / max(|x|)
    Normalizer,      # Scales each row (sample) to unit norm (L1, L2, max)
)
# Usage pattern (always the same):
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_new_scaled = scaler.transform(X_new) # Uses training statistics
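One entry in the list behaves differently from the rest: `Normalizer` rescales each row (sample) to unit norm, whereas the other scalers work column by column (feature by feature). A minimal sketch with made-up values:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Hypothetical data: rows 1 and 2 point in the same direction
# but differ in magnitude.
X = np.array([
    [3.0, 4.0],
    [6.0, 8.0],
    [1.0, 0.0],
])

# Each row is divided by its own L2 norm; no statistics are learned
# from other rows, so fit() has nothing to estimate.
X_unit = Normalizer(norm="l2").fit_transform(X)

print(X_unit)
print(np.linalg.norm(X_unit, axis=1))  # every row now has length 1
```

Rows 1 and 2 become identical after normalization, which suits cases where only direction matters (for example cosine similarity on text vectors). It is not a substitute for putting features on comparable scales.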