Data Cleaning Techniques and Best Practices with Python
In this tutorial, you will learn data cleaning techniques with Python using Pandas to handle missing values, remove duplicates, standardize text and date formats, detect outliers, and build reusable cleaning pipelines.
What You'll Learn
Detect and handle missing data, remove and investigate duplicates, standardize inconsistent formats, detect outliers using IQR and Z-score methods, and create automated cleaning functions.
Why It Matters
Real-world data is always messy. Column names are inconsistent, dates have mixed formats, values are missing, and outliers distort analysis. Cleaning is not optional -- it is the prerequisite for every reliable analysis.
Real-World Use
A healthcare analyst combines patient records from three hospital systems. Names have different formats, dates use MM/DD/YYYY and DD/MM/YYYY, some vital signs have impossible values, and 15 percent of records are duplicates due to data entry errors.
Data Cleaning Workflow
flowchart TD A[Raw Data] --> B[Inspect Structure] B --> C[Handle Missing Values] C --> D[Remove Duplicates] D --> E[Standardize Formats] E --> F[Detect Outliers] F --> G[Validate & Document] G --> H[Clean Dataset]
Handling Missing Values
import pandas as pd
import numpy as np
df = pd.DataFrame({
"name": ["Alice", "Bob", "Charlie", "Diana", None],
"age": [25, None, 35, 28, 45],
"salary": [50000, 60000, None, 55000, 70000],
"department": ["Eng", "Sales", "Eng", None, "Eng"],
})
missing_count = df.isnull().sum()
missing_pct = (missing_count / len(df)) * 100
print("Missing values per column:")
print(pd.DataFrame({"count": missing_count, "percent": missing_pct}))
df["age"] = df["age"].fillna(df["age"].median())
df["salary"] = df["salary"].interpolate()
df["department"] = df["department"].fillna("Unknown")
df = df.dropna(subset=["name"])
print(df)
Output:
Missing values per column:
count percent
name 1 20.0
age 1 20.0
salary 1 20.0
department 1 20.0
name age salary department
0 Alice 25.0 50000.0 Eng
1 Bob 30.0 60000.0 Sales
2 Charlie 35.0 55000.0 Eng
3 Diana 28.0 55000.0 Unknown
4 Peter 45.0 70000.0 Eng
Duplicate Detection and Removal
df = pd.DataFrame({
"id": [1, 2, 2, 3, 4, 4],
"value": [10, 20, 20, 30, 40, 41],
"name": ["A", "B", "B", "C", "D", "D"],
})
full_dupes = df[df.duplicated(keep=False)]
print("Full duplicate rows:")
print(full_dupes)
subset_dupes = df[df.duplicated(subset=["id"], keep="first")]
print("Duplicate IDs (keeping first):")
print(subset_dupes)
df_clean = df.drop_duplicates(subset=["id"], keep="last")
print("Clean data (last ID kept):")
print(df_clean)
Output:
Full duplicate rows:
id value name
1 2 20 B
2 2 20 B
Duplicate IDs (keeping first):
id value name
2 2 20 B
4 4 41 D
Clean data (last ID kept):
id value name
0 1 10 A
2 2 20 B
3 3 30 C
5 4 41 D
Outlier Detection with IQR and Z-Score
data = pd.Series([12, 14, 15, 13, 16, 14, 15, 200, 13, 14, 15, 16, 14, 300])
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
iqr_outliers = data[(data < lower) | (data > upper)]
print(f"IQR outliers: {iqr_outliers.tolist()}")
z_scores = np.abs((data - data.mean()) / data.std())
z_outliers = data[z_scores > 3]
print(f"Z-score outliers: {z_outliers.tolist()}")
data_clean = data[(data >= lower) & (data <= upper)]
print(f"After removal: mean={data_clean.mean():.2f}, std={data_clean.std():.2f}")
Output:
IQR outliers: [200.0, 300.0]
Z-score outliers: [200, 300]
After removal: mean=14.08, std=1.19
Practice Questions
- What is the difference between
fillnawith a constant value andinterpolatefor missing data? - Why should you investigate duplicates instead of blindly dropping them?
- When would you use Z-score over IQR for outlier detection, and vice versa?
Answers:
fillnareplaces missing values with a specified constant or computed statistic.interpolateestimates missing values based on surrounding data points, which works well for time series or ordered data.- Some duplicates may be legitimate (e.g., the same customer making two transactions). Dropping them blindly can lose real signal. Always check why duplicates exist before removing them.
- Z-score assumes normally distributed data and works well for symmetric distributions. IQR makes no distribution assumptions and works better for skewed data. Z-score can miss outliers in small samples where it inflates standard deviation.
Challenge
Load a real-world messy dataset (e.g., the 2015 Flight Delays dataset). Identify and handle missing values in at least three columns using different strategies, detect and justify handling of outliers in numeric columns, standardize text columns to consistent case and format, and write a function that produces a cleaning summary report with counts of each issue found.
FAQs
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
Built by the developers of DodaTech
Doda Browser, DodaZIP & Durga Antivirus Pro