Pandas Data Cleaning — Handling Missing Data and Duplicates
DodaTech
3 min read
In this tutorial, you'll learn about Pandas Data Cleaning. We cover key concepts, practical examples, and best practices to help you understand and apply this topic effectively.
What You'll Learn
Clean messy data with pandas — detect missing values, fill or drop them, remove duplicates, fix data types, handle outliers, and standardize text data.
Why It Matters
Real-world data is messy. 80% of a data scientist's time is spent cleaning data. Mastering data cleaning is essential before any analysis or modeling.
Real-World Use
Cleaning a customer database with missing phone numbers, removing duplicate transactions from a sales report, or standardizing date formats from multiple sources.
Detecting Missing Data
import pandas as pd
import numpy as np
df = pd.DataFrame({
"name": ["Alice", "Bob", None, "Diana"],
"age": [25, None, 35, None],
"salary": [50000, 60000, 70000, None],
"start_date": ["2021-01-01", "2020-05-15", None, "2022-03-01"],
})
# Check for missing values
df.isnull()
df.isnull().sum()
# name 1
# age 2
# salary 1
# start_date 1
# Percentage missing
df.isnull().sum() / len(df) * 100
# Visualize missing data
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 4))
plt.imshow(df.isnull(), cmap="viridis", aspect="auto")
plt.yticks(range(len(df)), df.index)
plt.colorbar(label="Missing")
plt.show()
Handling Missing Values
Drop Missing Values
# Drop rows with ANY missing value
df_clean = df.dropna()
# Drop rows where ALL values are missing
df_clean = df.dropna(how="all")
# Drop rows with missing in specific columns
df_clean = df.dropna(subset=["name", "salary"])
# Drop columns with too many missing
df_clean = df.dropna(thresh=len(df) * 0.5, axis=1)
Fill Missing Values
# Fill with a constant
df["salary"].fillna(0, inplace=True)
# Fill with mean/median/mode
df["age"].fillna(df["age"].mean(), inplace=True) # Mean
df["age"].fillna(df["age"].median(), inplace=True) # Median
df["department"].fillna(df["department"].mode()[0], inplace=True) # Mode
# Forward fill (use previous value)
df["start_date"].fillna(method="ffill", inplace=True)
# Backward fill (use next value)
df["start_date"].fillna(method="bfill", inplace=True)
# Interpolate (linear for numeric)
df["salary"].interpolate(method="linear", inplace=True)
Removing Duplicates
# Check for duplicates
df.duplicated()
df.duplicated().sum()
# Check duplicates based on specific columns
df.duplicated(subset=["email"])
# Drop duplicates
df.drop_duplicates(inplace=True)
# Keep first or last occurrence
df.drop_duplicates(keep="first", inplace=True)
df.drop_duplicates(keep="last", inplace=True)
# Drop duplicates based on subset
df.drop_duplicates(subset=["customer_id"], keep="last", inplace=True)
Fixing Data Types
# Check data types
df.dtypes
# Convert string to numeric
df["salary"] = pd.to_numeric(df["salary"], errors="coerce")
# Convert to datetime
df["start_date"] = pd.to_datetime(df["start_date"], errors="coerce")
# Convert to category
df["department"] = df["department"].astype("category")
# Convert string to boolean
df["active"] = df["active"].map({"Yes": True, "No": False})
Handling Outliers
# Detect outliers with IQR method
Q1 = df["age"].quantile(0.25)
Q3 = df["age"].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
outliers = df[(df["age"] < lower) | (df["age"] > upper)]
# Cap outliers
df["age"] = df["age"].clip(lower, upper)
# Remove outliers
df_clean = df[(df["age"] >= lower) & (df["age"] <= upper)]
Standardizing Text
# Clean text columns
df["name"] = df["name"].str.strip() # Remove whitespace
df["name"] = df["name"].str.lower() # Lowercase
df["name"] = df["name"].str.title() # Title case
# Replace patterns
df["phone"] = df["phone"].str.replace(r"\D", "", regex=True) # Remove non-digits
df["email"] = df["email"].str.strip().str.lower()
Complete Cleaning Workflow
def clean_dataset(df):
"""Standard data cleaning pipeline."""
# 1. Remove duplicates
df = df.drop_duplicates()
# 2. Fix data types
for col in df.select_dtypes("object").columns:
try:
df[col] = pd.to_datetime(df[col])
except (ValueError, TypeError):
pass
# 3. Handle missing values
for col in df.select_dtypes(include=[np.number]).columns:
df[col].fillna(df[col].median(), inplace=True)
# 4. Remove outliers (capping)
for col in df.select_dtypes(include=[np.number]).columns:
Q1, Q3 = df[col].quantile(0.25), df[col].quantile(0.75)
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
df[col] = df[col].clip(lower, upper)
# 5. Standardize text
for col in df.select_dtypes("object").columns:
df[col] = df[col].str.strip().str.lower()
return df
Practice
- Load a dataset with missing values
- Identify which columns have missing data
- Decide on a Strategy for each column (drop, fill, interpolate)
- Remove duplicate rows
- Fix any incorrect data types
- Export the cleaned dataset to a new CSV
← Previous
Pandas Tutorial — DataFrames and Series Explained
Next →
Pandas GroupBy and Aggregation — Analyze Data by Categories
Built by the developers of DodaTech
Doda Browser, DodaZIP & Durga Antivirus Pro