Skip to content

Pandas Data Cleaning — Handling Missing Data and Duplicates

DodaTech 3 min read

In this tutorial, you'll learn about Pandas Data Cleaning. We cover key concepts, practical examples, and best practices to help you understand and apply this topic effectively.

What You'll Learn

Clean messy data with pandas — detect missing values, fill or drop them, remove duplicates, fix data types, handle outliers, and standardize text data.

Why It Matters

Real-world data is messy. 80% of a data scientist's time is spent cleaning data. Mastering data cleaning is essential before any analysis or modeling.

Real-World Use

Cleaning a customer database with missing phone numbers, removing duplicate transactions from a sales report, or standardizing date formats from multiple sources.

Detecting Missing Data

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "name": ["Alice", "Bob", None, "Diana"],
    "age": [25, None, 35, None],
    "salary": [50000, 60000, 70000, None],
    "start_date": ["2021-01-01", "2020-05-15", None, "2022-03-01"],
})

# Check for missing values
df.isnull()
df.isnull().sum()
# name          1
# age           2
# salary        1
# start_date    1

# Percentage missing
df.isnull().sum() / len(df) * 100

# Visualize missing data
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 4))
plt.imshow(df.isnull(), cmap="viridis", aspect="auto")
plt.yticks(range(len(df)), df.index)
plt.colorbar(label="Missing")
plt.show()

Handling Missing Values

Drop Missing Values

# Drop rows with ANY missing value
df_clean = df.dropna()

# Drop rows where ALL values are missing
df_clean = df.dropna(how="all")

# Drop rows with missing in specific columns
df_clean = df.dropna(subset=["name", "salary"])

# Drop columns with too many missing
df_clean = df.dropna(thresh=len(df) * 0.5, axis=1)

Fill Missing Values

# Fill with a constant
df["salary"].fillna(0, inplace=True)

# Fill with mean/median/mode
df["age"].fillna(df["age"].mean(), inplace=True)       # Mean
df["age"].fillna(df["age"].median(), inplace=True)     # Median
df["department"].fillna(df["department"].mode()[0], inplace=True)  # Mode

# Forward fill (use previous value)
df["start_date"].fillna(method="ffill", inplace=True)

# Backward fill (use next value)
df["start_date"].fillna(method="bfill", inplace=True)

# Interpolate (linear for numeric)
df["salary"].interpolate(method="linear", inplace=True)

Removing Duplicates

# Check for duplicates
df.duplicated()
df.duplicated().sum()

# Check duplicates based on specific columns
df.duplicated(subset=["email"])

# Drop duplicates
df.drop_duplicates(inplace=True)

# Keep first or last occurrence
df.drop_duplicates(keep="first", inplace=True)
df.drop_duplicates(keep="last", inplace=True)

# Drop duplicates based on subset
df.drop_duplicates(subset=["customer_id"], keep="last", inplace=True)

Fixing Data Types

# Check data types
df.dtypes

# Convert string to numeric
df["salary"] = pd.to_numeric(df["salary"], errors="coerce")

# Convert to datetime
df["start_date"] = pd.to_datetime(df["start_date"], errors="coerce")

# Convert to category
df["department"] = df["department"].astype("category")

# Convert string to boolean
df["active"] = df["active"].map({"Yes": True, "No": False})

Handling Outliers

# Detect outliers with IQR method
Q1 = df["age"].quantile(0.25)
Q3 = df["age"].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

outliers = df[(df["age"] < lower) | (df["age"] > upper)]

# Cap outliers
df["age"] = df["age"].clip(lower, upper)

# Remove outliers
df_clean = df[(df["age"] >= lower) & (df["age"] <= upper)]

Standardizing Text

# Clean text columns
df["name"] = df["name"].str.strip()           # Remove whitespace
df["name"] = df["name"].str.lower()           # Lowercase
df["name"] = df["name"].str.title()           # Title case

# Replace patterns
df["phone"] = df["phone"].str.replace(r"\D", "", regex=True)  # Remove non-digits
df["email"] = df["email"].str.strip().str.lower()

Complete Cleaning Workflow

def clean_dataset(df):
    """Standard data cleaning pipeline."""

    # 1. Remove duplicates
    df = df.drop_duplicates()

    # 2. Fix data types
    for col in df.select_dtypes("object").columns:
        try:
            df[col] = pd.to_datetime(df[col])
        except (ValueError, TypeError):
            pass

    # 3. Handle missing values
    for col in df.select_dtypes(include=[np.number]).columns:
        df[col].fillna(df[col].median(), inplace=True)

    # 4. Remove outliers (capping)
    for col in df.select_dtypes(include=[np.number]).columns:
        Q1, Q3 = df[col].quantile(0.25), df[col].quantile(0.75)
        IQR = Q3 - Q1
        lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
        df[col] = df[col].clip(lower, upper)

    # 5. Standardize text
    for col in df.select_dtypes("object").columns:
        df[col] = df[col].str.strip().str.lower()

    return df

Practice

  1. Load a dataset with missing values
  2. Identify which columns have missing data
  3. Decide on a Strategy for each column (drop, fill, interpolate)
  4. Remove duplicate rows
  5. Fix any incorrect data types
  6. Export the cleaned dataset to a new CSV

Built by the developers of DodaTech

Doda Browser, DodaZIP & Durga Antivirus Pro