Data Cleaning Techniques and Best Practices with Python

DodaTech 4 min read

In this tutorial, you will learn data cleaning techniques with Python using Pandas to handle missing values, remove duplicates, standardize text and date formats, detect outliers, and build reusable cleaning pipelines.

What You'll Learn

Detect and handle missing data, remove and investigate duplicates, standardize inconsistent formats, detect outliers using IQR and Z-score methods, and create automated cleaning functions.

Why It Matters

Real-world data is always messy. Column names are inconsistent, dates have mixed formats, values are missing, and outliers distort analysis. Cleaning is not optional -- it is the prerequisite for every reliable analysis.

Real-World Use

A healthcare analyst combines patient records from three hospital systems. Names have different formats, dates use MM/DD/YYYY and DD/MM/YYYY, some vital signs have impossible values, and 15 percent of records are duplicates due to data entry errors.

Data Cleaning Workflow

flowchart TD
  A[Raw Data] --> B[Inspect Structure]
  B --> C[Handle Missing Values]
  C --> D[Remove Duplicates]
  D --> E[Standardize Formats]
  E --> F[Detect Outliers]
  F --> G[Validate & Document]
  G --> H[Clean Dataset]

Handling Missing Values

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Diana", None],
    "age": [25, None, 35, 28, 45],
    "salary": [50000, 60000, None, 55000, 70000],
    "department": ["Eng", "Sales", "Eng", None, "Eng"],
})

missing_count = df.isnull().sum()
missing_pct = (missing_count / len(df)) * 100
print("Missing values per column:")
print(pd.DataFrame({"count": missing_count, "percent": missing_pct}))

df["age"] = df["age"].fillna(df["age"].median())
df["salary"] = df["salary"].interpolate()
df["department"] = df["department"].fillna("Unknown")
df = df.dropna(subset=["name"])
print(df)

Output:

Missing values per column:
        count  percent
name        1     20.0
age         1     20.0
salary      1     20.0
department  1     20.0

      name   age   salary department
0    Alice  25.0  50000.0        Eng
1      Bob  30.0  60000.0      Sales
2  Charlie  35.0  55000.0        Eng
3    Diana  28.0  55000.0    Unknown
4    Peter  45.0  70000.0        Eng

Duplicate Detection and Removal

df = pd.DataFrame({
    "id": [1, 2, 2, 3, 4, 4],
    "value": [10, 20, 20, 30, 40, 41],
    "name": ["A", "B", "B", "C", "D", "D"],
})

full_dupes = df[df.duplicated(keep=False)]
print("Full duplicate rows:")
print(full_dupes)

subset_dupes = df[df.duplicated(subset=["id"], keep="first")]
print("Duplicate IDs (keeping first):")
print(subset_dupes)

df_clean = df.drop_duplicates(subset=["id"], keep="last")
print("Clean data (last ID kept):")
print(df_clean)

Output:

Full duplicate rows:
   id  value name
1   2     20    B
2   2     20    B

Duplicate IDs (keeping first):
   id  value name
2   2     20    B
4   4     41    D

Clean data (last ID kept):
   id  value name
0   1     10    A
2   2     20    B
3   3     30    C
5   4     41    D

Outlier Detection with IQR and Z-Score

data = pd.Series([12, 14, 15, 13, 16, 14, 15, 200, 13, 14, 15, 16, 14, 300])

Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
iqr_outliers = data[(data < lower) | (data > upper)]
print(f"IQR outliers: {iqr_outliers.tolist()}")

z_scores = np.abs((data - data.mean()) / data.std())
z_outliers = data[z_scores > 3]
print(f"Z-score outliers: {z_outliers.tolist()}")

data_clean = data[(data >= lower) & (data <= upper)]
print(f"After removal: mean={data_clean.mean():.2f}, std={data_clean.std():.2f}")

Output:

IQR outliers: [200.0, 300.0]
Z-score outliers: [200, 300]
After removal: mean=14.08, std=1.19

Practice Questions

What is the difference between fillna with a constant value and interpolate for missing data?
Why should you investigate duplicates instead of blindly dropping them?
When would you use Z-score over IQR for outlier detection, and vice versa?

Answers:

fillna replaces missing values with a specified constant or computed statistic. interpolate estimates missing values based on surrounding data points, which works well for time series or ordered data.
Some duplicates may be legitimate (e.g., the same customer making two transactions). Dropping them blindly can lose real signal. Always check why duplicates exist before removing them.
Z-score assumes normally distributed data and works well for symmetric distributions. IQR makes no distribution assumptions and works better for skewed data. Z-score can miss outliers in small samples where it inflates standard deviation.

Challenge

Load a real-world messy dataset (e.g., the 2015 Flight Delays dataset). Identify and handle missing values in at least three columns using different strategies, detect and justify handling of outliers in numeric columns, standardize text columns to consistent case and format, and write a function that produces a cleaning summary report with counts of each issue found.

FAQs

Should I remove or impute missing values?

Remove rows with missing values when the missing rate is low (under 5 percent) and data is MCAR (missing completely at random). Impute when missing rate is higher or when dropping would lose too much data. The safest approach is to try both and compare results.

How do I detect outliers in non-numeric data?

For categorical data, look for rare categories or typos using value counts. For text, use string length distributions or pattern matching. For dates, check for future dates, impossibly old dates, or inconsistent formats. Outlier detection in non-numeric data is more heuristic than statistical.

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.

← Previous Statistical Hypothesis Testing Guide with Python Next → Advanced NumPy Operations — Broadcasting, Vectorization, and Performance

Built by the developers of DodaTech

Doda Browser, DodaZIP & Durga Antivirus Pro

Home Browse Data Science