Data Wrangling with Pandas — Reshape, Pivot, and Stack
In this tutorial, you'll learn about Data Wrangling with Pandas. We cover key concepts, practical examples, and best practices to help you understand and apply this topic effectively.
What You'll Learn
Reshape and wrangle data with pandas — pivot tables, melt, stack/unstack, crosstab, and transform between wide and long formats for different analysis needs.
Why It Matters
Data comes in many shapes. Some analyses need wide format (one row per subject), others need long format (one row per observation). Wrangling is the skill of converting between them.
Real-World Use
Creating a monthly sales summary with pivot tables, converting survey responses from wide to long format for analysis, or creating a cross-tabulation of customer segments.
Wide vs Long Format
Wide (one row per subject):
┌─────┬──────┬──────┬──────┐
│ ID │ Q1 │ Q2 │ Q3 │
├─────┼──────┼──────┼──────┤
│ 1 │ Yes │ No │ Yes │
│ 2 │ No │ Yes │ No │
└─────┴──────┴──────┴──────┘
Long (one row per observation):
┌─────┬─────┬──────┐
│ ID │ Q │ Ans │
├─────┼─────┼──────┤
│ 1 │ Q1 │ Yes │
│ 1 │ Q2 │ No │
│ 1 │ Q3 │ Yes │
│ 2 │ Q1 │ No │
└─────┴─────┴──────┘
Pivot — Wide Format
pivot() reshapes from long to wide:
import pandas as pd
# Long format data
df = pd.DataFrame({
"date": ["Jan", "Jan", "Feb", "Feb", "Mar", "Mar"],
"product": ["A", "B", "A", "B", "A", "B"],
"sales": [100, 150, 120, 140, 110, 160],
})
# Pivot to wide
pivoted = df.pivot(index="date", columns="product", values="sales")
# product A B
# date
# Jan 100 150
# Feb 120 140
# Mar 110 160
Pivot Table — With Aggregation
pivot_table() handles duplicates and aggregation:
# If you have duplicate entries
df = pd.DataFrame({
"department": ["Eng", "Eng", "Sales", "Sales", "Eng"],
"year": [2023, 2024, 2023, 2024, 2023],
"salary": [80000, 85000, 60000, 65000, 90000],
})
# pivot_table with mean aggregation
pd.pivot_table(
df,
values="salary",
index="department",
columns="year",
aggfunc="mean",
)
# year 2023 2024
# department
# Eng 85000 85000
# Sales 60000 65000
# Multiple agg functions
pd.pivot_table(
df,
values="salary",
index="department",
columns="year",
aggfunc=["mean", "sum", "count"],
)
# Margins (totals)
pd.pivot_table(
df,
values="salary",
index="department",
columns="year",
aggfunc="mean",
margins=True,
margins_name="Total",
)
Melt — Long Format
melt() reshapes from wide to long:
# Wide format data
df = pd.DataFrame({
"id": [1, 2, 3],
"name": ["Alice", "Bob", "Charlie"],
"math": [85, 92, 78],
"science": [90, 88, 85],
"english": [88, 95, 80],
})
# Melt to long format
melted = pd.melt(
df,
id_vars=["id", "name"],
value_vars=["math", "science", "english"],
var_name="subject",
value_name="score",
)
# id name subject score
# 0 1 Alice math 85
# 1 2 Bob math 92
# 2 3 Charlie math 78
# 3 1 Alice science 90
# ...
This is useful for:
- Plotting with Seaborn (which expects long format)
- Statistical analysis (ANOVA, mixed models)
- Database storage
Stack and Unstack
stack() pivots columns to rows. unstack() pivots rows to columns:
# Multi-level index
df = pd.DataFrame({
"city": ["NYC", "NYC", "LA", "LA"],
"year": [2023, 2024, 2023, 2024],
"population": [8.4, 8.5, 3.8, 3.9],
}).set_index(["city", "year"])
# Unstack: rows → columns
unstacked = df.unstack()
# population
# year 2023 2024
# city
# LA 3.8 3.9
# NYC 8.4 8.5
# Stack: columns → rows
stacked = unstacked.stack()
# city year
# LA 2023 3.8
# 2024 3.9
# NYC 2023 8.4
# 2024 8.5
Crosstab — Frequency Tables
df = pd.DataFrame({
"department": ["Eng", "Sales", "Eng", "Eng", "Sales", "Marketing"],
"gender": ["M", "F", "M", "M", "F", "F"],
"satisfied": ["Yes", "No", "Yes", "No", "Yes", "Yes"],
})
# Simple crosstab
pd.crosstab(df["department"], df["gender"])
# gender F M
# department
# Eng 0 3
# Marketing 1 0
# Sales 2 0
# With margins
pd.crosstab(df["department"], df["gender"], margins=True)
# With values and aggregation
pd.crosstab(
df["department"],
df["gender"],
values=df["satisfied"],
aggfunc=lambda x: (x == "Yes").sum(),
)
Practical Wrangling Pipeline
def wrangle_survey_data(df):
"""Transform survey data from wide to analysis-ready format."""
# Melt to long format
id_cols = ["respondent_id", "age", "gender"]
question_cols = [c for c in df.columns if c.startswith("q")]
long_df = pd.melt(
df,
id_vars=id_cols,
value_vars=question_cols,
var_name="question",
value_name="response",
)
# Pivot table for summary
summary = pd.pivot_table(
long_df,
values="response",
index="question",
columns="gender",
aggfunc=lambda x: (x == "Yes").mean(), # % Yes
)
return long_df, summary
# Usage
long_df, summary = wrangle_survey_data(survey_df)
Quick Reference
| Function | What It Does | Analogous To |
|---|---|---|
pivot() |
Reshape (no aggregation) | Spread (R) |
pivot_table() |
Reshape + aggregate | Pivot table (Excel) |
melt() |
Wide → Long | Gather (R) |
stack() |
Columns → Rows | — |
unstack() |
Rows → Columns | — |
crosstab() |
Frequency table | Contingency table |
Built by the developers of DodaTech
Doda Browser, DodaZIP & Durga Antivirus Pro