Data Wrangling with Pandas — Reshape, Pivot, and Stack

DodaTech 3 min read

In this tutorial, you'll learn about Data Wrangling with Pandas. We cover key concepts, practical examples, and best practices to help you understand and apply this topic effectively.

What You'll Learn

Reshape and wrangle data with pandas — pivot tables, melt, stack/unstack, crosstab, and transform between wide and long formats for different analysis needs.

Why It Matters

Data comes in many shapes. Some analyses need wide format (one row per subject), others need long format (one row per observation). Wrangling is the skill of converting between them.

Real-World Use

Creating a monthly sales summary with pivot tables, converting survey responses from wide to long format for analysis, or creating a cross-tabulation of customer segments.

Wide vs Long Format

Wide (one row per subject):
┌─────┬──────┬──────┬──────┐
│ ID  │ Q1   │ Q2   │ Q3   │
├─────┼──────┼──────┼──────┤
│ 1   │ Yes  │ No   │ Yes  │
│ 2   │ No   │ Yes  │ No   │
└─────┴──────┴──────┴──────┘

Long (one row per observation):
┌─────┬─────┬──────┐
│ ID  │ Q   │ Ans  │
├─────┼─────┼──────┤
│ 1   │ Q1  │ Yes  │
│ 1   │ Q2  │ No   │
│ 1   │ Q3  │ Yes  │
│ 2   │ Q1  │ No   │
└─────┴─────┴──────┘

Pivot — Wide Format

pivot() reshapes from long to wide:

import pandas as pd

# Long format data
df = pd.DataFrame({
    "date": ["Jan", "Jan", "Feb", "Feb", "Mar", "Mar"],
    "product": ["A", "B", "A", "B", "A", "B"],
    "sales": [100, 150, 120, 140, 110, 160],
})

# Pivot to wide
pivoted = df.pivot(index="date", columns="product", values="sales")
# product    A    B
# date
# Jan      100  150
# Feb      120  140
# Mar      110  160

Pivot Table — With Aggregation

pivot_table() handles duplicates and aggregation:

# If you have duplicate entries
df = pd.DataFrame({
    "department": ["Eng", "Eng", "Sales", "Sales", "Eng"],
    "year": [2023, 2024, 2023, 2024, 2023],
    "salary": [80000, 85000, 60000, 65000, 90000],
})

# pivot_table with mean aggregation
pd.pivot_table(
    df,
    values="salary",
    index="department",
    columns="year",
    aggfunc="mean",
)
# year          2023    2024
# department
# Eng          85000   85000
# Sales        60000   65000

# Multiple agg functions
pd.pivot_table(
    df,
    values="salary",
    index="department",
    columns="year",
    aggfunc=["mean", "sum", "count"],
)

# Margins (totals)
pd.pivot_table(
    df,
    values="salary",
    index="department",
    columns="year",
    aggfunc="mean",
    margins=True,
    margins_name="Total",
)

Melt — Long Format

melt() reshapes from wide to long:

# Wide format data
df = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
    "math": [85, 92, 78],
    "science": [90, 88, 85],
    "english": [88, 95, 80],
})

# Melt to long format
melted = pd.melt(
    df,
    id_vars=["id", "name"],
    value_vars=["math", "science", "english"],
    var_name="subject",
    value_name="score",
)
#     id     name  subject  score
# 0    1    Alice     math     85
# 1    2      Bob     math     92
# 2    3  Charlie     math     78
# 3    1    Alice  science     90
# ...

This is useful for:

Plotting with Seaborn (which expects long format)
Statistical analysis (ANOVA, mixed models)
Database storage

Stack and Unstack

stack() pivots columns to rows. unstack() pivots rows to columns:

# Multi-level index
df = pd.DataFrame({
    "city": ["NYC", "NYC", "LA", "LA"],
    "year": [2023, 2024, 2023, 2024],
    "population": [8.4, 8.5, 3.8, 3.9],
}).set_index(["city", "year"])

# Unstack: rows → columns
unstacked = df.unstack()
#         population
# year         2023  2024
# city
# LA            3.8   3.9
# NYC           8.4   8.5

# Stack: columns → rows
stacked = unstacked.stack()
# city  year
# LA    2023    3.8
#       2024    3.9
# NYC   2023    8.4
#       2024    8.5

Crosstab — Frequency Tables

df = pd.DataFrame({
    "department": ["Eng", "Sales", "Eng", "Eng", "Sales", "Marketing"],
    "gender": ["M", "F", "M", "M", "F", "F"],
    "satisfied": ["Yes", "No", "Yes", "No", "Yes", "Yes"],
})

# Simple crosstab
pd.crosstab(df["department"], df["gender"])
# gender      F  M
# department
# Eng         0  3
# Marketing   1  0
# Sales       2  0

# With margins
pd.crosstab(df["department"], df["gender"], margins=True)

# With values and aggregation
pd.crosstab(
    df["department"],
    df["gender"],
    values=df["satisfied"],
    aggfunc=lambda x: (x == "Yes").sum(),
)

Practical Wrangling Pipeline

def wrangle_survey_data(df):
    """Transform survey data from wide to analysis-ready format."""

    # Melt to long format
    id_cols = ["respondent_id", "age", "gender"]
    question_cols = [c for c in df.columns if c.startswith("q")]

    long_df = pd.melt(
        df,
        id_vars=id_cols,
        value_vars=question_cols,
        var_name="question",
        value_name="response",
    )

    # Pivot table for summary
    summary = pd.pivot_table(
        long_df,
        values="response",
        index="question",
        columns="gender",
        aggfunc=lambda x: (x == "Yes").mean(),  # % Yes
    )

    return long_df, summary

# Usage
long_df, summary = wrangle_survey_data(survey_df)

Quick Reference

Function	What It Does	Analogous To
`pivot()`	Reshape (no aggregation)	Spread (R)
`pivot_table()`	Reshape + aggregate	Pivot table (Excel)
`melt()`	Wide → Long	Gather (R)
`stack()`	Columns → Rows	—
`unstack()`	Rows → Columns	—
`crosstab()`	Frequency table	Contingency table

← Previous Seaborn Tutorial — Statistical Data Visualization Next → Time Series Analysis with Pandas

Built by the developers of DodaTech

Doda Browser, DodaZIP & Durga Antivirus Pro

Home Browse Data Science