Pandas GroupBy and Aggregation — Analyze Data by Categories
In this tutorial, you'll learn about Pandas GroupBy and Aggregation. We cover key concepts, practical examples, and best practices to help you understand and apply this topic effectively.
What You'll Learn
Use pandas GroupBy to split data into groups, apply aggregation functions (sum, mean, count), and analyze patterns within categories.
Why It Matters
GroupBy is the most powerful pandas feature for data analysis. It answers questions like "what's the average salary per department?" or "total sales by region."
Real-World Use
Calculating monthly revenue per product category, finding the average customer age by region, or counting orders per sales rep.
Basic GroupBy
import pandas as pd
df = pd.DataFrame({
"department": ["Engineering", "Sales", "Engineering", "Sales", "Marketing"],
"employee": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
"salary": [80000, 60000, 90000, 55000, 65000],
"years_exp": [5, 3, 8, 2, 4],
})
# Group by department and calculate mean
df.groupby("department")["salary"].mean()
# department
# Engineering 85000
# Marketing 65000
# Sales 57500
What Happens During GroupBy
Original DataFrame
┌─────────────┬───────┐
│ Engineering │ 80000 │ Split
│ Sales │ 60000 │ into
│ Engineering │ 90000 │ groups
│ Sales │ 55000 │ by dept
│ Marketing │ 65000 │
└─────────────┴───────┘
Engineering: [80000, 90000]
Sales: [60000, 55000]
Marketing: [65000]
│ Apply function
▼
Engineering: 85000 (mean)
Sales: 57500 (mean)
Marketing: 65000 (mean)
Multiple Aggregations
# Multiple functions on one column
df.groupby("department")["salary"].agg(["mean", "sum", "count", "std"])
# mean sum count std
# Engineering 85000 170000 2 7071.0678
# Marketing 65000 65000 1 NaN
# Sales 57500 115000 2 3535.5339
# Different functions per column
df.groupby("department").agg({
"salary": ["mean", "max"],
"years_exp": "mean",
})
# salary years_exp
# mean max mean
# Engineering 85000 90000 6.500000
# Marketing 65000 65000 4.000000
# Sales 57500 60000 2.500000
Common Aggregation Functions
| Function | What It Calculates |
|---|---|
sum() |
Total |
mean() |
Average |
median() |
Middle value |
min() / max() |
Min / max |
count() |
Number of non-null values |
nunique() |
Number of unique values |
std() / var() |
Standard deviation / variance |
first() / last() |
First / last value in group |
describe() |
All statistics |
Grouping by Multiple Columns
df = pd.read_csv("sales.csv")
# Group by multiple columns
df.groupby(["region", "product_category"])["revenue"].sum()
# Unstack to create a pivot-like table
df.groupby(["region", "product_category"])["revenue"].sum().unstack()
# product_category Electronics Clothing Food
# region
# North 50000 30000 20000
# South 40000 25000 35000
Transform — Per-Group Operations
transform() returns a Series with the same index as the original:
# Add a column with the department average
df["dept_avg_salary"] = df.groupby("department")["salary"].transform("mean")
# Calculate how each salary compares to department average
df["salary_vs_dept_avg"] = df["salary"] - df["dept_avg_salary"]
print(df[["employee", "department", "salary", "salary_vs_dept_avg"]])
# employee department salary salary_vs_dept_avg
# 0 Alice Engineering 80000 -5000
# 1 Bob Sales 60000 2500
# 2 Charlie Engineering 90000 5000
# 3 Diana Sales 55000 -2500
# 4 Eve Marketing 65000 0
Filter — Keep or Remove Groups
# Keep departments with total salary > 100000
df.groupby("department").filter(lambda x: x["salary"].sum() > 100000)
# department employee salary years_exp
# 0 Engineering Alice 80000 5
# 2 Engineering Charlie 90000 8
# 1 Sales Bob 60000 3
# 3 Sales Diana 55000 2
Apply — Custom Functions
# Apply a custom function per group
def top_earner(group):
return group.nlargest(1, "salary")
df.groupby("department").apply(top_earner)
# department employee salary years_exp
# department
# Engineering 2 Engineering Charlie 90000 8
# Marketing 4 Marketing Eve 65000 4
# Sales 1 Sales Bob 60000 3
Working with Index After GroupBy
# GroupBy result has a MultiIndex
result = df.groupby(["department"])["salary"].agg(["mean", "count"])
print(result.index)
# Index(['Engineering', 'Marketing', 'Sales'], dtype='object')
# Reset to regular columns
result.reset_index(inplace=True)
# department mean count
# 0 Engineering 85000 2
# 1 Marketing 65000 1
# 2 Sales 57500 2
Practice
# 1. Group by a categorical column
# 2. Calculate mean, sum, and count per group
# 3. Group by two columns and unstack
# 4. Use transform to add a group-level metric
# 5. Use filter to remove small groups
Built by the developers of DodaTech
Doda Browser, DodaZIP & Durga Antivirus Pro