Pandas GroupBy and Aggregation — Analyze Data by Categories

DodaTech 3 min read

In this tutorial, you'll learn about Pandas GroupBy and Aggregation. We cover key concepts, practical examples, and best practices to help you understand and apply this topic effectively.

What You'll Learn

Use pandas GroupBy to split data into groups, apply aggregation functions (sum, mean, count), and analyze patterns within categories.

Why It Matters

GroupBy is the most powerful pandas feature for data analysis. It answers questions like "what's the average salary per department?" or "total sales by region."

Real-World Use

Calculating monthly revenue per product category, finding the average customer age by region, or counting orders per sales rep.

Basic GroupBy

import pandas as pd

df = pd.DataFrame({
    "department": ["Engineering", "Sales", "Engineering", "Sales", "Marketing"],
    "employee": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
    "salary": [80000, 60000, 90000, 55000, 65000],
    "years_exp": [5, 3, 8, 2, 4],
})

# Group by department and calculate mean
df.groupby("department")["salary"].mean()
# department
# Engineering    85000
# Marketing      65000
# Sales          57500

What Happens During GroupBy

Original DataFrame
┌─────────────┬───────┐
│ Engineering │ 80000 │     Split
│ Sales       │ 60000 │     into
│ Engineering │ 90000 │     groups
│ Sales       │ 55000 │     by dept
│ Marketing   │ 65000 │
└─────────────┴───────┘
    
    Engineering: [80000, 90000]
    Sales:       [60000, 55000]
    Marketing:   [65000]

           │ Apply function
           ▼

    Engineering: 85000 (mean)
    Sales:       57500 (mean)
    Marketing:   65000 (mean)

Multiple Aggregations

# Multiple functions on one column
df.groupby("department")["salary"].agg(["mean", "sum", "count", "std"])
#               mean    sum  count          std
# Engineering  85000  170000      2   7071.0678
# Marketing    65000   65000      1         NaN
# Sales        57500  115000      2   3535.5339

# Different functions per column
df.groupby("department").agg({
    "salary": ["mean", "max"],
    "years_exp": "mean",
})
#           salary           years_exp
#             mean    max       mean
# Engineering  85000  90000  6.500000
# Marketing    65000  65000  4.000000
# Sales        57500  60000  2.500000

Common Aggregation Functions

Function	What It Calculates
`sum()`	Total
`mean()`	Average
`median()`	Middle value
`min()` / `max()`	Min / max
`count()`	Number of non-null values
`nunique()`	Number of unique values
`std()` / `var()`	Standard deviation / variance
`first()` / `last()`	First / last value in group
`describe()`	All statistics

Grouping by Multiple Columns

df = pd.read_csv("sales.csv")

# Group by multiple columns
df.groupby(["region", "product_category"])["revenue"].sum()

# Unstack to create a pivot-like table
df.groupby(["region", "product_category"])["revenue"].sum().unstack()
# product_category  Electronics  Clothing  Food
# region
# North                    50000     30000  20000
# South                    40000     25000  35000

Transform — Per-Group Operations

transform() returns a Series with the same index as the original:

# Add a column with the department average
df["dept_avg_salary"] = df.groupby("department")["salary"].transform("mean")

# Calculate how each salary compares to department average
df["salary_vs_dept_avg"] = df["salary"] - df["dept_avg_salary"]

print(df[["employee", "department", "salary", "salary_vs_dept_avg"]])
#   employee   department  salary  salary_vs_dept_avg
# 0    Alice  Engineering   80000              -5000
# 1      Bob        Sales   60000               2500
# 2  Charlie  Engineering   90000               5000
# 3    Diana        Sales   55000              -2500
# 4      Eve    Marketing   65000                  0

Filter — Keep or Remove Groups

# Keep departments with total salary > 100000
df.groupby("department").filter(lambda x: x["salary"].sum() > 100000)
#   department   employee  salary  years_exp
# 0 Engineering     Alice   80000          5
# 2 Engineering   Charlie   90000          8
# 1      Sales       Bob   60000          3
# 3      Sales     Diana   55000          2

Apply — Custom Functions

# Apply a custom function per group
def top_earner(group):
    return group.nlargest(1, "salary")

df.groupby("department").apply(top_earner)
#                   department employee  salary  years_exp
# department
# Engineering 2  Engineering Charlie   90000          8
# Marketing   4    Marketing     Eve   65000          4
# Sales       1        Sales     Bob   60000          3

Working with Index After GroupBy

# GroupBy result has a MultiIndex
result = df.groupby(["department"])["salary"].agg(["mean", "count"])
print(result.index)
# Index(['Engineering', 'Marketing', 'Sales'], dtype='object')

# Reset to regular columns
result.reset_index(inplace=True)
#     department    mean  count
# 0  Engineering  85000      2
# 1    Marketing  65000      1
# 2       Sales   57500      2

Practice

# 1. Group by a categorical column
# 2. Calculate mean, sum, and count per group
# 3. Group by two columns and unstack
# 4. Use transform to add a group-level metric
# 5. Use filter to remove small groups

← Previous Pandas Data Cleaning — Handling Missing Data and Duplicates Next → Pandas Merge, Join, and Concatenate — Combining DataFrames

Built by the developers of DodaTech

Doda Browser, DodaZIP & Durga Antivirus Pro

Home Browse Data Science