Pandas Data Manipulation Guide — Filter, Group, and Transform

DodaTech 3 min read

In this tutorial, you will learn pandas data manipulation techniques including filtering, grouping, aggregation, custom transformations with apply, and reshaping between wide and long formats.

What You'll Learn

Use Pandas to filter DataFrames with boolean indexing, group data for aggregation, apply custom functions, pivot and melt tables, and chain operations for clean analysis pipelines.

Why It Matters

Raw data is never analysis-ready. Data manipulation accounts for 60-80 percent of a data scientist's work. Mastering these operations separates efficient analysts from those who struggle with messy data.

Real-World Use

A financial analyst filters daily transactions for fraud flags, groups by merchant category, computes average fraud amount per category, and pivots the results to compare monthly trends across merchant types.

Data Manipulation Workflow

flowchart LR
  A[Raw DataFrame] --> B[Filter Rows]
  A --> C[Select Columns]
  B --> D[Group By]
  C --> D
  D --> E[Aggregate]
  E --> F[Transform]
  F --> G[Reshape]
  G --> H[Clean Output]

Filtering and Conditional Selection

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "product": ["A", "B", "C", "D", "E"],
    "category": ["Electronics", "Clothing", "Electronics", "Food", "Clothing"],
    "price": [299, 49, 199, 5, 89],
    "units_sold": [150, 400, 220, 980, 310],
    "revenue": [44850, 19600, 43780, 4900, 27590],
})

expensive = df[df["price"] > 100]
print(expensive)

electronics_high = df[(df["category"] == "Electronics") & (df["units_sold"] > 150)]
print(electronics_high)

Output:

  product     category  price  units_sold  revenue
0       A  Electronics    299         150    44850
2       C  Electronics    199         220    43780

  product     category  price  units_sold  revenue
2       C  Electronics    199         220    43780

Grouping and Multi-Function Aggregation

grouped = df.groupby("category").agg({
    "price": ["mean", "min", "max"],
    "units_sold": "sum",
    "revenue": "sum",
})
print(grouped)

grouped.columns = ["avg_price", "min_price", "max_price", "total_units", "total_revenue"]
grouped = grouped.reset_index()
print(grouped)

Output:

               price               units_sold  revenue
                mean min max        sum      sum
category
Clothing      69.0  49  89        710    47190
Electronics  249.0 199 299        370    88630
Food           5.0   5   5        980     4900

  category  avg_price  min_price  max_price  total_units  total_revenue
0  Clothing       69.0         49         89          710          47190
1 Electronics    249.0        199        299          370          88630
2       Food       5.0          5          5          980           4900

Custom Transformations with Apply

def revenue_tier(rev):
    if rev > 40000:
        return "High"
    elif rev > 20000:
        return "Medium"
    else:
        return "Low"

df["revenue_tier"] = df["revenue"].apply(revenue_tier)

df["price_per_unit"] = df.apply(
    lambda row: round(row["revenue"] / row["units_sold"], 2), axis=1
)
print(df[["product", "revenue", "revenue_tier", "price_per_unit"]])

Output:

  product  revenue revenue_tier  price_per_unit
0       A    44850         High          299.00
1       B    19600          Low           49.00
2       C    43780         High          199.00
3       D     4900          Low            5.00
4       E    27590       Medium           89.00

Practice Questions

What is the difference between df[df["col"] > 5] and df.query("col > 5")?
How do you apply different aggregation functions to different columns with groupby?
Explain how apply differs from transform when used with a groupby object.

Answers:

Both filter rows. query uses string expression syntax and can be faster on large DataFrames, while bracket filtering uses native Python boolean indexing.
Pass a dictionary to agg() mapping column names to functions, e.g., df.groupby("cat").agg({"sales": "sum", "price": "mean"}).
apply returns a DataFrame with the same shape as the grouped result. transform returns a DataFrame with the same shape as the original, broadcasting the aggregated value back to each row.

Challenge

Use the tips dataset from Seaborn. Group by day and compute total bill mean, tip mean, and count per day. Then add a column showing what percentage each day's total bill contributes to the weekly total. Finally, pivot the result so days are columns and statistics are rows.

FAQs

When should I use apply vs a vectorized operation?

Always prefer vectorized operations (df["col"] * 2) over apply when possible. Vectorized operations run in C, are orders of magnitude faster, and consume less memory. Use apply only when you need a custom function that cannot be expressed vectorially.

How do I chain multiple pandas operations cleanly?

Use the .pipe() method to chain functions that take a DataFrame as the first argument. For built-in methods, use method chaining directly: df.query("price > 10").groupby("cat").agg({"sales": "sum"}).reset_index().

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.

← Previous Data Visualization with Matplotlib and Seaborn — Complete Guide Next → Time Series Analysis with Python — Complete Guide

Built by the developers of DodaTech

Doda Browser, DodaZIP & Durga Antivirus Pro

Home Browse Data Science