Pandas Data Manipulation Guide — Filter, Group, and Transform
In this tutorial, you will learn pandas data manipulation techniques including filtering, grouping, aggregation, custom transformations with apply, and reshaping between wide and long formats.
What You'll Learn
Use Pandas to filter DataFrames with boolean indexing, group data for aggregation, apply custom functions, pivot and melt tables, and chain operations for clean analysis pipelines.
Why It Matters
Raw data is never analysis-ready. Data manipulation accounts for 60-80 percent of a data scientist's work. Mastering these operations separates efficient analysts from those who struggle with messy data.
Real-World Use
A financial analyst filters daily transactions for fraud flags, groups by merchant category, computes average fraud amount per category, and pivots the results to compare monthly trends across merchant types.
Data Manipulation Workflow
flowchart LR A[Raw DataFrame] --> B[Filter Rows] A --> C[Select Columns] B --> D[Group By] C --> D D --> E[Aggregate] E --> F[Transform] F --> G[Reshape] G --> H[Clean Output]
Filtering and Conditional Selection
import pandas as pd
import numpy as np
df = pd.DataFrame({
"product": ["A", "B", "C", "D", "E"],
"category": ["Electronics", "Clothing", "Electronics", "Food", "Clothing"],
"price": [299, 49, 199, 5, 89],
"units_sold": [150, 400, 220, 980, 310],
"revenue": [44850, 19600, 43780, 4900, 27590],
})
expensive = df[df["price"] > 100]
print(expensive)
electronics_high = df[(df["category"] == "Electronics") & (df["units_sold"] > 150)]
print(electronics_high)
Output:
product category price units_sold revenue
0 A Electronics 299 150 44850
2 C Electronics 199 220 43780
product category price units_sold revenue
2 C Electronics 199 220 43780
Grouping and Multi-Function Aggregation
grouped = df.groupby("category").agg({
"price": ["mean", "min", "max"],
"units_sold": "sum",
"revenue": "sum",
})
print(grouped)
grouped.columns = ["avg_price", "min_price", "max_price", "total_units", "total_revenue"]
grouped = grouped.reset_index()
print(grouped)
Output:
price units_sold revenue
mean min max sum sum
category
Clothing 69.0 49 89 710 47190
Electronics 249.0 199 299 370 88630
Food 5.0 5 5 980 4900
category avg_price min_price max_price total_units total_revenue
0 Clothing 69.0 49 89 710 47190
1 Electronics 249.0 199 299 370 88630
2 Food 5.0 5 5 980 4900
Custom Transformations with Apply
def revenue_tier(rev):
if rev > 40000:
return "High"
elif rev > 20000:
return "Medium"
else:
return "Low"
df["revenue_tier"] = df["revenue"].apply(revenue_tier)
df["price_per_unit"] = df.apply(
lambda row: round(row["revenue"] / row["units_sold"], 2), axis=1
)
print(df[["product", "revenue", "revenue_tier", "price_per_unit"]])
Output:
product revenue revenue_tier price_per_unit
0 A 44850 High 299.00
1 B 19600 Low 49.00
2 C 43780 High 199.00
3 D 4900 Low 5.00
4 E 27590 Medium 89.00
Practice Questions
- What is the difference between
df[df["col"] > 5]anddf.query("col > 5")? - How do you apply different aggregation functions to different columns with
groupby? - Explain how
applydiffers fromtransformwhen used with a groupby object.
Answers:
- Both filter rows.
queryuses string expression syntax and can be faster on large DataFrames, while bracket filtering uses native Python boolean indexing. - Pass a dictionary to
agg()mapping column names to functions, e.g.,df.groupby("cat").agg({"sales": "sum", "price": "mean"}). applyreturns a DataFrame with the same shape as the grouped result.transformreturns a DataFrame with the same shape as the original, broadcasting the aggregated value back to each row.
Challenge
Use the tips dataset from Seaborn. Group by day and compute total bill mean, tip mean, and count per day. Then add a column showing what percentage each day's total bill contributes to the weekly total. Finally, pivot the result so days are columns and statistics are rows.
FAQs
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
Built by the developers of DodaTech
Doda Browser, DodaZIP & Durga Antivirus Pro