Pandas Merge, Join, and Concatenate — Combining DataFrames

DodaTech 3 min read

In this tutorial, you'll learn about Pandas Merge, Join, and Concatenate. We cover key concepts, practical examples, and best practices to help you understand and apply this topic effectively.

What You'll Learn

Combine multiple DataFrames using merge, join, and concat — SQL-style joins (inner, outer, left, right), concatenating rows and columns, and handling mismatched data.

Why It Matters

Real-world data comes from multiple sources. Merging and joining is how you combine customer tables with orders, or sales data with product information.

Real-World Use

Joining a customer table with an orders table on customer_id, appending monthly sales reports into one DataFrame, or combining product details with inventory levels.

Concatenation — Stacking DataFrames

pd.concat() stacks DataFrames vertically (rows) or horizontally (columns):

Vertical (Appending Rows)

import pandas as pd

df1 = pd.DataFrame({"id": [1, 2], "name": ["Alice", "Bob"]})
df2 = pd.DataFrame({"id": [3, 4], "name": ["Charlie", "Diana"]})

# Stack rows
combined = pd.concat([df1, df2], ignore_index=True)
#    id     name
# 0   1    Alice
# 1   2      Bob
# 2   3  Charlie
# 3   4    Diana

Horizontal (Adding Columns)

df_left = pd.DataFrame({"id": [1, 2, 3], "name": ["Alice", "Bob", "Charlie"]})
df_right = pd.DataFrame({"age": [25, 30, 35], "salary": [50000, 60000, 70000]})

combined = pd.concat([df_left, df_right], axis=1)
#    id     name  age  salary
# 0   1    Alice   25   50000
# 1   2      Bob   30   60000
# 2   3  Charlie   35   70000

Merge — SQL-Style Joins

pd.merge() combines DataFrames based on a key column:

customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "name": ["Alice", "Bob", "Charlie", "Diana"],
    "city": ["NYC", "LA", "Chicago", "NYC"],
})

orders = pd.DataFrame({
    "order_id": [101, 102, 103, 104],
    "customer_id": [1, 2, 2, 5],
    "amount": [100, 200, 150, 300],
})

Inner Join

Only rows with matching keys in both DataFrames:

result = pd.merge(customers, orders, on="customer_id", how="inner")
#    customer_id     name      city  order_id  amount
# 0            1    Alice       NYC       101     100
# 1            2      Bob        LA       102     200
# 2            2      Bob        LA       103     150

Left Join

All rows from left (customers), matching rows from right:

result = pd.merge(customers, orders, on="customer_id", how="left")
#    customer_id     name      city  order_id  amount
# 0            1    Alice       NYC     101.0   100.0
# 1            2      Bob        LA     102.0   200.0
# 2            2      Bob        LA     103.0   150.0
# 3            3  Charlie    Chicago       NaN     NaN
# 4            4    Diana       NYC       NaN     NaN

Right Join

All rows from right, matching rows from left:

result = pd.merge(customers, orders, on="customer_id", how="right")
#    customer_id     name   city  order_id  amount
# 0          1.0    Alice    NYC       101     100
# 1          2.0      Bob     LA       102     200
# 2          2.0      Bob     LA       103     150
# 3          5.0      NaN    NaN       104     300

Outer Join

All rows from both DataFrames:

result = pd.merge(customers, orders, on="customer_id", how="outer")
#    customer_id     name      city  order_id  amount
# 0          1.0    Alice       NYC     101.0   100.0
# 1          2.0      Bob        LA     102.0   200.0
# 2          2.0      Bob        LA     103.0   150.0
# 3          3.0  Charlie    Chicago       NaN     NaN
# 4          4.0    Diana       NYC       NaN     NaN
# 5          5.0      NaN       NaN     104.0   300.0

Merge on Different Column Names

customers = pd.DataFrame({
    "cust_id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
})

orders = pd.DataFrame({
    "order_id": [101, 102, 103],
    "customer_id": [1, 2, 4],
    "amount": [100, 200, 300],
})

result = pd.merge(
    customers, orders,
    left_on="cust_id",
    right_on="customer_id",
    how="left",
)

Join — Merge on Index

df1 = pd.DataFrame({"name": ["Alice", "Bob"]}, index=[1, 2])
df2 = pd.DataFrame({"age": [25, 30]}, index=[1, 2])

result = df1.join(df2)
#     name  age
# 1  Alice   25
# 2    Bob   30

Handling Overlapping Columns

# If both DataFrames have the same column name
df1 = pd.DataFrame({"id": [1, 2], "value": [10, 20]})
df2 = pd.DataFrame({"id": [1, 2], "value": [30, 40]})

# Use suffixes
result = pd.merge(df1, df2, on="id", suffixes=("_left", "_right"))
#    id  value_left  value_right
# 0   1          10           30
# 1   2          20           40

Comparison Table

Method	What It Does	SQL Equivalent
`concat(axis=0)`	Stack rows	`UNION ALL`
`concat(axis=1)`	Stack columns	—
`merge(how="inner")`	Only matching keys	`INNER JOIN`
`merge(how="left")`	Keep all left rows	`LEFT JOIN`
`merge(how="right")`	Keep all right rows	`RIGHT JOIN`
`merge(how="outer")`	Keep all rows	`FULL OUTER JOIN`
`join()`	Merge on index	—

Practice

# 1. Create two DataFrames with a shared key column
# 2. Perform inner, left, right, and outer joins
# 3. Observe which rows are kept and which get NaN
# 4. Use concat to combine monthly data
# 5. Handle overlapping column names with suffixes

← Previous Pandas GroupBy and Aggregation — Analyze Data by Categories Next → NumPy Tutorial — Arrays and Operations

Built by the developers of DodaTech

Doda Browser, DodaZIP & Durga Antivirus Pro

Home Browse Data Science