Apache Spark Out of Memory Fix
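Your Spark job fails with java.lang.OutOfMemoryError: Java heap space or Container killed by YARN for exceeding memory limits. The first error means an executor's JVM heap filled up; the second means the whole container (heap plus off-heap overhead) grew past what YARN granted. Typical causes are oversized partitions, data skew, and inefficient serialization.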
Step-by-Step Fix
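1. Check the current memory configuration
Start from the settings the job is submitted with (the Environment tab of the Spark UI shows the effective values). For example, this submission gives each executor and the driver a 2g heap:
spark-submit --conf spark.executor.memory=2g --conf spark.driver.memory=2g my_job.py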
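To see what a running session actually uses rather than what was requested, one option is to filter the session's configuration for memory-related keys. The following is a minimal sketch: `memory_settings` is a hypothetical helper name and the sample pairs are made up for illustration. In a live job the pairs would come from `spark.sparkContext.getConf().getAll()`, which returns a list of (key, value) tuples.

```python
def memory_settings(conf_pairs):
    """Return the memory-related entries from a list of (key, value) pairs."""
    return {k: v for k, v in conf_pairs if "memory" in k.lower()}


# Illustrative pairs; in a real job use spark.sparkContext.getConf().getAll()
sample_conf = [
    ("spark.app.name", "myapp"),
    ("spark.executor.memory", "2g"),
    ("spark.driver.memory", "2g"),
    ("spark.executor.memoryOverhead", "512m"),
]

for key, value in sorted(memory_settings(sample_conf).items()):
    print(f"{key} = {value}")
```

Keys that are missing from the result were never set explicitly, so Spark's defaults apply to them.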
2. Increase executor memory
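from pyspark.sql import SparkSession
# Wrong: relies on the default executor memory (1g), which is too low for large datasets
spark = SparkSession.builder.appName("myapp").getOrCreate()
# Right: allocate more heap, plus overhead for off-heap usage
# Note: in client mode the driver JVM is already running when this code executes,
# so pass driver memory via spark-submit --driver-memory or spark-defaults.conf instead.
spark = SparkSession.builder \
    .appName("myapp") \
    .config("spark.executor.memory", "4g") \
    .config("spark.executor.memoryOverhead", "1g") \
    .config("spark.driver.memory", "4g") \
    .getOrCreate()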
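The YARN error is about the whole container, not just the heap, so it helps to work out how much memory each executor really requests. When spark.executor.memoryOverhead is not set, Spark's documented default is 10% of the executor memory with a 384 MiB floor. The sketch below applies that rule; `container_request_mib` is a hypothetical helper name, and it deliberately ignores other components such as explicitly configured off-heap or PySpark worker memory.

```python
def container_request_mib(executor_memory_mib, overhead_mib=None):
    """Approximate memory requested per executor container, in MiB."""
    if overhead_mib is None:
        # Documented default: 10% of executor memory, but at least 384 MiB
        overhead_mib = max(384, int(executor_memory_mib * 0.10))
    return executor_memory_mib + overhead_mib


print(container_request_mib(2048))        # 2g heap, default overhead
print(container_request_mib(4096))        # 4g heap, default overhead
print(container_request_mib(4096, 1024))  # 4g heap, explicit 1g overhead
```

If the result is larger than the biggest container YARN is allowed to hand out, the request cannot be satisfied no matter how the job is written.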
3. Handle data skew with salting
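# Wrong: shuffle on a skewed key
df.groupBy("city").count().show()
# Right: add a salt to distribute the load, then merge the partial counts
from pyspark.sql.functions import col, lit, concat, rand, sum as sum_
salted = df.withColumn("salted_key",
    concat(col("city"), lit("_"), (rand() * 10).cast("int").cast("string")))
# Stage 1: partial counts per (city, salt); Stage 2: one total per city
partial = salted.groupBy("city", "salted_key").count()
counts = partial.groupBy("city").agg(sum_("count").alias("count"))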
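Salting only pays off if the partial results are merged again afterwards, which is easy to forget. The plain-Python sketch below simulates the idea on made-up data (the city names and row counts are assumptions for illustration): stage 1 counts per (key, salt) pair, stage 2 drops the salt and sums the partial counts.

```python
import random
from collections import Counter

rng = random.Random(0)   # fixed seed so the run is repeatable
NUM_SALTS = 10           # same salt range as in the step above

# Skewed input: one hot key and two small ones (made-up data)
cities = ["NYC"] * 1000 + ["Oslo"] * 20 + ["Lima"] * 10

# Stage 1: count per (city, salt); each pair could be handled by a different task
partial = Counter((city, rng.randrange(NUM_SALTS)) for city in cities)

# Stage 2: drop the salt and add the partial counts back together
totals = Counter()
for (city, _salt), n in partial.items():
    totals[city] += n

print("largest group without salting:", max(Counter(cities).values()))
print("largest group with salting:   ", max(partial.values()))
print("final counts:", dict(totals))
```

The final counts match an unsalted count, while the largest single group any one task has to hold is much smaller.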
4. Use broadcast joins for small tables
# Wrong: can shuffle both tables when Spark does not detect that small_df is small
result = large_df.join(small_df, "key")
# Right: broadcast the small table so large_df is joined in place
from pyspark.sql.functions import broadcast
result = large_df.join(broadcast(small_df), "key")
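Conceptually, a broadcast join copies the small table to every executor and uses it as an in-memory lookup while the large table is streamed past it. The plain-Python sketch below illustrates that mechanism only; `broadcast_hash_join` and the sample rows are hypothetical and are not Spark's implementation.

```python
def broadcast_hash_join(large_rows, small_rows, key):
    """Inner-join two lists of dicts by building a lookup table from the small side."""
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)

    joined = []
    for row in large_rows:  # the large side is only streamed, never regrouped
        for match in lookup.get(row[key], []):
            joined.append({**match, **row})
    return joined


orders = [{"key": 1, "amount": 30}, {"key": 2, "amount": 15}, {"key": 3, "amount": 99}]
regions = [{"key": 1, "region": "EU"}, {"key": 2, "region": "US"}]

for row in broadcast_hash_join(orders, regions, "key"):
    print(row)
```

The lookup table has to fit in memory on the driver and on every executor, so broadcasting a table that is not actually small can itself trigger an out-of-memory error.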
Common Mistakes
| Mistake | Fix |
|---|---|
| Too few partitions | Repartition with df.repartition(200) |
| Too many partitions causing overhead | Coalesce with df.coalesce(50) |
| Using groupBy on a highly skewed column | Use salting or bucketing to distribute data |
| Kryo serialization not enabled | Set spark.serializer=org.apache.spark.serializer.KryoSerializer |
| Disk spill due to insufficient memory | Increase spark.executor.memory or tune spark.memory.fraction (spark.shuffle.memoryFraction is a legacy setting that the unified memory manager in Spark 1.6+ ignores) |
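The fixed numbers in the table (200 and 50) are only starting points. One common way to pick a partition count is to divide the input size by a target partition size. The sketch below assumes a target of 128 MiB per partition, which is a rule of thumb and not a value from this guide; `suggested_partitions` is a hypothetical helper name.

```python
import math


def suggested_partitions(total_bytes, target_partition_bytes=128 * 1024**2):
    """Number of partitions so that each holds roughly target_partition_bytes."""
    return max(1, math.ceil(total_bytes / target_partition_bytes))


print(suggested_partitions(50 * 1024**3))  # 50 GiB of input
print(suggested_partitions(10 * 1024**2))  # 10 MiB of input
```

The result can be passed to df.repartition(...) when it is higher than the current partition count, or to df.coalesce(...) when it is lower.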
Prevention
- Monitor the Storage and Executors tabs of the Spark UI for memory usage.
- Use column pruning and filter pushdown to reduce data volume.
- Choose an appropriate cluster shape: 5 executors with 4g each usually work better than 1 executor with 20g, because smaller heaps tend to keep garbage-collection pauses short.
- Set spark.dynamicAllocation.enabled=true for variable workloads.
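The settings from this guide can be collected in one place and rendered as a spark-submit command. The sketch below is one way to do that; `spark_submit_args` is a hypothetical helper, and the executor limits are illustrative values. It also enables shuffle tracking, because dynamic allocation needs shuffle data to remain available when executors are removed (on Spark 3.0+ shuffle tracking is one option, the external shuffle service is another).

```python
def spark_submit_args(app, conf):
    """Render a conf dict as a spark-submit argument list."""
    args = ["spark-submit"]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args + [app]


conf = {
    "spark.executor.memory": "4g",
    "spark.executor.memoryOverhead": "1g",
    "spark.dynamicAllocation.enabled": "true",
    # Keeps shuffle data reachable when idle executors are released
    "spark.dynamicAllocation.shuffleTracking.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "1",
    "spark.dynamicAllocation.maxExecutors": "5",  # illustrative cap
}

print(" ".join(spark_submit_args("my_job.py", conf)))
```

Keeping the configuration in a dict makes it easy to review memory settings alongside the allocation limits before a job is submitted.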