Databricks Auto-Scaling Not Working Fix
In this tutorial, you'll learn why Databricks auto-scaling can fail to add workers and how to fix it. We cover key concepts, practical examples, and best practices.
Your Databricks cluster has auto-scaling enabled but sits at minimum workers during heavy workloads:
Cluster: 2-10 workers
Current: 2 workers
CPU Usage: 95% on all 2 workers (Not scaling up!)
Auto-scaling decisions are driven by the backlog of pending Spark tasks and by executor load, not by CPU usage alone. The cluster doesn't scale up if the workload doesn't generate enough pending tasks, or if the cluster's Spark configuration caps the number of executors or cores.
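To make the "pending tasks" point concrete, here is a minimal sketch of the sizing arithmetic, assuming a simplified model in which the number of executors worth having is the number of running plus pending tasks divided by the task slots per executor. The function name and the 4-core executors are hypothetical, and real schedulers apply extra timeouts and ratios on top of this.

```python
import math

def target_executors(pending_tasks, running_tasks, cores_per_executor,
                     min_executors, max_executors, cpus_per_task=1):
    """Simplified model: how many executors the current task load justifies."""
    # Each executor can run this many tasks at once.
    slots_per_executor = max(1, cores_per_executor // cpus_per_task)
    needed = math.ceil((pending_tasks + running_tasks) / slots_per_executor)
    # Clamp to the configured range.
    return max(min_executors, min(max_executors, needed))

# 2 executors x 4 cores (assumed), every slot busy, nothing queued:
# CPU can read 95% while the model still asks for only 2 executors.
print(target_executors(pending_tasks=0, running_tasks=8, cores_per_executor=4,
                       min_executors=2, max_executors=10))  # 2

# Same cluster with 40 tasks waiting: the load justifies the maximum.
print(target_executors(pending_tasks=40, running_tasks=8, cores_per_executor=4,
                       min_executors=2, max_executors=10))  # 10
```

In this model, high CPU with an empty queue never raises the target, which matches the symptom described above.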
Step-by-Step Fix
1. Check Spark configuration limits
WRONG — Spark limits prevent additional executors:
from pyspark.sql import SparkSession
spark = (
    SparkSession.builder
    .config("spark.executor.instances", "2")  # Hard limit!
    .config("spark.dynamicAllocation.enabled", "false")  # Disabled!
    .getOrCreate()
)
RIGHT — enable dynamic allocation:
spark = SparkSession.builder \
.config("spark.dynamicAllocation.enabled", "true") \
.config("spark.dynamicAllocation.minExecutors", "2") \
.config("spark.dynamicAllocation.maxExecutors", "10") \
.config("spark.dynamicAllocation.initialExecutors", "2") \
.getOrCreate()
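A quick way to catch the settings from this step is to scan the effective configuration for values that pin the executor count. The sketch below works on a plain dictionary of Spark properties; the function name `find_scaling_blockers` is hypothetical. Note also that these properties are read when the Spark application starts, so on a running cluster they likely need to be set in the cluster's Spark config rather than in notebook code.

```python
def find_scaling_blockers(conf):
    """Return warnings for Spark settings that can pin the executor count."""
    warnings = []
    enabled = conf.get("spark.dynamicAllocation.enabled", "false").lower() == "true"
    if not enabled:
        warnings.append("dynamic allocation is disabled")
        # With dynamic allocation off, this property fixes the executor count.
        if "spark.executor.instances" in conf:
            warnings.append(
                "executor count is fixed at " + conf["spark.executor.instances"]
            )
    lo = conf.get("spark.dynamicAllocation.minExecutors")
    hi = conf.get("spark.dynamicAllocation.maxExecutors")
    if lo is not None and hi is not None and int(lo) >= int(hi):
        warnings.append("minExecutors >= maxExecutors leaves no room to scale")
    return warnings

wrong = {
    "spark.executor.instances": "2",
    "spark.dynamicAllocation.enabled": "false",
}
right = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "2",
    "spark.dynamicAllocation.maxExecutors": "10",
}
print(find_scaling_blockers(wrong))
print(find_scaling_blockers(right))  # []
```

In a notebook, the dictionary could be built from the live session with `dict(spark.sparkContext.getConf().getAll())`.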
2. Set appropriate scaling parameters
WRONG — default settings may be too conservative:
from pyspark.sql import SparkSession
spark = (
    SparkSession.builder
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.schedulerBacklogTimeout", "1s")  # How long tasks wait before scaling up
    .config("spark.dynamicAllocation.sustainedSchedulerBacklogTimeout", "5s")
    .config("spark.dynamicAllocation.executorIdleTimeout", "60s")  # How long idle before scaling down (Spark default)
    .getOrCreate()
)
RIGHT — tune for faster scaling:
spark = (
    SparkSession.builder
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.schedulerBacklogTimeout", "1s")
    .config("spark.dynamicAllocation.sustainedSchedulerBacklogTimeout", "5s")
    .config("spark.dynamicAllocation.executorIdleTimeout", "30s")  # Release idle executors after 30s instead of 60s
    .getOrCreate()
)
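To see what the two backlog timeouts mean in practice, the sketch below estimates how long Spark takes to request a given number of executors. It assumes the ramp-up described in Spark's dynamic allocation documentation: the first request fires after schedulerBacklogTimeout, later requests fire every sustainedSchedulerBacklogTimeout, and each round asks for twice as many executors as the previous one (1, 2, 4, 8, ...). The function name is hypothetical, the backlog is assumed to persist the whole time, and node provisioning time is not included.

```python
def seconds_until_requested(current, target,
                            backlog_timeout=1.0, sustained_timeout=5.0):
    """Estimate when dynamic allocation has requested `target` executors."""
    elapsed = 0.0
    batch = 1            # the first round asks for a single executor
    first_round = True
    while current < target:
        # First round waits for the backlog timeout, later rounds for the sustained one.
        elapsed += backlog_timeout if first_round else sustained_timeout
        first_round = False
        current = min(target, current + batch)
        batch *= 2       # each round doubles the request size
    return elapsed

# Settings from this step: 1s backlog timeout, 5s sustained timeout.
print(seconds_until_requested(2, 10))                          # 16.0
# Lowering the sustained timeout to 1s shortens the ramp-up.
print(seconds_until_requested(2, 10, sustained_timeout=1.0))   # 4.0
```

Under these assumptions, sustainedSchedulerBacklogTimeout dominates the time to reach the maximum, while executorIdleTimeout only affects how quickly executors are released afterwards.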
3. Check auto-scaling in Databricks cluster config
WRONG — auto-scaling disabled:
Cluster > Edit > Worker type: Fixed (not Auto-scaling)
RIGHT — enable auto-scaling:
Cluster > Edit > Worker type: Auto-scaling
Min Workers: 2
Max Workers: 10
Databricks auto-scaling is different from Spark's internal dynamic allocation. Databricks adds and removes worker nodes at the infrastructure level, while dynamic allocation adds and removes executors within a Spark application.
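The same setting is visible in a cluster's JSON definition, where an `autoscale` object with `min_workers` and `max_workers` replaces the fixed `num_workers` field. The sketch below inspects such a definition as a plain dictionary; the helper name `describe_worker_config` and the sample dictionaries are hypothetical.

```python
def describe_worker_config(cluster_spec):
    """Report whether a cluster definition uses auto-scaling or a fixed size."""
    autoscale = cluster_spec.get("autoscale")
    if autoscale:
        lo, hi = autoscale["min_workers"], autoscale["max_workers"]
        if lo >= hi:
            return f"autoscale set but min_workers ({lo}) >= max_workers ({hi})"
        return f"autoscaling between {lo} and {hi} workers"
    if "num_workers" in cluster_spec:
        return f"fixed size: {cluster_spec['num_workers']} workers"
    return "no worker settings found"

fixed = {"cluster_name": "etl", "num_workers": 2}
elastic = {"cluster_name": "etl", "autoscale": {"min_workers": 2, "max_workers": 10}}

print(describe_worker_config(fixed))    # fixed size: 2 workers
print(describe_worker_config(elastic))  # autoscaling between 2 and 10 workers
```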
4. Rebalance data for parallel processing
WRONG — data is skewed, causing low parallelism:
df.groupBy("skewed_column").count() # One partition does all the work
RIGHT — repartition and use salting:
# Increase partition count
df = df.repartition(200)
# Handle skew with salting
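from pyspark.sql.functions import col, rand, concat_ws
# Assign each row a random salt from 0 to 9
salted = df.withColumn("salt", (rand() * 10).cast("int"))
# Combine the skewed key and the salt into a new grouping key
salted = salted.withColumn("salted_key", concat_ws("_", col("skewed_column"), col("salt").cast("string")))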
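The snippet above only builds the salted key. Salting usually finishes with a two-stage aggregation: aggregate per salted key first, then combine the partial results per original key. The pure-Python sketch below illustrates that idea without Spark, using made-up data in which one "hot" key holds 90% of the rows. In PySpark the two stages would correspond to a groupBy on the salted key followed by a groupBy on the original key.

```python
import random
from collections import Counter

random.seed(0)
NUM_SALTS = 10

# Hypothetical skewed data: one key dominates.
keys = ["hot"] * 9000 + ["cold"] * 1000

# Stage 1: count per (key, salt). The hot key is spread over up to
# NUM_SALTS groups, so no single group holds all 9000 rows.
partial = Counter((key, random.randrange(NUM_SALTS)) for key in keys)

# Stage 2: drop the salt and sum the partial counts per original key.
final = Counter()
for (key, _salt), count in partial.items():
    final[key] += count

largest_group = max(partial.values())
print(final["hot"], final["cold"])   # 9000 1000
print(largest_group < 9000)          # True
```

The totals are unchanged, but the largest unit of work in stage 1 is a fraction of the hot key, which gives the scheduler more parallel tasks to distribute.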
5. Monitor the Spark UI
RIGHT — check these tabs:
Spark UI > Stages:
- Are there pending tasks? If no, scaling won't trigger.
- Is there a long tail of active tasks? Data skew.
- Shuffle read/write size: Large shuffle may cause GC pauses.
Spark UI > Executors:
- Are executors idle? Then scale-down is correct behavior.
- Storage Memory: executors holding cached data are not released by the idle timeout by default, so heavy caching can prevent scale-down.
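The Executors tab checks can also be done programmatically on the JSON that Spark's monitoring REST API returns for executors. The sketch below summarizes such a payload. It assumes the fields `id`, `isActive`, `activeTasks`, and `totalCores`; the sample records and the function name are made up for illustration.

```python
def summarize_executors(executors):
    """Summarize task slots and idle executors, ignoring the driver."""
    workers = [
        e for e in executors
        if e["id"] != "driver" and e.get("isActive", True)
    ]
    return {
        "executors": len(workers),
        "idle": [e["id"] for e in workers if e["activeTasks"] == 0],
        "busy_slots": sum(e["activeTasks"] for e in workers),
        "total_slots": sum(e["totalCores"] for e in workers),
    }

# Hypothetical payload: two fully busy 4-core executors plus the driver.
sample = [
    {"id": "driver", "isActive": True, "activeTasks": 0, "totalCores": 0},
    {"id": "0", "isActive": True, "activeTasks": 4, "totalCores": 4},
    {"id": "1", "isActive": True, "activeTasks": 4, "totalCores": 4},
]
summary = summarize_executors(sample)
print(summary)
```

If every slot is busy and the Stages tab still shows no pending tasks, the cluster is saturated but has no backlog that would trigger a scale-up.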
6. Test with a forced scaling scenario
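# Force a heavy computation to test scaling
from pyspark.sql.functions import col
# Request many partitions explicitly so tasks queue up (400 is an illustrative value)
df = spark.range(0, 100000000, numPartitions=400)
df_with_groups = df.withColumn("group", col("id") % 10000)
result = df_with_groups.groupBy("group").count()
result.collect() # Should trigger auto-scaling if the backlog lasts long enough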
Expected output: the cluster adds workers during heavy computation and removes them during idle periods.
Prevention
- Enable both Databricks auto-scaling and Spark dynamic allocation for elastic scaling.
- Set minExecutors to cover the expected baseline load.
- Tune schedulerBacklogTimeout to 1-2s for responsive scaling.
- Use repartition to ensure enough parallel tasks.
- Monitor the Spark UI Executors tab to verify scaling behavior.