Databricks Auto-Scaling Not Working Fix

DodaTech Updated 2026-06-24 3 min read

This guide explains why a Databricks cluster with auto-scaling enabled can stay at its minimum worker count under load, and walks through the configuration and workload checks that fix it.

Your Databricks cluster has auto-scaling enabled but sits at minimum workers during heavy workloads:

Cluster: 2-10 workers
Current: 2 workers
CPU Usage: 95% on all 2 workers (Not scaling up!)

Auto-scaling decisions are driven by the backlog of pending Spark tasks and by executor load, not by CPU usage alone. The cluster does not scale up if the workload does not produce enough pending tasks, or if the Spark configuration caps the number of executors or cores.

Step-by-Step Fix

1. Check Spark configuration limits

WRONG — Spark limits prevent additional executors:

from pyspark.sql import SparkSession

# Hard limit of 2 executors, with dynamic allocation disabled
spark = SparkSession.builder \
    .config("spark.executor.instances", "2") \
    .config("spark.dynamicAllocation.enabled", "false") \
    .getOrCreate()

RIGHT — enable dynamic allocation:

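from pyspark.sql import SparkSession

# In a Databricks notebook the session already exists, so these keys must be
# set in the cluster's Spark config; the builder only applies to a new session.
spark = SparkSession.builder \
    .config("spark.dynamicAllocation.enabled", "true") \
    .config("spark.dynamicAllocation.minExecutors", "2") \
    .config("spark.dynamicAllocation.maxExecutors", "10") \
    .config("spark.dynamicAllocation.initialExecutors", "2") \
    .getOrCreate()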
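To check an existing configuration for the problems described in this step, a small helper can scan the settings for values that pin the executor count. This is a minimal sketch in plain Python: the function name `find_scaling_blockers` is hypothetical, and it treats a missing `spark.dynamicAllocation.enabled` key as `false`, which is the open-source Spark default.

```python
from typing import Dict, List


def find_scaling_blockers(conf: Dict[str, str]) -> List[str]:
    """Return warnings for settings that can pin the executor count."""
    warnings = []
    enabled = conf.get("spark.dynamicAllocation.enabled", "false").lower() == "true"
    if not enabled:
        warnings.append("spark.dynamicAllocation.enabled is not 'true'")
    if "spark.executor.instances" in conf and not enabled:
        warnings.append(
            "spark.executor.instances=%s fixes the executor count"
            % conf["spark.executor.instances"]
        )
    min_e = int(conf.get("spark.dynamicAllocation.minExecutors", "0"))
    max_e = conf.get("spark.dynamicAllocation.maxExecutors")
    if max_e is not None and int(max_e) <= min_e:
        warnings.append(
            "maxExecutors (%s) leaves no room above minExecutors (%d)" % (max_e, min_e)
        )
    return warnings


# The two configurations shown above, written as plain dicts.
pinned = {"spark.executor.instances": "2", "spark.dynamicAllocation.enabled": "false"}
elastic = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "2",
    "spark.dynamicAllocation.maxExecutors": "10",
}
print(find_scaling_blockers(pinned))   # two warnings
print(find_scaling_blockers(elastic))  # -> []
```

On a running cluster, the input dict can be built from the live session with `dict(spark.sparkContext.getConf().getAll())`.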

2. Set appropriate scaling parameters

WRONG — default settings may be too conservative:

from pyspark.sql import SparkSession

# schedulerBacklogTimeout: how long tasks wait in the queue before scale-up is requested
# executorIdleTimeout: how long an executor stays idle before it is released
spark = SparkSession.builder \
    .config("spark.dynamicAllocation.enabled", "true") \
    .config("spark.dynamicAllocation.schedulerBacklogTimeout", "1s") \
    .config("spark.dynamicAllocation.sustainedSchedulerBacklogTimeout", "5s") \
    .config("spark.dynamicAllocation.executorIdleTimeout", "60s") \
    .getOrCreate()

RIGHT — tune for faster scaling:

# Only executorIdleTimeout changes (60s -> 30s): idle executors are released
# sooner. The backlog timeouts, which govern scale-up, stay the same.
spark = SparkSession.builder \
    .config("spark.dynamicAllocation.enabled", "true") \
    .config("spark.dynamicAllocation.schedulerBacklogTimeout", "1s") \
    .config("spark.dynamicAllocation.sustainedSchedulerBacklogTimeout", "5s") \
    .config("spark.dynamicAllocation.executorIdleTimeout", "30s") \
    .getOrCreate()
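To see what the two backlog timeouts mean in practice, the sketch below models how Spark's dynamic allocation ramps up while a backlog persists. It is a simplified model, not Spark's scheduler: it assumes the first request fires after `schedulerBacklogTimeout`, later requests fire every `sustainedSchedulerBacklogTimeout`, and each round asks for twice as many executors as the previous one (1, 2, 4, ...). It ignores node provisioning time and the cap Spark applies based on how many tasks are actually pending. The function name `ramp_up_schedule` is hypothetical.

```python
from typing import List, Tuple


def ramp_up_schedule(
    current: int,
    max_executors: int,
    backlog_timeout_s: float,
    sustained_timeout_s: float,
) -> List[Tuple[float, int]]:
    """Return (seconds elapsed, executor count) after each request round."""
    schedule = []
    elapsed = backlog_timeout_s
    step = 1  # executors requested in the current round
    while current < max_executors:
        current = min(current + step, max_executors)
        schedule.append((elapsed, current))
        elapsed += sustained_timeout_s
        step *= 2
    return schedule


print(ramp_up_schedule(2, 10, backlog_timeout_s=1, sustained_timeout_s=5))
# -> [(1, 3), (6, 5), (11, 9), (16, 10)]
```

Under this model, a 2-10 range with the timeouts above needs about 16 seconds of sustained backlog to request the full range, so short bursts of work may finish before scale-up completes.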

3. Check auto-scaling in Databricks cluster config

WRONG — auto-scaling disabled:

Cluster > Edit > Worker type: Fixed (not Auto-scaling)

RIGHT — enable auto-scaling:

Cluster > Edit > Worker type: Auto-scaling
Min Workers: 2
Max Workers: 10

Databricks auto-scaling is different from Spark's internal dynamic allocation. Databricks scales the cluster nodes up and down at the infrastructure level.
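The same setting can be checked outside the UI. In the JSON used by the Databricks Clusters API, a fixed-size cluster carries `num_workers`, while an auto-scaling cluster carries an `autoscale` object with `min_workers` and `max_workers`. The sketch below inspects such a spec as a plain dict; the helper name `describe_worker_sizing` and the cluster name `etl` are hypothetical.

```python
from typing import Any, Dict


def describe_worker_sizing(cluster_spec: Dict[str, Any]) -> str:
    """Summarize whether a cluster spec is fixed-size or uses an autoscale range."""
    autoscale = cluster_spec.get("autoscale")
    if autoscale:
        lo, hi = autoscale["min_workers"], autoscale["max_workers"]
        if lo >= hi:
            return "autoscale range is empty (min_workers >= max_workers)"
        return "autoscaling between %d and %d workers" % (lo, hi)
    if "num_workers" in cluster_spec:
        return "fixed at %d workers" % cluster_spec["num_workers"]
    return "no worker sizing found"


fixed = {"cluster_name": "etl", "num_workers": 2}
elastic = {"cluster_name": "etl", "autoscale": {"min_workers": 2, "max_workers": 10}}

print(describe_worker_sizing(fixed))    # -> fixed at 2 workers
print(describe_worker_sizing(elastic))  # -> autoscaling between 2 and 10 workers
```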

4. Rebalance data for parallel processing

WRONG — data is skewed, causing low parallelism:

df.groupBy("skewed_column").count()  # One partition does all the work

RIGHT — repartition and use salting:

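from pyspark.sql.functions import col, concat_ws, rand

# Increase partition count
df = df.repartition(200)

# Handle skew with salting: append a random bucket (0-9) to the skewed key.
# The "_" separator keeps keys such as "a1" + "2" and "a" + "12" distinct.
salted = df.withColumn("salt", (rand() * 10).cast("int"))
salted = salted.withColumn(
    "salted_key", concat_ws("_", col("skewed_column"), col("salt").cast("string"))
)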
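Salting only helps if the aggregation then runs in two stages: first per salted key, then per original key. The sketch below models that idea in plain Python rather than Spark, so it runs anywhere. The input data and the constant `NUM_SALTS` are made up for illustration.

```python
import random
from collections import Counter

NUM_SALTS = 10
rng = random.Random(0)  # seeded so the run is repeatable

# Hypothetical skewed input: one hot key dominates.
keys = ["hot"] * 1000 + ["a"] * 10 + ["b"] * 10

# Stage 1: count per (key, salt). The hot key is spread over up to 10 groups.
partial = Counter((k, rng.randrange(NUM_SALTS)) for k in keys)

# Stage 2: drop the salt and sum the partial counts to recover the true totals.
final = Counter()
for (key, _salt), n in partial.items():
    final[key] += n

largest_unsalted = max(Counter(keys).values())  # size of the biggest group without salting
largest_salted = max(partial.values())          # size of the biggest group after salting
print(largest_unsalted, largest_salted)
print(dict(final))
```

In Spark terms, stage 1 corresponds to grouping by the salted key and stage 2 to grouping the partial results by the original key and summing them. The totals match the unsalted result, but no single task has to process the whole hot key.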

5. Monitor the Spark UI

RIGHT — check these tabs:

Spark UI > Stages:
- Are there pending tasks? If no, scaling won't trigger.
- Is there a long tail of active tasks? Data skew.
- Shuffle read/write size: Large shuffle may cause GC pauses.

Spark UI > Executors:
- Are executors idle? Then scale-down is correct behavior.
- Storage Memory: executors that hold cached data are not released, which can prevent scale-down.
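The checklist above can be read as a small decision procedure. The sketch below encodes it for three numbers read off the Spark UI: pending tasks, active tasks, and total executor cores. The function name `diagnose` and the exact thresholds are assumptions made for illustration, not a Databricks or Spark API.

```python
def diagnose(pending_tasks: int, active_tasks: int, total_cores: int) -> str:
    """Map three Spark UI readings to the most likely explanation."""
    if pending_tasks > 0:
        return "backlog present: scale-up should be requested; check worker limits if it is not"
    if active_tasks == 0:
        return "executors idle: scale-down is the expected behavior"
    if active_tasks < total_cores:
        return "few long-running tasks and no backlog: suspect data skew or too few partitions"
    return "all cores busy but nothing queued: no backlog, so scale-up will not trigger"


print(diagnose(pending_tasks=0, active_tasks=8, total_cores=8))
print(diagnose(pending_tasks=0, active_tasks=1, total_cores=8))
print(diagnose(pending_tasks=500, active_tasks=8, total_cores=8))
```

The first call matches the symptom at the top of this guide: every core is busy, yet no tasks are waiting, so there is nothing to signal that more workers are needed.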

6. Test with a forced scaling scenario

from pyspark.sql.functions import col

# Force a heavy computation to test scaling
df = spark.range(0, 100000000)
df_with_groups = df.withColumn("group", col("id") % 10000)
result = df_with_groups.groupBy("group").count()
result.collect()  # Should trigger auto-scaling

Expected output: the cluster adds workers during heavy computation and removes them during idle periods.

Prevention

  • Enable both Databricks auto-scaling and Spark dynamic allocation for elastic scaling.
  • Set minExecutors to match the expected baseline load.
  • Keep schedulerBacklogTimeout at 1-2s for responsive scaling.
  • Use repartition to ensure enough parallel tasks.
  • Monitor the Spark UI Executors tab to verify scaling behavior.

FAQ

Why does auto-scaling not scale down after a job completes?

Autoscaling has a cooldown period (typically 60-300 seconds) to prevent flapping. Cached data also keeps executors alive. Call df.unpersist() on DataFrames you no longer need, or set spark.dynamicAllocation.cachedExecutorIdleTimeout so that executors holding cached data can be released after an idle period (by default they are never released).

What's the difference between Databricks auto-scaling and Spark dynamic allocation?

Databricks auto-scaling adds/removes cluster nodes (VMs). Spark dynamic allocation adds/removes executors within existing nodes. For maximum elasticity, enable both — Databricks handles infrastructure scaling, Spark handles executor allocation within nodes.

Do spot/preemptible instances affect auto-scaling?

Yes. If using spot instances with a fallback, the cluster may scale up more slowly because spot capacity varies. Use on-demand instances for the minimum worker count and spot for the elastic range.
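On AWS, this split can be expressed in the cluster spec through `aws_attributes`. The sketch below builds only the sizing portion of a spec as a Python dict. It assumes the Clusters API fields `first_on_demand` and `availability` behave as documented for AWS workspaces, where `first_on_demand` counts the driver as well as workers. The helper name `spot_autoscale_spec` is hypothetical, and other clouds use different attribute blocks.

```python
from typing import Any, Dict


def spot_autoscale_spec(min_workers: int, max_workers: int) -> Dict[str, Any]:
    """Build sizing fields: on-demand baseline, spot for the elastic range."""
    if min_workers > max_workers:
        raise ValueError("min_workers must not exceed max_workers")
    return {
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
        "aws_attributes": {
            # Counts the driver too, so driver + minimum workers stay on-demand.
            "first_on_demand": min_workers + 1,
            # Remaining nodes use spot and fall back to on-demand if spot is unavailable.
            "availability": "SPOT_WITH_FALLBACK",
        },
    }


spec = spot_autoscale_spec(2, 10)
print(spec["aws_attributes"]["first_on_demand"])  # -> 3
```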
