Databricks Cluster Start Failure Fix

DodaTech Updated 2026-06-24 3 min read

This guide walks through diagnosing and fixing a Databricks cluster that fails to start: reading the event log, working around capacity and quota limits, correcting init scripts, and choosing a supported runtime.

Your Databricks cluster fails to start:

Cluster terminated. Reason: Cloud Provider Failure
Failed to launch nodes: Insufficient instance capacity.

Cluster failures occur when the cloud provider cannot provision the requested VM type in the specified region, the cluster's init script has errors, or the Databricks Runtime version is incompatible with the cluster configuration.

Step-by-Step Fix

1. Check the cluster's event log

WRONG — guessing the cause:

RIGHT — read the event log:

  1. Open the cluster page
  2. Click "Event Log" tab
  3. Look for the termination reason

Common termination reasons:

Cloud Provider Failure → Capacity or quota issue
Init Script Failure → Custom script has errors
Spark Exception → Configuration issue
Inactivity → Cluster auto-terminated
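
The mapping above can be turned into a small lookup that suggests a next step. This is a minimal sketch: the reason strings are the display names used in this guide, not official API codes, and the event shape (a dict with "timestamp" and an optional "reason") is a hypothetical simplification of what the event log contains.

```python
# Next steps keyed by the termination reasons listed in this guide.
NEXT_STEP = {
    "Cloud Provider Failure": "Try another instance type or zone, then check quotas",
    "Init Script Failure": "Run the init script by hand and fix the failing command",
    "Spark Exception": "Review the cluster's Spark configuration",
    "Inactivity": "No fix needed; the cluster auto-terminated as configured",
}


def latest_termination_reason(events):
    """Return the reason of the most recent event that carries one, or None.

    `events` is assumed to be a list of dicts with a numeric "timestamp"
    and an optional "reason" string (hypothetical shape).
    """
    terminations = [e for e in events if e.get("reason")]
    if not terminations:
        return None
    return max(terminations, key=lambda e: e["timestamp"])["reason"]


def suggest(events):
    """Map the latest termination reason to a suggested next step."""
    reason = latest_termination_reason(events)
    if reason is None:
        return "No termination recorded"
    return NEXT_STEP.get(reason, "Unrecognized reason; read the full event message")


sample = [
    {"timestamp": 100, "reason": "Inactivity"},
    {"timestamp": 250, "reason": "Cloud Provider Failure"},
    {"timestamp": 200},  # an event that is not a termination
]
print(suggest(sample))
```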

2. Fix cloud provider capacity issues

WRONG — repeatedly trying the same instance type:

RIGHT — change instance type or region:

  1. Edit the cluster
  2. Change Worker Type to a different VM family (e.g., from m5d.xlarge to r5.xlarge)
  3. Or use the same type in a different availability zone
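
If clusters are created from a script, the same idea can be automated by trying a list of candidate instance types in order. The sketch below is illustrative only: `launch` is a caller-supplied function and `CapacityError` is a hypothetical exception, neither of which is part of any Databricks library.

```python
class CapacityError(Exception):
    """Raised by the (hypothetical) launcher when the provider has no capacity."""


def launch_with_fallback(node_types, launch):
    """Try each node type in order and return the first one that launches.

    `launch` takes a node type string and raises CapacityError when the
    cloud provider cannot supply that type.
    """
    failures = []
    for node_type in node_types:
        try:
            launch(node_type)
            return node_type
        except CapacityError as err:
            failures.append(f"{node_type}: {err}")
    raise RuntimeError("No capacity for any candidate: " + "; ".join(failures))


# Stand-in launcher for demonstration: pretends m5d.xlarge is out of capacity.
def fake_launch(node_type):
    if node_type == "m5d.xlarge":
        raise CapacityError("Insufficient instance capacity")


chosen = launch_with_fallback(["m5d.xlarge", "r5.xlarge"], fake_launch)
print(chosen)  # r5.xlarge
```

Ordering the list from preferred to least preferred keeps cost predictable while still giving the launch a way out when one VM family is exhausted.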

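Spot instances reduce cost, but they do not fix a capacity shortage:

Cluster > Advanced Options > Instances > Spot: Yes
Spot capacity is reclaimed more often than on-demand capacity, so spot-only clusters fail to start or lose nodes more frequently. If you enable spot, also enable fallback to on-demand so the cluster can still launch when spot capacity runs out.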
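
For clusters defined through the API on AWS, spot with on-demand fallback is expressed in the `aws_attributes` block. The following is a sketch under the assumption that the field names `availability`, `first_on_demand`, and `zone_id` match the current Clusters API; verify them against the official reference before relying on this. The helper name `with_spot_fallback` is made up for this example.

```python
import json


def with_spot_fallback(cluster_spec, on_demand_nodes=1):
    """Return a copy of cluster_spec that uses spot with on-demand fallback."""
    spec = dict(cluster_spec)
    spec["aws_attributes"] = {
        # Use spot, but fall back to on-demand when spot is unavailable.
        "availability": "SPOT_WITH_FALLBACK",
        # Place the first node(s), including the driver, on on-demand capacity.
        "first_on_demand": on_demand_nodes,
        # Let Databricks choose an availability zone.
        "zone_id": "auto",
    }
    return spec


base = {"cluster_name": "my-cluster", "node_type_id": "r5.xlarge", "num_workers": 5}
print(json.dumps(with_spot_fallback(base), indent=2))
```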

3. Fix init script errors

WRONG — init script has errors that crash the cluster:

RIGHT — test and correct the init script:

#!/bin/bash
# Init script example (the shebang must be the first line; the script must exit 0)
set -e  # Exit on the first failing command

# Install a library
/databricks/python/bin/pip install requests

# Log success
echo "Init script completed" >> /tmp/init.log
exit 0

To debug init scripts:

  1. Create a small test cluster without the init script
  2. Run the script manually from a notebook cell: %sh bash /dbfs/init_scripts/test.sh
  3. Fix errors
  4. Add the script back to the cluster
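
When testing the script by hand, the two things worth capturing are its exit code and the last lines of its error output. A minimal sketch using Python's standard `subprocess` module follows; the function name `check_init_script` is hypothetical, and it assumes `bash` is available on the machine where you run it.

```python
import subprocess


def check_init_script(path, tail_lines=20):
    """Run the script with bash and return (exit_code, tail of stderr)."""
    result = subprocess.run(["bash", path], capture_output=True, text=True)
    stderr_tail = "\n".join(result.stderr.splitlines()[-tail_lines:])
    return result.returncode, stderr_tail


# Example usage (path taken from the steps above):
#   code, err = check_init_script("/dbfs/init_scripts/test.sh")
#   if code != 0:
#       print(f"Init script failed with exit code {code}:\n{err}")
```

A non-zero exit code here is the same condition that makes the cluster terminate with an init script failure, so fixing it on the test cluster first avoids a failed production start.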

4. Increase cluster timeout

WRONG — default 5-minute timeout too short:

RIGHT — increase creation timeout:

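# Via API. custom_tags only label cloud resources; they do not set a timeout.
# The wait limit belongs to the client that polls for the RUNNING state.
import os
import requests
host = os.environ["DATABRICKS_HOST"]    # workspace URL (placeholder variable name)
token = os.environ["DATABRICKS_TOKEN"]  # access token (placeholder variable name)
data = {
    "cluster_name": "my-cluster",
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 5,
}
resp = requests.post(f"{host}/api/2.0/clusters/create",
                     headers={"Authorization": f"Bearer {token}"},
                     json=data, timeout=60)
resp.raise_for_status()
cluster_id = resp.json()["cluster_id"]  # poll this ID for up to 30 minutes

If a tool such as Terraform or the Databricks SDK creates the cluster for you, raise that tool's create or wait timeout (for example to 30 minutes) rather than editing the cluster spec.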
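
One way to bound the wait on the client side is a polling loop with a deadline. The sketch below assumes a caller-supplied `get_state` function that returns the cluster state as a string such as "PENDING", "RUNNING", "TERMINATED", or "ERROR"; how you obtain that state (API call, SDK, CLI) is left open, and the function name `wait_until_running` is made up for this example.

```python
import time


def wait_until_running(get_state, timeout_s=1800, poll_s=30,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll get_state() until the cluster is RUNNING or the deadline passes."""
    deadline = clock() + timeout_s
    while True:
        state = get_state()
        if state == "RUNNING":
            return state
        if state in ("TERMINATED", "ERROR"):
            # The start failed outright; waiting longer will not help.
            raise RuntimeError(f"Cluster stopped in state {state}")
        if clock() >= deadline:
            raise TimeoutError(f"Cluster still {state} after {timeout_s} s")
        sleep(poll_s)


# Demonstration with a scripted state sequence instead of a real API call.
states = iter(["PENDING", "PENDING", "RUNNING"])
print(wait_until_running(lambda: next(states), sleep=lambda s: None))
```

Passing `sleep` and `clock` as parameters keeps the loop easy to test without real waiting; in normal use the defaults apply and the call blocks for at most about 30 minutes.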

5. Check Databricks Runtime version

WRONG — using a runtime version that doesn't support your config:

RIGHT — use a supported runtime:

Runtime: 13.3 LTS (long-term support, stable)
Runtime: 14.3 LTS (newer features)
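Runtime: 15.0 (non-LTS, newest of the three, may have compatibility issues)

If the cluster fails to start on a non-LTS runtime, switch to an LTS release such as 13.3 LTS or 14.3 LTS and retry. When moving between LTS versions, check that your libraries and init scripts support the new version.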
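
If you select runtimes programmatically, filtering for LTS releases is a simple string check. This sketch assumes you already have a list of versions as dicts with a "key" (the `spark_version` value) and a display "name"; that shape and the sample entries are illustrative, not output copied from a workspace.

```python
def lts_versions(versions):
    """Return the keys of runtime versions whose display name mentions LTS."""
    return [v["key"] for v in versions if "LTS" in v["name"]]


# Illustrative sample data, modeled on the versions named in this guide.
sample_versions = [
    {"key": "13.3.x-scala2.12", "name": "13.3 LTS"},
    {"key": "14.3.x-scala2.12", "name": "14.3 LTS"},
    {"key": "15.0.x-scala2.12", "name": "15.0"},
]
print(lts_versions(sample_versions))
```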

6. Verify cloud resource quotas

WRONG — hitting instance quota limits:

RIGHT — request quota increase:

  • AWS: Service Quotas console > EC2 > Running On-Demand instances > Request increase
  • Azure: Subscription > Usage + quotas > Request increase
  • GCP: IAM & Admin > Quotas > Compute Engine API > Increase
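
Before requesting an increase, it helps to estimate how much quota the cluster needs. The arithmetic below assumes the quota is counted in vCPUs and that each i3.xlarge node has 4 vCPUs; the quota and in-use figures are made-up example numbers, so substitute the values shown in your cloud console.

```python
def vcpus_needed(num_workers, vcpus_per_worker, vcpus_for_driver):
    """Total vCPUs the cluster requests: all workers plus the driver."""
    return num_workers * vcpus_per_worker + vcpus_for_driver


def fits_quota(needed, quota, in_use=0):
    """True if the new cluster fits within the remaining quota."""
    return in_use + needed <= quota


# 5 workers plus 1 driver, 4 vCPUs each.
needed = vcpus_needed(num_workers=5, vcpus_per_worker=4, vcpus_for_driver=4)
print(needed)                                   # 24
print(fits_quota(needed, quota=32, in_use=16))  # False: 16 + 24 exceeds 32
```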

Expected output: cluster starts with "Running" status.

Prevention

  • Use multiple instance types in the cluster policy for fallback.
  • Test init scripts on a standalone cluster before production use.
  • Use LTS Databricks Runtime versions for stability.
  • Monitor cluster start times and set appropriate timeouts.
  • Request quota increases before launching large clusters.
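
As an illustration of the first point, a cluster policy can allow several instance types so that users can switch when one is unavailable. The definition below is a sketch that assumes the policy attribute types `allowlist` and `regex` and the `defaultValue` key behave as described in the cluster policy documentation; the instance types are the ones mentioned in this guide. Note that an allowlist permits the alternatives but does not retry with them automatically.

```python
import json

policy = {
    # Allow three VM families so a capacity failure on one can be worked around.
    "node_type_id": {
        "type": "allowlist",
        "values": ["m5d.xlarge", "r5.xlarge", "i3.xlarge"],
        "defaultValue": "m5d.xlarge",
    },
    # Restrict runtimes to the two LTS releases named in this guide.
    "spark_version": {
        "type": "regex",
        "pattern": r"1[34]\.3\.x-scala2\.12",
    },
}
print(json.dumps(policy, indent=2))
```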

Common Mistakes with Cluster Start Failures

  1. Retrying the same instance type in the same zone after an "Insufficient instance capacity" error instead of switching type or zone
  2. Attaching an untested init script to a production cluster, so that any failing command terminates the cluster at startup
  3. Launching a large cluster without first checking the cloud account's instance quota

Each of these mistakes corresponds to one of the fix steps in this guide.

Practice Exercise

On a small test cluster, attach an init script that deliberately exits with a non-zero code, confirm that the Event Log reports an init script failure, then fix the script and restart the cluster.

This exercise reinforces the concepts covered in this guide. Try it in a non-production workspace before checking online solutions.

FAQ

Why does my Databricks cluster fail during auto-scaling?

Auto-scaling requests new nodes from the cloud provider. If the provider is out of capacity for the instance type, the scale-up fails. On AWS, use spot instances with fallback to on-demand, or switch to an instance type that is in less demand in your region.

What's the difference between cluster termination and failure?

Termination is expected (auto-termination due to inactivity, or manual stop). Failure means the cluster could not start or was terminated by the cloud provider. Check the Termination Reason in the Event Log to distinguish.

How do I debug cluster startup issues?

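Start with the Event Log, which is available even when the cluster never reached the Running state. If the driver came up, open the cluster's "Driver Logs" from the cluster page, and check /databricks/init_scripts/ for init script logs. Once a cluster is running, %sh dmesg | tail -20 in a notebook shows recent system-level errors.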
