Databricks Cluster Start Failure Fix
This tutorial walks through diagnosing and fixing a Databricks cluster that fails to start. It covers cloud capacity errors, init script failures, launch timeouts, runtime versions, and cloud quotas.
Your Databricks cluster fails to start:
Cluster terminated. Reason: Cloud Provider Failure
Failed to launch nodes: Insufficient instance capacity.
Cluster failures occur when the cloud provider cannot provision the requested VM type in the specified region, the cluster's init script has errors, or the Databricks Runtime version is incompatible with the cluster configuration.
Step-by-Step Fix
1. Check the cluster's event log
WRONG — guessing the cause:
RIGHT — read the event log:
- Open the cluster page
- Click "Event Log" tab
- Look for the termination reason
Common termination reasons:
Cloud Provider Failure → Capacity or quota issue
Init Script Failure → Custom script has errors
Spark Exception → Configuration issue
Inactivity → Cluster auto-terminated
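The same termination reason can also be read programmatically. The sketch below assumes the Clusters API events endpoint (`POST /api/2.0/clusters/events`) and a response in which each `TERMINATING` event carries `details.reason.code`. The code strings in `LIKELY_CAUSE` are examples matched to the list above, not a complete list, so check them against the API reference. The names `fetch_events`, `termination_reasons`, and `LIKELY_CAUSE` are illustrative.

```python
import json
import urllib.request

# Example termination codes mapped to the causes listed above (not exhaustive).
LIKELY_CAUSE = {
    "CLOUD_PROVIDER_LAUNCH_FAILURE": "Capacity or quota issue",
    "INIT_SCRIPT_FAILURE": "Custom script has errors",
    "SPARK_STARTUP_FAILURE": "Configuration issue",
    "INACTIVITY": "Cluster auto-terminated",
}


def fetch_events(host, token, cluster_id):
    """Request TERMINATING events for one cluster and return the parsed JSON."""
    body = json.dumps(
        {"cluster_id": cluster_id, "event_types": ["TERMINATING"]}
    ).encode()
    req = urllib.request.Request(
        f"{host}/api/2.0/clusters/events",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)


def termination_reasons(events_response):
    """Return (code, likely cause) pairs for each TERMINATING event."""
    reasons = []
    for event in events_response.get("events", []):
        if event.get("type") != "TERMINATING":
            continue
        code = event.get("details", {}).get("reason", {}).get("code", "UNKNOWN")
        reasons.append((code, LIKELY_CAUSE.get(code, "See the event log details")))
    return reasons
```

Keeping the parsing separate from the HTTP call means the mapping can be checked against a saved response without touching a workspace.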
2. Fix cloud provider capacity issues
WRONG — repeatedly trying the same instance type:
RIGHT — change instance type or region:
- Edit the cluster
- Change Worker Type to a different VM family (e.g., from m5d.xlarge to r5.xlarge)
- Or use the same type in a different availability zone
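Spot instances are an option for cost savings:
Cluster > Advanced Options > Instances > Spot: Yes
Spot instances are cheaper, but the cloud provider can reclaim them and spot capacity can itself run out, so they lower cost rather than fix a capacity error. Keeping the driver on-demand, or allowing fallback to on-demand, limits that risk.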
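For clusters created through the API on AWS, the spot and zone choices can be expressed in the cluster spec. This is a minimal sketch, assuming the `aws_attributes` fields `availability`, `first_on_demand`, and `zone_id` of the Clusters API; the helper name `with_spot_fallback` is hypothetical, and Azure and GCP use different attribute blocks.

```python
import copy


def with_spot_fallback(cluster_spec, on_demand_nodes=1):
    """Return a copy of cluster_spec that prefers spot but can fall back to on-demand."""
    spec = copy.deepcopy(cluster_spec)
    aws = spec.setdefault("aws_attributes", {})
    aws["availability"] = "SPOT_WITH_FALLBACK"  # use on-demand if spot is unavailable
    aws["first_on_demand"] = on_demand_nodes    # first N nodes, driver included, stay on-demand
    aws["zone_id"] = "auto"                     # let Databricks pick the availability zone
    return spec


base = {
    "cluster_name": "my-cluster",
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 5,
}
spec = with_spot_fallback(base)
print(spec["aws_attributes"])
```

The helper copies the spec instead of mutating it, so the same base definition can be reused with a different node type on a retry.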
3. Fix init script errors
WRONG — init script has errors that crash the cluster:
RIGHT — test and correct the init script:
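#!/bin/bash
# Init script example. The shebang must be the first line, and the script must exit 0.
set -e # Stop at the first failing command so errors are not silently skipped
# Install a library into the cluster's Python environment
/databricks/python/bin/pip install requests
# Log success
echo "Init script completed" >> /tmp/init.log
exit 0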
To debug init scripts:
- Create a small test cluster without the init script
- Run the script manually via a notebook:
%sh /dbfs/init_scripts/test.sh
- Fix errors
- Add the script back to the cluster
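When running the script by hand, the exit code and the last lines of output are usually what identify the failing command. The sketch below wraps that in a small Python helper that could be run from a notebook cell on the test cluster; `run_init_script` is a hypothetical name, and the script path is the one from the step above.

```python
import subprocess


def run_init_script(path, interpreter="bash", timeout_s=600):
    """Run the script and return (exit_code, last 20 lines of combined output)."""
    result = subprocess.run(
        [interpreter, path],
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    combined = (result.stdout + result.stderr).strip().splitlines()
    return result.returncode, combined[-20:]


# Example usage in a notebook cell:
# code, tail = run_init_script("/dbfs/init_scripts/test.sh")
# print(code, *tail, sep="\n")
```

A non-zero exit code here is the same condition that makes the cluster report an init script failure.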
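4. Increase cluster timeout
WRONG - giving up on the launch after a few minutes:
RIGHT - create the cluster, then wait longer on the client side:
# Via API
import os
import requests
host = os.environ["DATABRICKS_HOST"]  # workspace URL
token = os.environ["DATABRICKS_TOKEN"]  # personal access token
data = {
    "cluster_name": "my-cluster",
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 5,
}
# custom_tags only label cloud resources; a "creation_timeout" tag is not a documented launch setting
resp = requests.post(f"{host}/api/2.0/clusters/create",
                     headers={"Authorization": f"Bearer {token}"}, json=data, timeout=60)
resp.raise_for_status()
print(resp.json()["cluster_id"])  # poll this cluster's state until RUNNING, for up to 30 min
In the UI, the "Terminate after ... minutes of inactivity" setting controls idle shutdown (the Inactivity reason above), not how long a launch may take.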
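When a cluster is created from a script, one place a timeout applies is the client's own wait loop. The sketch below polls until the cluster reports `RUNNING` or a deadline passes. It assumes a caller-supplied `get_state` function (hypothetical) that returns the `state` field from `GET /api/2.0/clusters/get`; the 30-minute default mirrors the value used above.

```python
import time


def wait_until_running(get_state, timeout_s=1800, poll_s=30,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll get_state() until it returns RUNNING, fails, or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        state = get_state()
        if state == "RUNNING":
            return state
        if state in ("TERMINATING", "TERMINATED", "ERROR"):
            raise RuntimeError(f"Cluster ended in state {state}; check the event log")
        if clock() >= deadline:
            raise TimeoutError(f"Cluster still {state} after {timeout_s} seconds")
        sleep(poll_s)
```

Passing `sleep` and `clock` as parameters is a design choice that lets the loop be exercised without real waiting.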
5. Check Databricks Runtime version
WRONG — using a runtime version that doesn't support your config:
RIGHT — use a supported runtime:
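Runtime: 13.3 LTS (long-term support, stable)
Runtime: 14.3 LTS (long-term support, newer features)
Runtime: 15.0 (non-LTS, latest when this was written, may have compatibility issues)
The set of supported versions changes over time, so check the Databricks Runtime release notes before choosing. If a non-LTS version causes start failures, move the cluster to an LTS version.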
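The versions a workspace currently offers can be listed with `GET /api/2.0/clusters/spark-versions`, which returns `key` and `name` pairs. The sketch below filters such a response for LTS releases; it assumes LTS versions have "LTS" in their display name, and the sample data is a shortened, made-up response rather than real output.

```python
def lts_versions(spark_versions_response):
    """Return the keys of runtime versions whose display name is marked LTS."""
    return sorted(
        v["key"]
        for v in spark_versions_response.get("versions", [])
        if "LTS" in v.get("name", "")
    )


# Shortened sample in the shape of the API response (names abbreviated).
sample = {
    "versions": [
        {"key": "13.3.x-scala2.12", "name": "13.3 LTS"},
        {"key": "15.0.x-scala2.12", "name": "15.0"},
        {"key": "14.3.x-scala2.12", "name": "14.3 LTS"},
    ]
}
print(lts_versions(sample))  # ['13.3.x-scala2.12', '14.3.x-scala2.12']
```

The returned keys are the strings that go in the `spark_version` field of a cluster spec.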
6. Verify cloud resource quotas
WRONG — hitting instance quota limits:
RIGHT — request quota increase:
- AWS: Service Quotas console > Amazon EC2 > Running On-Demand Standard (A, C, D, H, I, M, R, T, Z) instances > Request increase (the limit is counted in vCPUs, not instances)
- Azure: Subscription > Usage + quotas > Request increase
- GCP: IAM & Admin > Quotas > Compute Engine API > Increase
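Before requesting an increase, it helps to estimate how much the cluster needs. The sketch below does the arithmetic for a vCPU-based quota such as the AWS On-Demand limit, assuming the driver uses the same node type as the workers. The function names and the quota figures in the example are illustrative; an i3.xlarge has 4 vCPUs.

```python
def required_vcpus(num_workers, vcpus_per_node, driver_vcpus=None):
    """vCPUs needed for num_workers workers plus one driver."""
    if driver_vcpus is None:
        driver_vcpus = vcpus_per_node  # assume the driver matches the worker type
    return num_workers * vcpus_per_node + driver_vcpus


def fits_quota(num_workers, vcpus_per_node, quota_vcpus, in_use_vcpus=0):
    """True if the cluster fits within the quota alongside what is already running."""
    return in_use_vcpus + required_vcpus(num_workers, vcpus_per_node) <= quota_vcpus


# 5 workers + 1 driver on 4-vCPU nodes -> 24 vCPUs
print(required_vcpus(5, 4))
# 16 vCPUs already in use + 24 needed exceeds a 32-vCPU quota -> False
print(fits_quota(5, 4, quota_vcpus=32, in_use_vcpus=16))
```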
Expected result: the cluster reaches the "Running" state.
Prevention
- Use multiple instance types in the cluster policy for fallback.
- Test init scripts on a standalone cluster before production use.
- Use LTS Databricks Runtime versions for stability.
- Monitor cluster start times and set appropriate timeouts.
- Request quota increases before launching large clusters.
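The first point can be sketched as a cluster policy definition that allows several node types. This assumes the policy definition syntax with an `allowlist` rule on `node_type_id`; the helper name `node_type_policy` is hypothetical. An allowlist lets users switch to another permitted type when one is unavailable, but it does not switch types automatically.

```python
import json


def node_type_policy(allowed_types, default_type):
    """Build a policy definition that restricts node_type_id to an allowlist."""
    if default_type not in allowed_types:
        raise ValueError("default_type must be one of allowed_types")
    return {
        "node_type_id": {
            "type": "allowlist",
            "values": list(allowed_types),
            "defaultValue": default_type,
        }
    }


policy = node_type_policy(["i3.xlarge", "m5d.xlarge", "r5.xlarge"], "i3.xlarge")
print(json.dumps(policy, indent=2))
```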
Built by the developers of DodaTech
Doda Browser, DodaZIP & Durga Antivirus Pro