Fix Azure AKS GPU Node Errors
When you run GPU workloads on Azure Kubernetes Service (AKS), pods can stay Pending even though the GPU node pool was created successfully. This guide explains the most common cause of that failure and shows the fix.
A Common Mistake
The most common mistake is creating a GPU node pool without installing the NVIDIA device plugin. The nodes come up, but they do not advertise nvidia.com/gpu, so GPU workloads cannot be scheduled.
The incomplete setup:
az aks nodepool add --cluster-name my-aks --resource-group my-rg --name gpupool --node-count 1 --node-vm-size Standard_NC6s_v3
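Error output:
The node pool is created without any error. The failure appears only when you deploy a pod that requests nvidia.com/gpu:
kubectl apply -f gpu-job.yaml
The pod stays Pending, and its scheduling events show:
0/1 nodes are available: 1 Insufficient nvidia.com/gpu.
The NVIDIA device plugin is not installed, so the GPUs are not advertised to Kubernetes and the scheduler finds no node with nvidia.com/gpu capacity.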
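The gpu-job.yaml file applied above is not shown in this guide. As a minimal sketch of what such a manifest might look like, the snippet below writes a Job that requests one GPU. The Job name, container name, and image tag are assumptions chosen for illustration, not values from the original setup.

```shell
# Write a hypothetical gpu-job.yaml. The Job name, container name, and
# image tag are illustrative assumptions; substitute your own workload.
cat > gpu-job.yaml <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-job
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: cuda-check
        image: nvidia/cuda:12.4.1-base-ubuntu22.04   # assumed image tag
        command: ["nvidia-smi"]
        resources:
          limits:
            nvidia.com/gpu: 1   # extended resource advertised by the device plugin
EOF
```

The nvidia.com/gpu limit is the part that matters here: without the device plugin, no node offers that resource, which is what produces the Insufficient nvidia.com/gpu event.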
The Correct Approach
To configure a GPU node pool correctly in Azure AKS, create the pool and then install the NVIDIA device plugin:
az aks nodepool add --cluster-name my-aks --resource-group my-rg --name gpupool --node-count 1 --node-vm-size Standard_NC6s_v3
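# Then install the NVIDIA device plugin.
# Note: the Azure/aks-engine project is no longer maintained, so confirm that this manifest URL still resolves before relying on it.
kubectl apply -f https://raw.githubusercontent.com/Azure/aks-engine/master/scripts/device-plugin.yaml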
Successful result:
Once the device plugin pods are running, each GPU node reports its GPUs as allocatable:
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
NAME                     GPU
aks-gpupool-xxxxx-vmss   1
GPU workloads can now request and use the GPU.
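The check above can be turned into a pass/fail step for a script or pipeline. The following is a rough sketch, assuming the two-column NAME/GPU layout shown above; check_gpu_nodes is a hypothetical helper name, not part of kubectl or the Azure CLI.

```shell
# check_gpu_nodes (hypothetical helper): reads the two-column output of
# "kubectl get nodes -o custom-columns=..." on stdin.
# Exits 0 only if at least one node reports a GPU count of 1 or more.
# Nodes without the device plugin show "<none>" and are ignored.
check_gpu_nodes() {
  awk 'NR > 1 && $2 ~ /^[0-9]+$/ && $2 > 0 {
         found = 1
         print $1 " advertises " $2 " GPU(s)"
       }
       END { exit (found ? 0 : 1) }'
}

# Usage against a live cluster (requires kubectl and cluster access):
# kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' | check_gpu_nodes
```

Note the single quotes around the custom-columns expression: they keep the shell from stripping the backslash that escapes the dot in nvidia\.com/gpu.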
How to Prevent This
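Install the NVIDIA device plugin whenever you add a GPU node pool. Confirm GPU availability with kubectl describe node and check that nvidia.com/gpu appears under both Capacity and Allocatable. To keep GPU pods on GPU nodes, use a nodeSelector such as accelerator: nvidia; this only works if the nodes actually carry that label, so add it when you create the pool. Choose the GPU VM family by workload: NC for general GPU compute, ND for deep learning training, NV for visualization.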
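A nodeSelector of accelerator: nvidia matches only nodes that carry that label. One way to arrange this, sketched below, is to pass --labels when creating the pool. The run wrapper is a hypothetical dry-run helper added for illustration: by default it prints the command rather than executing it, so the snippet can be reviewed without Azure credentials.

```shell
# run (hypothetical helper): prints the command unless DRY_RUN=0 is set,
# in which case it executes the command as given.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Same node pool as in this guide, plus a label that GPU pods can
# target with "nodeSelector: accelerator: nvidia".
run az aks nodepool add \
  --cluster-name my-aks \
  --resource-group my-rg \
  --name gpupool \
  --node-count 1 \
  --node-vm-size Standard_NC6s_v3 \
  --labels accelerator=nvidia
```

Set DRY_RUN=0 to execute the command for real once the printed form looks right.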
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro. Secure your cloud with DodaTech.