I encounter a "Failed to initialize NVML: Driver/library version mismatch" error, as the title says, when trying to run a GPU workload:
Warning Failed 91s (x4 over 2m12s) kubelet, ip-10-0-129-17.us-west-2.compute.internal Error: failed to create containerd task: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused "process_linux.go:432: running prestart hook 0 caused "error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: nvml error: driver/library version mismatch\\\\n""": unknown
How can I fix it so the error goes away?
The cause: This error occurs when the NVIDIA driver packages on a node are upgraded while the previously loaded kernel module is still running. The user-space libraries (such as libnvidia-ml) now belong to the new driver version, but the kernel module in memory is still the old one, so NVML reports a driver/library version mismatch until the new module is loaded.
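If you want to confirm the mismatch before rebooting, you can compare the NVIDIA kernel module version currently loaded with the version installed on disk. A minimal check, assuming SSH access to the worker node (these are standard NVIDIA/Linux tools, but exact output varies by node image):

    # Version of the NVIDIA kernel module currently loaded in memory
    cat /proc/driver/nvidia/version

    # Version of the NVIDIA kernel module installed on disk (the one a reboot will load)
    modinfo nvidia | grep ^version

If the two versions differ, the node is in the mismatched state, and nvidia-smi and NVML-based tooling such as nvidia-container-cli will keep failing until the new module is loaded.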
The solution: Drain and reboot the worker node
The simplest way to resolve the problem is to reboot the node: after the driver upgrade, a reboot loads the new kernel module so that it matches the upgraded user-space libraries and the drivers initialize correctly.
Before deploying new workloads on a GPU worker node, we recommend draining the node, completing the driver upgrade, rebooting the node, and only then allowing workloads back onto it, as sketched below.
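A minimal sketch of that workflow, assuming kubectl access to the cluster and SSH access to the node; the node name below is taken from the event in the question, so replace it with your own:

    # Cordon and drain the node so running pods are evicted safely
    # (on older kubectl versions the last flag is --delete-local-data)
    kubectl drain ip-10-0-129-17.us-west-2.compute.internal --ignore-daemonsets --delete-emptydir-data

    # On the node itself: finish the driver upgrade, then reboot
    sudo reboot

    # Once the node is back and Ready, allow scheduling again
    kubectl uncordon ip-10-0-129-17.us-west-2.compute.internal

After the node rejoins the cluster, the GPU pods should start without the NVML version-mismatch error.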