Solving the Initialization Error When Running a CUDA Application inside a Docker Container
Our team has occasionally hit the error “cudaErrorInitializationError with exit code 3” when running a CUDA application inside a Docker container. (Depending on your code, you may see the message “initialization error” instead.)
However, nothing looked wrong on the system: nvidia-smi gave normal output, both on the host and inside the container.
Explanation of the Problem
For your container to access the GPU, the GPU must be initialised before the Docker daemon starts. On Linux, this is done by enabling persistence mode for the GPU, so that the GPU is initialised even when no client is using it. The NVIDIA Persistence Daemon serves this purpose. You can check whether the daemon is running with:
systemctl status nvidia-persistenced
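If the daemon is not running, you can enable and start it. A minimal sketch, assuming the nvidia-persistenced unit was installed along with the NVIDIA driver:

```shell
# Enable the NVIDIA Persistence Daemon and start it immediately,
# so GPUs are kept initialised even with no clients attached
sudo systemctl enable --now nvidia-persistenced

# Confirm it is running (prints "active" on success)
systemctl is-active nvidia-persistenced
```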
Now suppose the NVIDIA Persistence Daemon is already running, but the problem still persists. How can that be?
The catch is that it takes time for the NVIDIA Persistence Daemon to initialise all the GPUs in the server. If the Docker daemon starts before a GPU has been initialised, that GPU cannot be used inside a container.
As a result, depending on how far the NVIDIA Persistence Daemon has progressed when the Docker daemon starts, only the GPUs initialised by that point are usable inside containers; the rest are not.
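You can see which GPUs have persistence mode enabled using nvidia-smi's standard query options — any GPU still showing "Disabled" has not been picked up by the daemon:

```shell
# List each GPU's index and whether persistence mode is enabled
nvidia-smi --query-gpu=index,persistence_mode --format=csv
```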
To solve the problem, we must ensure that the NVIDIA Persistence Daemon has finished starting up before the Docker daemon starts.
This is done by customising the systemd unit file of Docker.
Step 1: Copy the systemd unit file of Docker for modification
Copy the systemd unit file of Docker into /etc/systemd/system:
cp /usr/lib/systemd/system/docker.service /etc/systemd/system
Step 2: Modify the systemd unit file of Docker
Add the service name nvidia-persistenced.service to the end of the After= directive, and add a Wants= directive for it, under the [Unit] section, as in the example below:
[Unit]
Description=Docker Application Container Engine
After=network.target rhel-push-plugin.service registries.service nvidia-persistenced.service
Wants=nvidia-persistenced.service
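Alternatively, instead of copying the whole unit file, the same ordering can be expressed as a systemd drop-in, which systemd merges into the packaged unit and which survives package upgrades. A sketch (the drop-in file name is arbitrary):

```ini
# /etc/systemd/system/docker.service.d/wait-for-nvidia.conf
[Unit]
# Start Docker only after the NVIDIA Persistence Daemon,
# and pull the daemon in if it is not already scheduled
After=nvidia-persistenced.service
Wants=nvidia-persistenced.service
```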
Reload the systemd unit files by calling:
systemctl daemon-reload
Step 3: Reboot the server
Step 4: You’re good to go
Start your container, and you should now have no problem running your CUDA applications.
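A quick way to verify the fix is to run nvidia-smi from inside a fresh container. The image tag below is just an example; use whichever CUDA image matches your driver version:

```shell
# Should list all GPUs without an initialisation error
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```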