Solving Initialization Error when running CUDA Application inside Docker container
A few times, our team hit the error “cudaErrorInitializationError with exit code 3” when running a CUDA application inside a Docker container. (Depending on your code, you may see the message “initialization error” instead.)
However, when we checked the system, nothing looked out of the ordinary. Moreover, nvidia-smi gave normal output, both on the host and inside the container.
Explanation of the Problem
Problem 1
In order for your container to access a GPU, the GPU must be initialised before the Docker daemon starts. On Linux, this is done by enabling persistence mode for the GPU, so that it gets initialised even when no client is using it. The NVIDIA Persistence Daemon serves this purpose. You can check whether the daemon is running with:
systemctl status nvidia-persistenced
Problem 2
Now I already have the NVIDIA Persistence Daemon running, but the problem still persists. How can that be?
The catch is that it takes time for the NVIDIA Persistence Daemon to initialise all the GPUs in the server. If the Docker daemon starts before a GPU has been initialised, that GPU cannot be used inside a Docker container.
As a result, depending on how far the NVIDIA Persistence Daemon has progressed, only the GPUs initialised before the Docker daemon started are usable inside containers; the rest are not.
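To see whether this race is what you are hitting, you can compare when the two daemons became active, using systemd's ActiveEnterTimestamp property. A sketch (guarded so it only does real work on a systemd host):

```shell
# Print when each daemon finished starting; if docker's timestamp is
# earlier than nvidia-persistenced's, some GPUs may have been initialised
# too late for containers to see them.
if command -v systemctl >/dev/null 2>&1; then
  systemctl show -p ActiveEnterTimestamp nvidia-persistenced docker 2>/dev/null \
    || echo "systemd not running here; try this on the affected host"
else
  echo "systemctl not found; try this on the affected host"
fi
```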
Resolution Method
To solve the problem, we have to ensure that the NVIDIA Persistence Daemon completes its startup before the Docker daemon starts.
This is done by customising the systemd unit file of Docker.
Step 1: Copy the systemd unit file of Docker for modification
Copy the systemd unit file of Docker from /usr/lib/systemd/system to /etc/systemd/system:
cp /usr/lib/systemd/system/docker.service /etc/systemd/system
Step 2: Modify the systemd unit file of Docker
Add the service name nvidia-persistenced.service to the end of the After= and Wants= directives under the [Unit] section, as in the example below:
[Unit]
Description=Docker Application Container Engine
Documentation=http://docs.docker.com
After=network.target rhel-push-plugin.service registries.service nvidia-persistenced.service
Wants=docker-storage-setup.service nvidia-persistenced.service
Requires=rhel-push-plugin.service registries.service
Requires=docker-cleanup.timer
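If you prefer to script the edit rather than open a text editor, the same change can be made with sed. This is a sketch, demonstrated on a sample file; on a real host you would point UNIT at /etc/systemd/system/docker.service instead:

```shell
# Append nvidia-persistenced.service to the After= and Wants= lines.
# UNIT points at a throwaway sample here for demonstration.
UNIT=./docker.service.sample
cat > "$UNIT" <<'EOF'
[Unit]
Description=Docker Application Container Engine
After=network.target
Wants=docker-storage-setup.service
EOF
sed -i -E 's/^(After=.*)$/\1 nvidia-persistenced.service/; s/^(Wants=.*)$/\1 nvidia-persistenced.service/' "$UNIT"
grep -E '^(After|Wants)=' "$UNIT"
```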
Reload the systemd unit files by calling:
systemctl daemon-reload
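As an aside, instead of copying and editing the whole unit file, you could add only the two directives via a systemd drop-in: After= and Wants= are list-valued settings, so drop-in files extend them rather than replace them. A sketch, assuming a drop-in at /etc/systemd/system/docker.service.d/nvidia-order.conf (the file name is arbitrary):

```
# /etc/systemd/system/docker.service.d/nvidia-order.conf
[Unit]
After=nvidia-persistenced.service
Wants=nvidia-persistenced.service
```

The same systemctl daemon-reload is needed afterwards, and this variant keeps your Docker unit file in sync with package updates.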
Step 3: Reboot the server
reboot
Step 4: You’re good to go
Start your container, and you should now have no problem running your CUDA applications.