Solving the Initialization Error When Running a CUDA Application Inside a Docker Container

Our team has several times hit the error “cudaErrorInitializationError with exit code 3” when running a CUDA application inside a Docker container. (Depending on your code, you may see the error message “initialization error” instead.)

[Screenshot: cudaErrorInitializationError with exit code 3 when running the CUDA sample application “nvgraph_Pagerank”]

However, when we checked the system, nothing unusual could be found. Moreover, nvidia-smi gave normal output, both on the host and inside the container:

[Screenshot: typical output from the “nvidia-smi” command]

Explanation of the Problem

Problem 1

For your container to access a GPU, the GPU must be initialised before the Docker daemon starts. On Linux, this is done by enabling persistence mode for the GPU, so that the GPU is initialised even when no client is using it. The NVIDIA Persistence Daemon is used for this purpose. You can check whether the daemon is running with:

systemctl status nvidia-persistenced
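
An active daemon does not by itself show that every GPU is in persistence mode. As a minimal sketch, the helper below filters the CSV output of nvidia-smi and prints the index of each GPU whose persistence mode is not reported as Enabled. The function name gpus_without_persistence is made up for illustration, and the sketch assumes the mode is reported as either Enabled or Disabled.

```shell
# Hypothetical helper: reads lines such as "0, Enabled" on stdin and
# prints the index of every GPU whose persistence mode is not "Enabled".
gpus_without_persistence() {
    awk -F', *' '$2 != "Enabled" { print $1 }'
}

# Usage on a GPU host (needs the NVIDIA driver tools installed):
#   nvidia-smi --query-gpu=index,persistence_mode --format=csv,noheader | gpus_without_persistence
```

Empty output means every listed GPU reported persistence mode as enabled.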

Problem 2

Now the NVIDIA Persistence Daemon is running, but the problem still persists. How can that be?

The catch is that it takes time for the NVIDIA Persistence Daemon to initialise all GPUs in the server. If the Docker daemon starts before a GPU has been initialised, that GPU cannot be used inside a Docker container.

As a result, depending on how far the NVIDIA Persistence Daemon has progressed in its startup, only the GPUs initialised before the Docker daemon starts can be used inside containers; the others are unusable.
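
One rough way to see whether this race happened on the last boot is to compare when each unit became active. The sketch below assumes systemd's ActiveEnterTimestampMonotonic property (microseconds since boot, 0 if the unit never became active); the helper name started_before is hypothetical. It compares unit activation times only, so treat it as an approximation of the GPU initialisation order rather than proof.

```shell
# Hypothetical helper: succeeds when the first monotonic timestamp
# (microseconds since boot) is set and is earlier than the second.
started_before() {
    [ "$1" -gt 0 ] && [ "$1" -lt "$2" ]
}

# Usage on a systemd host:
#   nv=$(systemctl show -p ActiveEnterTimestampMonotonic nvidia-persistenced.service | cut -d= -f2)
#   dk=$(systemctl show -p ActiveEnterTimestampMonotonic docker.service | cut -d= -f2)
#   started_before "$nv" "$dk" && echo "ordering OK" || echo "Docker became active first"
```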

Resolution Method

To solve the problem, we have to ensure that the NVIDIA Persistence Daemon has completed its startup before the Docker daemon starts.

This is done by customising the systemd unit file of Docker.

Step 1: Copy the systemd unit file of Docker for modification

Copy the systemd unit file of Docker from /usr/lib/systemd/system to /etc/systemd/system :

cp /usr/lib/systemd/system/docker.service /etc/systemd/system
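
The unit file is not in the same place on every distribution, and a customised copy may already exist in /etc/systemd/system. As a hedged sketch, the helper below copies a unit file into an override directory and refuses to overwrite an existing copy. The function name copy_unit_for_override is hypothetical; the usage lines assume systemd's FragmentPath property, which reports the unit file currently in use.

```shell
# Hypothetical helper: copy a unit file into an override directory,
# refusing to overwrite a copy that is already there.
copy_unit_for_override() {
    src=$1
    dest_dir=$2
    dest="$dest_dir/$(basename "$src")"
    if [ -e "$dest" ]; then
        echo "refusing to overwrite $dest" >&2
        return 1
    fi
    cp "$src" "$dest"
}

# Usage (as root):
#   unit_path=$(systemctl show -p FragmentPath docker.service | cut -d= -f2)
#   copy_unit_for_override "$unit_path" /etc/systemd/system
```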

Step 2: Modify the systemd unit file of Docker

Add the service name nvidia-persistenced.service to the end of the After and Wants directives under the [Unit] section, as seen in the example below:

[Unit]
Description=Docker Application Container Engine
Documentation=http://docs.docker.com
After=network.target rhel-push-plugin.service registries.service nvidia-persistenced.service
Wants=docker-storage-setup.service nvidia-persistenced.service
Requires=rhel-push-plugin.service registries.service
Requires=docker-cleanup.timer
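
An alternative to editing a full copy of the unit file, not the method used in this post, is a systemd drop-in: a small file under /etc/systemd/system/docker.service.d whose After and Wants entries are added to those of the packaged unit. The sketch below writes such a file; the helper name write_nvidia_dropin and the file name nvidia-persistenced.conf are assumptions chosen for illustration.

```shell
# Hypothetical helper: write a drop-in that adds the ordering and the
# dependency without copying the whole unit file.
write_nvidia_dropin() {
    dropin_dir=$1
    mkdir -p "$dropin_dir" || return 1
    cat > "$dropin_dir/nvidia-persistenced.conf" <<'EOF'
[Unit]
After=nvidia-persistenced.service
Wants=nvidia-persistenced.service
EOF
}

# Usage (as root):
#   write_nvidia_dropin /etc/systemd/system/docker.service.d
```

With a drop-in, later package updates to docker.service are still picked up, since only the two extra directives live in /etc.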

Reload the systemd unit files by calling:

systemctl daemon-reload
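
Before rebooting, it may be worth confirming that systemd has picked up the change. The sketch below checks whether a unit name appears in a property line printed by systemctl show; the helper name unit_listed is hypothetical, and the input is assumed to have the form "Property=unit1 unit2 ...".

```shell
# Hypothetical helper: reads a "Property=unit1 unit2 ..." line on stdin
# and succeeds if the given unit name is one of the listed units.
unit_listed() {
    cut -d= -f2- | tr ' ' '\n' | grep -qxF "$1"
}

# Usage on a systemd host:
#   systemctl show -p After docker.service | unit_listed nvidia-persistenced.service && echo "After: OK"
#   systemctl show -p Wants docker.service | unit_listed nvidia-persistenced.service && echo "Wants: OK"
```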

Step 3: Reboot the server

reboot

Step 4: You’re good to go

Start your container; you should now be able to run your CUDA applications without the initialization error.

[Screenshot: running the CUDA sample application “nvgraph_Pagerank” successfully inside a Docker container]

Infrastructure Architect experienced in design and implementation of IT solutions for enterprises