Working with Docker

The DGX server uses Docker containers to keep dependencies isolated. The typical process of working with Docker is:

Building or pulling a Docker image.
Tagging the built or pulled Docker image.
Running a Docker container from the image.
Attaching to the container, either:
1. Directly through a terminal emulator, or
2. Setting up the Docker container to be accessible through SSH.

Pulling a Docker Image

Pulling a Docker image is the easiest way to get started. Pulling means downloading a Docker image from a registry to the DGX server, which is done by running docker pull NAME[:TAG|@DIGEST].

After pulling, don’t forget to tag the image so it is recognizable.

More information on docker pull can be found at Docker Docs.
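As a minimal sketch of the pull step, the helper below wraps docker pull for the NAME[:TAG] form only. The function name pull_image and the example image are illustrative and not part of the DGX setup. When no tag is given, the helper passes latest explicitly, which is also the tag docker pull falls back to by default.

```shell
# Hypothetical helper (not part of the DGX setup): pull NAME[:TAG].
# Usage: pull_image NAME [TAG]
pull_image() {
    # docker pull also assumes "latest" when no tag is given
    docker pull "$1:${2:-latest}"
}

# Example (the image name is only an illustration):
#   pull_image ubuntu 22.04
```

Remember that the pulled image still needs to be tagged afterwards so it is recognizable.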

Building a Docker Image

Docker images can also be built from Dockerfiles. This is done with the following:

Create a folder and put your Dockerfile, named Dockerfile, in that folder.
Run docker build [path to folder] to build the Docker image.

Building sometimes leaves behind intermediate images that are not removed, usually when a Dockerfile fails to build. Remove these images with docker rmi IMAGE_NAME to save space on the server and to keep the image library clean.
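The build and cleanup steps can be sketched as the helpers below. The function names are hypothetical. The sketch uses the -t option of docker build to name the image at build time, and the dangling=true filter of docker images to list untagged images of the kind a failed build tends to leave behind. The listing may include images that are not yours, so check before removing anything.

```shell
# Hypothetical helpers; the function names are illustrative.

# Build the Dockerfile in FOLDER and name the resulting image TAG.
# Usage: build_image TAG FOLDER
build_image() {
    docker build -t "$1" "$2"
}

# List untagged ("dangling") images, e.g. leftovers from failed builds.
list_leftover_images() {
    docker images --filter dangling=true
}

# Remove one image by name or ID, after checking that it is yours.
# Usage: remove_image IMAGE
remove_image() {
    docker rmi "$1"
}

# Example (names are only illustrations):
#   build_image jdoe/demo ./demo
#   list_leftover_images
```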

Tagging a Docker Image

On the DGX server, images must be tagged in the format YOUR_SHORT_USERNAME/DESCRIPTION. This can be done by running docker tag IMAGE_NAME YOUR_SHORT_USERNAME/DESCRIPTION.
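A small sketch of the tagging convention follows. The helper name tag_image and the example names are assumptions made for illustration. It only joins the username and description with a slash and passes the result to docker tag.

```shell
# Hypothetical helper: tag IMAGE as USERNAME/DESCRIPTION.
# Usage: tag_image IMAGE USERNAME DESCRIPTION
tag_image() {
    if [ "$#" -ne 3 ]; then
        echo "usage: tag_image IMAGE USERNAME DESCRIPTION" >&2
        return 1
    fi
    docker tag "$1" "$2/$3"
}

# Example (names are only illustrations):
#   tag_image ubuntu:22.04 jdoe ubuntu-base
```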

Optional but recommended: Utilize Docker Volumes for Datasets

See here.

Running the Docker Container

Experiments must be run inside a SLURM session, and starting that session from within a tmux or screen session is strongly recommended. The steps are as follows.

Start a tmux/screen session.
In the tmux/screen session, start a SLURM session with srun --job-name=$JOBNAME --pty --ntasks=1 --cpus-per-task=$NCPU --mem=${MEM}G --gres=gpu:$NGPU bash

$JOBNAME
A short job name to make your session identifiable, for example 2_gpu for a job using 2 GPUs.

$NCPU
Number of CPU cores to assign to the job.

$MEM
Amount of RAM to assign to the job, in gigabytes (the command appends the G suffix).

$NGPU
Number of GPUs to assign to the job.
Start a Docker container with access to the GPUs: docker run --shm-size=16g -it -p $PortOnDockerHost:$PortInDockerContainer -v $LOCAL_DIR:/workspace $DOCKER_IMAGE bash

--shm-size
Size of the shared memory (/dev/shm) available to processes in the container. 16g is a good value for most jobs.

$PortOnDockerHost
Docker forwards this port on the server to $PortInDockerContainer in the container. The container port must first be exposed within the Dockerfile. The value of $PortOnDockerHost must be within your allocated port range, which can be found at Important notes.

$PortInDockerContainer
The port that is exposed within the Dockerfile, i.e. $PortOnDockerHost is mapped to $PortInDockerContainer

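$LOCAL_DIR
The directory on the server that is mounted at /workspace inside the container.

$DOCKER_IMAGE
The tag given to the image in Tagging a Docker Image.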
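The srun and docker run commands above can be sketched with concrete values as follows. Every value and both function names are assumptions chosen for illustration: replace them with numbers that fit your job, a port from your allocated range, and your own directory and image tag.

```shell
# Illustrative values only; adjust them to your job and allocation.
JOBNAME=2_gpu
NCPU=8
MEM=64                       # gigabytes; the G suffix is added below
NGPU=2
PortOnDockerHost=8888        # hypothetical; must lie in your allocated port range
PortInDockerContainer=8888   # hypothetical; the port exposed in the Dockerfile
LOCAL_DIR=/data/my_project   # hypothetical directory, mounted at /workspace
DOCKER_IMAGE=jdoe/demo       # hypothetical tag

# Request an interactive SLURM session (run this inside tmux/screen).
start_slurm_session() {
    srun --job-name="$JOBNAME" --pty --ntasks=1 --cpus-per-task="$NCPU" \
        --mem="${MEM}G" --gres="gpu:$NGPU" bash
}

# Start the container (run this inside the SLURM session).
start_container() {
    docker run --shm-size=16g -it \
        -p "$PortOnDockerHost:$PortInDockerContainer" \
        -v "$LOCAL_DIR:/workspace" \
        "$DOCKER_IMAGE" bash
}
```

Because start_slurm_session opens a new shell, the variables and start_container need to be defined again (or sourced from a file) inside that shell before the container is started.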
If for some reason you’re not automatically attached, run:
1. docker ps to see the name/id of the container, then
2. docker attach [container_name/id] to attach to it.

Once you’re done and you’ve exited the container, the container is stopped but not removed.
1. If you still want to use it later, run docker start [container_name/id] to restart the container, then docker attach [container_name/id] to attach to it.
2. If you don’t, remove it using docker rm [container_name/id].

To check whether a stopped container still exists, run docker ps -a.
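The container lifecycle described above can be sketched with three small helpers. The function names are hypothetical. container_status assumes you pass a container name rather than an ID, and the name filter of docker ps matches substrings, so it may list more than one container.

```shell
# Hypothetical helpers; CONTAINER is a name or ID as shown by docker ps.

# Restart a stopped container and attach to it.
# Usage: resume_container CONTAINER
resume_container() {
    docker start "$1" && docker attach "$1"
}

# Print name and status of matching containers, running or stopped.
# Usage: container_status NAME
container_status() {
    docker ps -a --filter "name=$1" --format '{{.Names}}: {{.Status}}'
}

# Remove a stopped container that is no longer needed.
# Usage: remove_container CONTAINER
remove_container() {
    docker rm "$1"
}
```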

The next section explains how to use SSH to connect directly to your Docker container from a remote machine.