Nice Scripts

The following are some nice scripts provided by Yvan Satyawan <saty> to make working with the DGX servers easier. Each script can be saved anywhere in your home directory and run by name once it is marked as executable and the directory it is saved in has been added to your PATH.
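
As a minimal sketch of that setup: the directory `$HOME/bin` and the script name `hello-dgx` are assumptions chosen for illustration, and the tiny script written here only stands in for one of the scripts on this page.

```shell
#!/bin/bash
# Hypothetical location; any directory inside your home directory works.
script_dir="${SCRIPT_DIR:-$HOME/bin}"
mkdir -p "$script_dir"

# Stand-in for one of the scripts on this page (hypothetical name).
cat > "$script_dir/hello-dgx" <<'EOF'
#!/bin/bash
echo "hello from a script on PATH"
EOF

# Mark the script as executable, then add its directory to PATH
# for the current shell session.
chmod +x "$script_dir/hello-dgx"
export PATH="$script_dir:$PATH"

# The script can now be called by name from any directory.
hello-dgx
```

The `export` line only affects the current shell. To keep the change across logins, the same line would typically go into a shell startup file such as `~/.bashrc`.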

Connecting to the DGX servers

This shell script simplifies connecting to the DGX servers by wrapping SSH login and port forwarding in a small argument-based interface. Put it somewhere on your local machine and call it whenever you want to connect to the DGX servers.

#!/bin/bash

print_usage() {
    echo "usage: "$0" [ -s server ] [ -f port ] USERNAME"
    echo ""
    echo "If no arguments are given, connects to DGX using the given username"
    echo "optional arguments:"
    echo "    -s server: set the servername to connect to. Defaults to dgx."
    echo "               the server parameter is prepended to .cloudlab.zhaw.ch."
    echo "    -f port:   if the flag is set, then instead of logging in, an"
    echo "               SSH tunnel is built between port 8888 on the localhost"
    echo "               and the given port on the selected server."
    echo "    -h:        show this usage page."
    echo ""
    echo "positional arguments:"
    echo "    USERNAME:  your ZHAW short username"
}

SERVER="dgx"
PORT=""
fflag=false

# Do getopts
options=':s:f:h'
while getopts $options option
do
    case "$option" in
        s  ) SERVER=$OPTARG;;
        f  ) fflag=true; PORT=$OPTARG;;
        h  ) print_usage; exit;;
        \? ) echo "unknown option: -$OPTARG" >&2; print_usage; exit 1;;
        :  ) echo "missing option argument for -$OPTARG">&2; print_usage; exit 1;;
    esac
done

USERNAME=${@:$OPTIND:1}

# Make sure a username was given
if [ -z "$USERNAME" ]
then
    echo "error: USERNAME is a required parameter"
    print_usage
    exit 1
fi

if [ "$fflag" = false ]
then
    echo "Logging in to "$SERVER" with username "$USERNAME""
    ssh $USERNAME@$SERVER.cloudlab.zhaw.ch;
else
    echo "Forwarding port "$PORT" from "$SERVER" server with login "$USERNAME""
    ssh -L 127.0.0.1:8888:0.0.0.0:$PORT -N $USERNAME@$SERVER.cloudlab.zhaw.ch;
fi
exit 0
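
To see which `ssh` command each combination of arguments leads to without opening a connection, the logic can be sketched as a dry run. `build_ssh_command`, the username `jdoe`, the server name `dgx2`, and the port `8890` are hypothetical and used only for illustration.

```shell
#!/bin/bash
# Dry run: print the ssh command instead of executing it.
# build_ssh_command is a hypothetical helper, not part of the script above.
build_ssh_command() {
    local username=$1 server=${2:-dgx} port=${3:-}
    local host="$username@$server.cloudlab.zhaw.ch"
    if [ -z "$port" ]; then
        # No port given: plain login, as when -f is not set.
        echo "ssh $host"
    else
        # Port given: tunnel local port 8888 to that port on the server.
        echo "ssh -L 127.0.0.1:8888:0.0.0.0:$port -N $host"
    fi
}

login_cmd=$(build_ssh_command jdoe)            # like: <script> jdoe
tunnel_cmd=$(build_ssh_command jdoe dgx2 8890) # like: <script> -s dgx2 -f 8890 jdoe
echo "$login_cmd"
echo "$tunnel_cmd"
```

Note that the options must come before USERNAME, because `getopts` stops parsing at the first positional argument.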

Interactive `srun` command

This shell script prompts for the resources you need and then starts an interactive srun session with them. It should be placed somewhere in your home folder on the server.

#!/bin/bash

# This program makes srun interactive

# Ask user for parameters
echo "Number of CPUs (Defaults to 1):"
read ncpu

echo "Memory to use (in GBs. Defaults to 32):"
read mem

echo "Number of GPUs (Defaults to 1):"
read ngpu

echo "Job name (Defaults to executable name):"
read jobname

# checks parameters and applies defaults if left empty
# CPU
if test -z "$ncpu"
then
    echo "Using 1 CPU"
    ncpu=1
else
    echo "Using "$ncpu" CPU Core(s)"
fi

# Memory
if test -z "$mem"
then
    echo "Using 32G of memory"
    mem=32
else
    echo "Using "$mem"G of memory"
fi

# GPU
if test -z "$ngpu"
then
    echo "Using 1 GPU"
    ngpu=1
else
    echo "Using "$ngpu" GPU(s)"
fi

# Job Name
if test -z "$jobname"
then
    echo "Using 'bash' as job name"
    jobname=bash
else
    echo "Using "$jobname" as job name"
fi

srun --job-name=$jobname --pty --ntasks=1 --cpus-per-task=$ncpu --mem=${mem}G --gres=gpu:$ngpu bash
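
The four default checks above can also be written more compactly with Bash parameter expansion. This is an alternative sketch, not the author's script: it simulates empty answers instead of prompting, and it prints the resulting command rather than running it, so it works on a machine without Slurm.

```shell
#!/bin/bash
# Simulate the user pressing Enter at every prompt (all answers empty).
ncpu="" mem="" ngpu="" jobname=""

# ${var:-default} expands to the default when var is unset or empty.
ncpu=${ncpu:-1}
mem=${mem:-32}
ngpu=${ngpu:-1}
jobname=${jobname:-bash}

# Build the same command line as the script above and print it.
srun_cmd="srun --job-name=$jobname --pty --ntasks=1 --cpus-per-task=$ncpu --mem=${mem}G --gres=gpu:$ngpu bash"
echo "$srun_cmd"
```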

Non-Interactive `sbatch` command

This is a minimal example sbatch script. The script includes directives (lines starting with #SBATCH) for configuring various aspects of the job.

#!/bin/bash

#SBATCH --job-name=my_job
#SBATCH --output=job_output.txt  # Directs the standard output to 'job_output.txt'
#SBATCH --error=job_error.txt    # Directs the standard error to 'job_error.txt'
#SBATCH --ntasks=1               # Requests 1 task
#SBATCH --time=10:00             # Sets a limit of 10 minutes for the job
#SBATCH --mem=100                # Requests 100 megabytes of memory
#SBATCH --partition=standard     # Submits the job to the 'standard' partition
#SBATCH --cpus-per-task=4        # Requests 4 CPUs per task
#SBATCH --gres=gpu:2             # Requests 2 GPUs

# Load any modules or software if needed
# module load python/3.8

# Execute your application
./my_program
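
A job script like this is handed to Slurm with `sbatch`. The sketch below writes a shortened variant to a file, checks it, and submits it only when `sbatch` is installed. The file name `example_job.sh` is hypothetical, `./my_program` is replaced by a plain `echo`, and the first-column check rests on the assumption that directives are only recognised at the start of a line.

```shell
#!/bin/bash
# Hypothetical file name for the job script.
job_file="example_job.sh"

# Shortened variant of the example above; the program is a plain echo.
cat > "$job_file" <<'EOF'
#!/bin/bash
#SBATCH --job-name=my_job
#SBATCH --ntasks=1
#SBATCH --time=10:00
echo "job running on $(hostname)"
EOF

# Count the directive lines that start in the first column.
directive_count=$(grep -c '^#SBATCH' "$job_file")
echo "found $directive_count #SBATCH directives"

# Submit only where Slurm is installed; otherwise just report it.
if command -v sbatch >/dev/null 2>&1; then
    sbatch "$job_file"
else
    echo "sbatch not found; skipping submission"
fi
```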

Simpler `nvidia-docker`

This shell script should be placed within your home directory on the server. It makes starting an nvidia-docker container simpler by requiring fewer parameters.

#!/bin/bash

# A simplified method to run an nvidia-docker container that's been set up to
# cover most needs of the DataLab team

print_usage() {
    echo "ndock: A simplified nvidia-docker command"
    echo "usage: ndock local_dir_mapping docker_image [port]"
    echo ""
    echo "local_dir_mapping: the directory to be mapped to /workspace inside "
    echo "                   the docker container"
    echo "docker_image:      the docker image tag or id to be instantiated"
    echo "port:              if the ssh port is to be exposed, the server port"
    echo "                   to expose it on"
}

# Actual code
if [ $# -lt 2 ];
    then
    print_usage
    exit 1
else
    if test -z "$3";
        then
        echo "Running docker image "$2" from "$1"";
    else
        echo "Running docker image "$2" from "$1", exposing ssh on port "$3"";
        PORT="-p 127.0.0.1:"$3":22";
    fi

    nvidia-docker run --shm-size=16g -it ${PORT} -v "$1":/workspace "$2" bash;
fi
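
The optional port mapping can also be collected in a Bash array, which keeps directory names containing spaces intact. This is an alternative sketch rather than the author's script: `build_docker_args`, the directory `my project`, the image tag `my-image:latest`, and the port `2222` are hypothetical, and the command is printed instead of executed.

```shell
#!/bin/bash
# build_docker_args is a hypothetical helper that fills the docker_args array.
build_docker_args() {
    local dir=$1 image=$2 port=${3:-}
    docker_args=(run --shm-size=16g -it)
    if [ -n "$port" ]; then
        # Expose the container's SSH port on the given server port.
        docker_args+=(-p "127.0.0.1:$port:22")
    fi
    docker_args+=(-v "$dir:/workspace" "$image" bash)
}

build_docker_args "$HOME/my project" my-image:latest 2222

# Print the command with shell quoting instead of running it.
printf 'nvidia-docker'
printf ' %q' "${docker_args[@]}"
printf '\n'

# To actually start the container:
# nvidia-docker "${docker_args[@]}"
```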

Making a TensorBoard docker container

To make a TensorBoard docker container that runs from a specific logging folder, first create a Dockerfile as follows:

# Dockerfile that installs tensorboard

FROM amd64/python:3.6.9

WORKDIR /workspace

RUN chmod -R a+w . && \
    pip install --upgrade pip && \
    pip install --no-cache-dir tensorflow==1.14.0 tensorboard==1.14.0

EXPOSE 6006
ENTRYPOINT tensorboard --logdir /workspace
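
Building an image from this Dockerfile could look like the sketch below. The tag `my-tensorboard` is a hypothetical name (it is what would later be passed as the image name), and the command is only printed here so the sketch can be run without Docker.

```shell
#!/bin/bash
# Hypothetical tag; choose the name you want to refer to the image by.
image_name="my-tensorboard"
# Directory that contains the Dockerfile (the build context).
context_dir="."

build_cmd=(docker build -t "$image_name" "$context_dir")
echo "${build_cmd[*]}"

# Uncomment to actually build the image:
# "${build_cmd[@]}"
```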

Build the image from this Dockerfile. Then use the following script to start the Docker container.

#!/bin/bash

if test -z "$1";
    then
    echo "Error: No workspace mapping specified."
    exit 1
fi
echo "Running saty/tensorboard with "$1" as workspace. Mapping port localhost:[YOUR_PORT] -> 0.0.0.0:6006"
srun --pty --ntasks=1 --cpus-per-task=4 --mem=8G docker run -it -v "$1":/workspace -p 127.0.0.1:[YOUR_PORT]:6006 --name tensorboard [IMAGE_NAME]

[YOUR_PORT]: This is the port you would like to expose the docker container at.
[IMAGE_NAME]: This is the name of the image built using the Dockerfile above.

Then forward [YOUR_PORT] on the server to port 5000 on your local machine using the following script:

#!/bin/bash
echo "Forwarding TensorBoard port from dgx server"
ssh -L 127.0.0.1:5000:0.0.0.0:[PORT] -N [USERNAME]@[SERVER].cloudlab.zhaw.ch;

[PORT]: The port that was chosen in the previous script.
[USERNAME]: Your username on the DGX servers.
[SERVER]: The DGX server that is hosting the Docker container.

It’s easiest to just always reserve one port for TensorBoard so the variables above never need to be changed.
