Nice Scripts
The following are some nice scripts provided by Yvan Satyawan <saty> to make working with the DGX servers easier.
Each of these scripts can be saved somewhere in your home directory and run from anywhere, provided the script is marked as executable and its directory is added to your PATH.
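As a sketch of that setup (the script name `dgx` and its contents here are just placeholders for illustration):

```shell
# Assume a script saved as ~/bin/dgx (the name is just an example)
mkdir -p ~/bin
printf '#!/bin/bash\necho "hello from dgx"\n' > ~/bin/dgx

# Mark it executable
chmod +x ~/bin/dgx

# Add ~/bin to PATH for the current session; append this line to
# ~/.bashrc to make the change permanent
export PATH="$HOME/bin:$PATH"

# The script can now be run from anywhere by name
dgx
```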
Connecting to the DGX servers
This shell script makes connecting to the DGX servers much easier by providing a simple argument-based interface for SSH logins and port forwarding. Put it somewhere on your local machine and call it whenever you want to connect to the DGX servers.
#!/bin/bash
print_usage() {
    echo "usage: $0 [ -s server ] [ -f port ] USERNAME"
    echo ""
    echo "If no optional arguments are given, connects to dgx using the given username"
    echo "optional arguments:"
    echo "  -s server: set the server name to connect to. Defaults to dgx."
    echo "             The server parameter is prepended to .cloudlab.zhaw.ch."
    echo "  -f port:   if the flag is set, then instead of logging in, an"
    echo "             SSH tunnel is built between port 8888 on the localhost"
    echo "             and the given port on the selected server."
    echo "  -h:        show this usage page."
    echo ""
    echo "positional arguments:"
    echo "  USERNAME: your ZHAW short username"
}
SERVER="dgx"
PORT=""
fflag=false
# Do getopts
options=':s:f:h'
while getopts $options option
do
    case "$option" in
        s  ) SERVER=$OPTARG;;
        f  ) fflag=true; PORT=$OPTARG;;
        h  ) print_usage; exit;;
        \? ) echo "unknown option: -$OPTARG" >&2; print_usage; exit 1;;
        :  ) echo "missing option argument for -$OPTARG" >&2; print_usage; exit 1;;
    esac
done
USERNAME=${@:$OPTIND:1}
# Make sure a username was given
if [ -z "$USERNAME" ]
then
    echo "error: USERNAME is a required parameter"
    print_usage
    exit 1
fi
if [ "$fflag" = false ]
then
    echo "Logging in to $SERVER with username $USERNAME"
    ssh "$USERNAME@$SERVER.cloudlab.zhaw.ch"
else
    echo "Forwarding port $PORT from $SERVER server with login $USERNAME"
    ssh -L "127.0.0.1:8888:0.0.0.0:$PORT" -N "$USERNAME@$SERVER.cloudlab.zhaw.ch"
fi
exit 0
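The getopts pattern used above can be exercised in isolation. This sketch (the `parse` function and its outputs are illustrative, not part of the script) shows how the leading `:` in the optstring enables silent error handling, so unknown options arrive as `?` and missing arguments as `:`:

```shell
#!/bin/bash
# Standalone demo of the getopts pattern used in the connect script.
parse() {
    # OPTIND must be reset (here via local) to call getopts repeatedly
    local OPTIND option server="dgx"
    while getopts ':s:h' option "$@"; do
        case "$option" in
            s  ) server=$OPTARG;;
            h  ) echo "help"; return 0;;
            \? ) echo "unknown option: -$OPTARG"; return 1;;
            :  ) echo "missing argument for -$OPTARG"; return 1;;
        esac
    done
    # First positional argument after the options, as in the script above
    echo "server=$server user=${@:$OPTIND:1}"
}

parse -x jdoe        # -> unknown option: -x
parse -s             # -> missing argument for -s
parse -s dgx2 jdoe   # -> server=dgx2 user=jdoe
```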
Interactive srun command
This is a shell script that makes srun interactive. It should be placed somewhere in your home folder on the server.
#!/bin/bash
# This program makes srun interactive
# Ask user for parameters
echo "Number of CPUs (Defaults to 1):"
read ncpu
echo "Memory to use (in GBs. Defaults to 32):"
read mem
echo "Number of GPUs (Defaults to 1):"
read ngpu
echo "Job name (Defaults to executable name):"
read jobname
# Check parameters and apply defaults if left empty
# CPU
if test -z "$ncpu"
then
    echo "Using 1 CPU"
    ncpu=1
else
    echo "Using $ncpu CPU Core(s)"
fi
# Memory
if test -z "$mem"
then
    echo "Using 32G of memory"
    mem=32
else
    echo "Using ${mem}G of memory"
fi
# GPU
if test -z "$ngpu"
then
    echo "Using 1 GPU"
    ngpu=1
else
    echo "Using $ngpu GPU(s)"
fi
# Job Name
if test -z "$jobname"
then
    echo "Using 'bash' as job name"
    jobname=bash
else
    echo "Using $jobname as job name"
fi
srun --job-name="$jobname" --pty --ntasks=1 --cpus-per-task="$ncpu" --mem="${mem}G" --gres="gpu:$ngpu" bash
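The four empty-check-and-default blocks above can be collapsed using `${var:-default}` parameter expansion. A sketch of the same script written that way (the final srun call is guarded with a `command -v` check, so the sketch also runs on machines without Slurm):

```shell
#!/bin/bash
# Same defaults as the interactive script above, via parameter expansion.
read -p "Number of CPUs (Defaults to 1): " ncpu
read -p "Memory in GB (Defaults to 32): " mem
read -p "Number of GPUs (Defaults to 1): " ngpu
read -p "Job name (Defaults to bash): " jobname

# ${var:-default} substitutes the default when the variable is unset or empty
ncpu=${ncpu:-1}
mem=${mem:-32}
ngpu=${ngpu:-1}
jobname=${jobname:-bash}

echo "Using $ncpu CPU(s), ${mem}G of memory, $ngpu GPU(s), job name '$jobname'"

# Only attempt the allocation where srun exists
if command -v srun >/dev/null
then
    srun --job-name="$jobname" --pty --ntasks=1 --cpus-per-task="$ncpu" \
        --mem="${mem}G" --gres="gpu:$ngpu" bash
fi
```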
Non-Interactive sbatch command
This is a minimal example sbatch script.
The script includes directives (lines starting with #SBATCH) for configuring various aspects of the job.
#!/bin/bash
#SBATCH --job-name=my_job
#SBATCH --output=job_output.txt # Directs the standard output to 'job_output.txt'
#SBATCH --error=job_error.txt # Directs the standard error to 'job_error.txt'
#SBATCH --ntasks=1 # Requests 1 task
#SBATCH --time=10:00 # Sets a limit of 10 minutes for the job
#SBATCH --mem=100 # Requests 100 megabytes of memory
#SBATCH --partition=standard # Submits the job to the 'standard' partition
#SBATCH --cpus-per-task=4 # Requests 4 CPUs per task
#SBATCH --gres=gpu:2 # Requests 2 GPUs
# Load any modules or software if needed
# module load python/3.8
# Execute your application
./my_program
Simpler nvidia-docker
This shell script should be placed in your home directory on the server. It makes starting an nvidia-docker container simpler by requiring fewer parameters.
#!/bin/bash
# A simplified method to run an nvidia-docker container that's been set up
# to cover most needs of the DataLab team
print_usage() {
    echo "ndock: A simplified nvidia-docker command"
    echo "usage: ndock local_dir_mapping docker_image [port]"
    echo ""
    echo "local_dir_mapping: the directory to be mapped to /workspace inside"
    echo "                   the docker container"
    echo "docker_image:      the docker image tag or id to be instantiated"
    echo "port:              if the ssh port is to be exposed, the server port"
    echo "                   to expose it on"
}
# Actual code
if [ $# -eq 0 ]
then
    print_usage
    exit 1
else
    if test -z "$3"
    then
        echo "Running docker image $2 from $1"
    else
        echo "Running docker image $2 from $1, exposing ssh on port $3"
        PORT="-p 127.0.0.1:$3:22"
    fi
    # ${PORT} is intentionally unquoted so it splits into flag and argument
    nvidia-docker run --shm-size=16g -it ${PORT} -v "$1":/workspace "$2" bash
fi
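The conditional `PORT` flag above relies on leaving `${PORT}` unquoted so it splits into two arguments. A safer variant of the same pattern collects optional arguments in a bash array; in this sketch the `build_args` function and its inputs are illustrative only, and it merely echoes the command line it would run:

```shell
#!/bin/bash
# Optional-flag pattern using a bash array instead of word splitting.
build_args() {
    local port_arg=()
    if [ -n "$3" ]; then
        # Only add the -p flag when a port was actually given
        port_arg=(-p "127.0.0.1:$3:22")
    fi
    # An empty array expands to zero words, so no stray flag appears
    echo docker run "${port_arg[@]}" -v "$1:/workspace" "$2" bash
}

build_args /home/me/project my_image        # no port, so no -p flag
build_args /home/me/project my_image 2222   # port given, -p flag included
```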
Making a TensorBoard docker container
To make a TensorBoard docker container that runs from a specific logging folder, first create a Dockerfile as follows:
# Dockerfile that installs tensorboard
FROM amd64/python:3.6.9
WORKDIR /workspace
RUN chmod -R a+w . && \
pip install --upgrade pip && \
pip install --no-cache-dir tensorflow==1.14.0 tensorboard==1.14.0
EXPOSE 6006
ENTRYPOINT tensorboard --logdir /workspace
Build the image, then use the following script to start the docker container.
#!/bin/bash
if test -z "$1"
then
    echo "Error: No workspace mapping specified."
    exit 1
fi
echo "Running saty/tensorboard with $1 as workspace. Mapping port localhost:[YOUR_PORT] -> 0.0.0.0:6006"
srun --pty --ntasks=1 --cpus-per-task=4 --mem=8G docker run -it -v "$1":/workspace -p 127.0.0.1:[YOUR_PORT]:6006 --name tensorboard [IMAGE_NAME]
- [YOUR_PORT]: The port you would like to expose the docker container on.
- [IMAGE_NAME]: The name of the image built using the Dockerfile above.
Then forward YOUR_PORT to a port on your localhost using the following script:
#!/bin/bash
echo "Forwarding TensorBoard port from dgx server"
ssh -L 127.0.0.1:5000:0.0.0.0:[PORT] -N [USERNAME]@[SERVER].cloudlab.zhaw.ch;
- [PORT]: The port that was chosen in the previous script.
- [USERNAME]: Your username on the DGX servers.
- [SERVER]: The DGX server that is hosting the Docker container.
It’s easiest to just always reserve one port for TensorBoard so the variables above never need to be changed.
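If you do reserve a fixed port, the forward can also live in your SSH config so no script is needed. A sketch, where the host alias `dgx-tb` is illustrative and the bracketed placeholders mean the same as above:

```
# ~/.ssh/config
Host dgx-tb
    HostName [SERVER].cloudlab.zhaw.ch
    User [USERNAME]
    # Forward local port 5000 to TensorBoard's port on the server
    LocalForward 127.0.0.1:5000 0.0.0.0:[PORT]
```

With this entry in place, `ssh -N dgx-tb` builds the same tunnel as the script above.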