Frequently Used Commands for Admins

Warning

This section is dedicated to the administrators of the cluster. Most commands can only be executed with root permissions (i.e. after running sudo -i).

User Management

Query Users

  • members gpu-user: Show all users in the group gpu-user

  • members docker: Show all users in the group docker

  • getent passwd <user>: Query a user
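To check a single account against both groups at once, a small helper along the following lines may be convenient. This is only a sketch: the function name check_cluster_groups is hypothetical, and it assumes that id -nG prints the user's group names on one line, separated by spaces.

```shell
# Hypothetical helper: report whether a user belongs to both cluster groups.
check_cluster_groups() {
    local user="$1" memberships group
    # id -nG prints all group names of the user on one line.
    memberships=$(id -nG "$user") || return 1
    for group in gpu-user docker; do
        case " $memberships " in
            *" $group "*) ;;   # group found, check the next one
            *) echo "$user is missing group $group"; return 1 ;;
        esac
    done
    echo "$user is in gpu-user and docker"
}
```

Calling check_cluster_groups <user> returns a non-zero status as soon as one of the two groups is missing.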

Adding Users

On each DGX:

usermod -aG gpu-user <user name>
usermod -aG docker <user name>
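The two usermod calls can be wrapped so that both groups are always assigned together. The following is a minimal sketch; the function name add_cluster_user is an assumption and not part of the cluster tooling, and it still has to be run on each DGX.

```shell
# Hypothetical helper: add one user to both cluster groups on this DGX.
add_cluster_user() {
    local user="$1" group
    if [ -z "$user" ]; then
        echo "usage: add_cluster_user <user name>" >&2
        return 2
    fi
    for group in gpu-user docker; do
        # -a appends to the supplementary groups instead of replacing them.
        usermod -aG "$group" "$user" || return 1
    done
}
```

Stopping at the first failing usermod call avoids leaving the impression that the user was added to both groups when only one assignment succeeded.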

Afterwards, assign ports to the user.

Remove Users

On each DGX:

gpasswd -d <user name> gpu-user
gpasswd -d <user name> docker

On one DGX:

# The trailing slashes are required: without them rsync copies the empty
# directory into the target instead of deleting the target's contents.
mkdir /cluster/home/empty
rsync -av --delete --progress /cluster/home/empty/ /cluster/home/<user name>/
rm -rf /cluster/home/<user name> /cluster/home/empty

mkdir /cluster/data/empty
rsync -av --delete --progress /cluster/data/empty/ /cluster/data/<user name>/
rm -rf /cluster/data/<user name> /cluster/data/empty
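Because these commands run rm -rf on a path built from the user name, an empty or malformed name could remove far more than intended. The sketch below guards against that. Both function names are hypothetical, and using mktemp -d for the empty source directory (instead of a directory named empty next to the user directories) is an assumption made for this example.

```shell
# Hypothetical helper: print the directory that belongs to a user, or fail
# if the name is empty or could point outside the base directory.
user_dir() {
    local base="$1" user="$2"
    case "$user" in
        ""|.|..|*/*)
            echo "refusing unsafe user name: '$user'" >&2
            return 1 ;;
    esac
    printf '%s/%s\n' "$base" "$user"
}

# Hypothetical helper: empty the user's directory with rsync, then remove it.
purge_user_dir() {
    local target empty rc
    target=$(user_dir "$1" "$2") || return 1
    empty=$(mktemp -d) || return 1
    # Trailing slashes make rsync sync the directory contents, so --delete
    # removes everything below the target.
    rsync -a --delete "$empty/" "$target/" && rm -rf "$target"
    rc=$?
    rmdir "$empty"
    return "$rc"
}
```

A possible invocation would be purge_user_dir /cluster/home <user name>, followed by the same call with /cluster/data as the base directory.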

Afterwards, remove the user's port assignment.

Docker

Cleanup Containers and Images

docker rm -f $(docker ps -f status=exited -q)
docker rm -f $(docker ps -f status=dead -q)
docker rm -f $(docker ps -f status=created -q)
docker image prune -af

Alternatively, start containers with docker run --rm so that they are removed automatically when they exit.
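When no container matches a status, docker rm is called without arguments and reports an error. A loop that skips empty results avoids this. The following is a sketch under assumptions: the function name cleanup_containers is hypothetical, and the status list (exited, dead, created) is one possible choice of container states to clean up.

```shell
# Hypothetical helper: remove containers in the given states, skipping
# states for which no container exists.
cleanup_containers() {
    local status ids
    for status in exited dead created; do
        # -a includes stopped containers, -q prints only the container IDs.
        ids=$(docker ps -a -q -f "status=$status") || return 1
        if [ -n "$ids" ]; then
            # Word splitting is intended: one argument per container ID.
            docker rm -f $ids
        fi
    done
}
```

Unused images can still be removed afterwards with docker image prune -af.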

Ceph-Storage

  • ceph -s: Check Ceph status

  • ceph orch daemon restart osd.12: Restart OSD 12
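For scripted checks, the one-line output of ceph health is easier to evaluate than the full status report. The sketch below assumes that this output starts with HEALTH_OK on a healthy cluster; the function name ceph_is_healthy is hypothetical.

```shell
# Hypothetical helper: succeed only if the cluster reports HEALTH_OK.
ceph_is_healthy() {
    local health
    health=$(ceph health) || return 1
    # Always show the reported state, including warning details.
    echo "$health"
    case "$health" in
        HEALTH_OK*) return 0 ;;
        *) return 1 ;;
    esac
}
```

This could be used, for example, as ceph_is_healthy || ceph -s to print the detailed status only when something is wrong.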

Slurm

  • sinfo: Get Slurm info (find node name)

  • scontrol update NodeName=dgx State=RESUME: Reset node
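Both commands can be combined so that the node's previous state is printed before it is reset. This is a minimal sketch: the function name resume_node is hypothetical, and the node name dgx is only the example used above.

```shell
# Hypothetical helper: show a node's current state, then set it to RESUME.
resume_node() {
    local node="$1" state
    # -h drops the header, -N lists one line per node, %T prints the state.
    state=$(sinfo -h -N -n "$node" -o "%T" | head -n 1)
    if [ -z "$state" ]; then
        echo "node $node not found" >&2
        return 1
    fi
    echo "$node was $state"
    scontrol update NodeName="$node" State=RESUME
}
```

For the example node, the call would be resume_node dgx.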