Frequently used Commands for Admins
Warning
This section is intended for the administrators of the cluster. Most commands can only be executed with root permissions (i.e. after running sudo -i).
User Management
Query Users
members gpu-user
Show all users in the group gpu-user
members docker
Show all users in the group docker
getent passwd <user>
Query a user
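To confirm that a single user is in both groups, a small check like the following may help. This is a minimal sketch: the helper name in_group is hypothetical, and it relies only on id -nG, which prints a user's group names.

```shell
# Hypothetical helper: succeed if <user> belongs to <group>.
in_group() {
    user=$1
    group=$2
    # id -nG prints all group names of the user, separated by spaces.
    user_groups=$(id -nG "$user" 2>/dev/null) || return 2
    # Pad with spaces so that only whole group names match.
    case " $user_groups " in
        *" $group "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example (alice is a placeholder user name):
# for g in gpu-user docker; do
#     if in_group alice "$g"; then echo "alice is in $g"; else echo "alice is NOT in $g"; fi
# done
```

The function returns 2 when the user cannot be looked up at all, so a missing account can be told apart from a missing membership.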
Adding Users
On each DGX:
usermod -a -G gpu-user <user name>
usermod -a -G docker <user name>
Afterwards, assign Ports to the user.
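Because the group change has to be repeated on each DGX, it can be wrapped in a loop. The sketch below rests on assumptions that are not part of this page: the host names in DGX_HOSTS are placeholders, and root login over ssh is assumed to be possible. Adapt both to the actual cluster.

```shell
# Assumption: DGX_HOSTS lists the cluster nodes; these names are placeholders.
DGX_HOSTS="${DGX_HOSTS:-dgx1 dgx2}"

# Add <user> to gpu-user and docker on every host in DGX_HOSTS.
add_user_to_groups() {
    user=$1
    if [ -z "$user" ]; then
        echo "usage: add_user_to_groups <user name>" >&2
        return 1
    fi
    for host in $DGX_HOSTS; do
        # -a appends, so the user's existing supplementary groups are kept.
        # Assumption: root can log in via ssh on each host.
        ssh "root@$host" usermod -aG gpu-user,docker "$user" || return 1
    done
}

# Example:
# DGX_HOSTS="dgx1 dgx2" add_user_to_groups alice
```

The loop stops at the first host that fails, so a partial rollout is visible immediately rather than silently skipped.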
Remove Users
On each DGX:
gpasswd -d <user name> gpu-user
gpasswd -d <user name> docker
On one DGX:
mkdir /cluster/home/empty
rsync -av --delete --progress /cluster/home/empty/ /cluster/home/<user name>/
rm -rf /cluster/home/<user name> /cluster/home/empty
mkdir /cluster/data/empty
rsync -av --delete --progress /cluster/data/empty/ /cluster/data/<user name>/
rm -rf /cluster/data/<user name> /cluster/data/empty
Afterwards, remove port assignment.
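The removal commands are destructive: if the user name is empty, /cluster/home/<user name> collapses to /cluster/home/ and the whole directory would be wiped. A guard like the following can be placed in front of them. It is a minimal sketch; the helper name user_dir is hypothetical.

```shell
# Hypothetical guard: print <base>/<user> only if the user name is safe to delete.
user_dir() {
    base=$1
    user=$2
    case "$user" in
        # Reject empty names, "." and "..", and anything containing a slash.
        ""|.|..|*/*)
            echo "refusing unsafe user name: '$user'" >&2
            return 1
            ;;
    esac
    printf '%s/%s\n' "$base" "$user"
}

# Example (alice is a placeholder user name):
# dir=$(user_dir /cluster/home alice) && rm -rf -- "$dir"
```

Because rm only runs when user_dir succeeds, a typo or an unset variable aborts the deletion instead of widening it.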
Docker
Cleanup Containers and Images
docker rm -f $(docker ps -f status=exited -q)
docker rm -f $(docker ps -f status=dead -q)
docker rm -f $(docker ps -f status=created -q)
docker image prune -af
Or start containers with the --rm flag (docker run --rm) so that they are removed automatically when they exit.
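When no container matches a filter, the command substitution is empty and docker rm fails because it requires at least one argument. The sketch below skips empty results; the function name cleanup_containers is hypothetical.

```shell
# Remove containers in the given states, skipping states with no matches.
cleanup_containers() {
    for status in "$@"; do
        ids=$(docker ps -q -f "status=$status") || return 1
        if [ -n "$ids" ]; then
            # Intentionally unquoted: one argument per container ID.
            docker rm -f $ids
        fi
    done
}

# Example:
# cleanup_containers exited created
# docker image prune -af
```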
Ceph-Storage
ceph -s
Check Ceph Status
ceph orch daemon restart osd.12
Restart OSD 12
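After restarting an OSD it can be useful to wait until the cluster is healthy again. The following is a minimal sketch built on ceph health, whose output starts with HEALTH_OK, HEALTH_WARN, or HEALTH_ERR; the helper name ceph_healthy and the 10-second polling interval are assumptions.

```shell
# Succeed only if the cluster reports HEALTH_OK.
ceph_healthy() {
    status=$(ceph health 2>/dev/null) || return 2
    case "$status" in
        HEALTH_OK*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example:
# ceph orch daemon restart osd.12
# until ceph_healthy; do sleep 10; done
```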
Slurm
sinfo
Get Slurm Info (find node name)
scontrol update NodeName=dgx State=RESUME
Reset Node
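If several nodes are down or drained, the two commands can be combined. This is a sketch under assumptions: the function names are hypothetical, and resuming every affected node without first checking the reason (for example with sinfo -R) may hide a real hardware problem.

```shell
# List the names of nodes that are down or drained, one per line.
down_nodes() {
    # -h: no header, -N: one line per node, -t: filter by state, %N: node name.
    # A node in several partitions is listed more than once, hence sort -u.
    sinfo -h -N -t down,drain -o '%N' | sort -u
}

# Set every node reported by down_nodes back to RESUME.
resume_nodes() {
    for node in $(down_nodes); do
        scontrol update NodeName="$node" State=RESUME || return 1
    done
}

# Example:
# sinfo -R        # review why the nodes are down first
# resume_nodes
```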