Optimize STT training with Huggingface

To train the XLSR-1B model, which has 1 billion trainable parameters and corresponds to roughly a 3 GB checkpoint, you need an A100 GPU.

However, there are some tricks that you can use to make training a lot faster. I assume that you are already tuning the batch size and using gradient accumulation. If your model really does not fit into memory, use gradient checkpointing (I did not fully test this option, and it makes training slower).
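For reference, here is a minimal sketch of these settings via TrainingArguments (the output directory, batch size and accumulation values are placeholders, not my actual configuration):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="xlsr-1b-stt",        # placeholder output directory
    per_device_train_batch_size=4,   # as large as still fits into GPU memory
    gradient_accumulation_steps=8,   # effective batch size per GPU = 4 * 8
    gradient_checkpointing=True,     # saves memory, but slows training down
)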

This is a summary of the things that worked for me, taken from this guide: https://huggingface.co/docs/transformers/performance

Use Mixed Precision

Set the bf16 flag to true (only supported on Ampere hardware such as the A100 and newer). On older GPUs, set fp16 to true instead. This reduces the memory footprint, so you can increase the batch size -> faster training.
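In the Trainer this is just a flag on TrainingArguments; a minimal sketch (output directory is a placeholder):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="xlsr-1b-stt",  # placeholder
    bf16=True,                 # Ampere (A100) or newer
    # fp16=True,               # use this instead on older GPUs
)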

Optimizer Choice

Adam is very memory intensive. You can use Adafactor instead, but it does not converge as well. For speed, install NVIDIA/apex and use the "adamw_apex_fused" optimizer.

To reduce the memory footprint: use the 8-bit bitsandbytes (BNB) optimizer, which cuts the optimizer state memory by roughly 75%. Just set optim="adamw_bnb_8bit".
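A minimal sketch of selecting the optimizer via TrainingArguments (apex or bitsandbytes must be installed for the respective option; the output directory is a placeholder):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="xlsr-1b-stt",    # placeholder
    optim="adamw_apex_fused",    # fused AdamW from NVIDIA/apex, fast
    # optim="adamw_bnb_8bit",    # 8-bit AdamW from bitsandbytes, saves optimizer memory
)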

Use multiple GPUs

The Huggingface Trainer uses DataParallel (DP) right out of the gate when it sees multiple GPUs.

To make the most of multiple GPUs on the DGX A100, use

python -m torch.distributed.launch --nproc_per_node 2 train.py

which launches one process per GPU (DistributedDataParallel) and makes use of NVLink (I think).
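If you want to verify how the GPUs are connected (and whether NVLink is available between them), you can print the topology with

nvidia-smi topo -m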

Some notes on Docker

You can find my Dockerfile under cluster/data/deri/dockerfiles/cuda. It currently uses CUDA 11.6 and installs everything you need for training and inference for STT with Huggingface.