This project contains scripts/modules for distributed training
Based on current deep learning models, size of datasets, training methodologies; waiting for a model to train on a single GPU can be compared to waiting for an infant to take the first steps
Let's cut to the chase.
In this repository I try to simplify the concepts and (a few) implementations for distributed training
There are generally two ways to distributed computation across multiple devices:
- Data Parallelism: where a single model gets replicated on multiple devices or multiple machines. Each of them processes different batches of data, then they merge their results. There exist many variants of this setup, that differ in how the different model replicas merge results, in whether they stay in sync at every batch or whether they are more loosely coupled, etc.
- Model Parallelism: where different parts of a single model run on different devices, processing a single batch of data together. This works best with models that have a naturally-parallel architecture, such as models that feature multiple branches.
Before going ahead there are two cases we need to think about first:
- Single-host (or node/machine), multi-device training
- TF's MirroredStrategy
- Multi-worker distributed synchronous training or Multi-host, multi-device training
Certain concepts and implementations have been picked up raw, you find the same here
Pytorch has two ways to split models and data across multiple GPUs: nn.DataParallel
and nn.DistributedDataParallel
nn.DataParallel:
nn.DataParallel
uses one process to compute the model weights and then distribute them to each GPU during each batch, networking quickly becomes a bottle-neck and GPU utilization is often very low.
- Furthermore, `nn.DataParallel` requires that all the GPUs be on the same node and doesn't work with Apex for mixed-precision training
nn.DistributedDataParallel:
DistributedDataParallel
model reproduces the model onto each GPU- Machine has one processes per GPU, each model is controlled by each process
- GPUs can all be on the same node(machine) or across multiple nodes.
- Only gradients are passed between the processes/GPUs
- During training:
- Each process loads its own mini-batch from disk and passes it to its GPU
- Each GPU does its forward pass
- The gradients from all gpus are all-reduced across the GPUs
- (NOTE) Gradients for each layer do not depend on previoud layers, so the all-reduced is calculated concurrently with the backwards pass to further alleviate the networking bottleneck
- At the end of the backwards pass, every node has the average gradients (ensuring the model weights stay synchronized)
- During training:
Driver script:
On execution of main.py on either modules, the script launches a processs for every GPU.
Each process needs to know which gpu to use and where it ranks amongst all the processes that are running
Parameters:
- nodes: The total number of nodes or machines to be used (currently testes on 1)
- gpus: The number of GPUs on each node
- nr: The rank of the current node (machine) within all the nodes [0 to nodes-1]
Calculated:
- world_size: The total number of processes to run (gpus * nodes)
Training:
- Pytorch's
DistributedDataParallel
model reproduces the model onto each GPU nn.utils.data.DistributedSampler
makes sure that each process gets a different slice of training data while loading it- As the Sampler already distributes data differently, we can assign
shuffle=False
- As the Sampler already distributes data differently, we can assign
Drawbacks:
Here we try to train one model on multiple gpus, instead of one.
As elegant and simple the approach seems, there are few pitfalls.
Probelm:
- main GPU running out of memory
- As the first GPU will save all the different outputs for the different GPUs to calculate the loss, which inturn increase the memory usage
Solution:
- Reduce batch size
- Mixed precision training
Alternative to nn.DistributeDataParallel
is Nvidia's Apex
, for mixed precision
Mixed Precision: The use of lower-precision operations ( float16 and bfloat16 ) in a model during training to make it run faster and use less memory. Using mixed precision can improve performance by more than 3 times on modern GPUs and 60% on TPUs.[1].
The weight will remain at 32-bit, whereas other parameters like loss, gradients, etc will be computed at 16-bits. More of that here
- single node, multi-devices
$ python main.py --nodes 1 --gpus 2 --epochs 5
- multi-node, multi-devices
$ python main.py --nodes 2 --gpus 2 --epochs 5
Nvidia's Apex is the best alterative to traditional distributed training for the following reasons
- No
GPU running out of memory
errors, due to mixed precision training - Apex's
apex.parallel.DistributedDataParallel
internally allows only one GPU per process - Mixed-precision training required that the loss is scaled in order to prevent the gradients from underflowing. Apex does this automatically
- Replace nn.DistributedDataParallel with apex.parallel.DistributedDataParallel
-
Saving model/loading checkpoints
# Save checkpoint checkpoint = { 'model': model.state_dict(), 'optimizer': optimizer.state_dict(), 'amp': amp.state_dict() } torch.save(checkpoint, 'amp_checkpoint.pt') ... # Restore model = ... optimizer = ... checkpoint = torch.load('amp_checkpoint.pt') model, optimizer = amp.initialize(model, optimizer, opt_level=opt_level) model.load_state_dict(checkpoint['model']) optimizer.load_state_dict(checkpoint['optimizer']) amp.load_state_dict(checkpoint['amp']) # Continue training ...
- single node, multi-devices
$ python main.py --nodes 1 --gpus 2 --epochs 5
- multi-node, multi-devices
$ python main.py --nodes 2 --gpus 2 --epochs 5