Tutorial for Cluster Distributed Training using Slurm+Singularity
This tutorial covers how to set up a cluster of GPU instances on AWS and use Slurm to train neural networks with distributed data parallelism.
Create your own cluster
If you don’t have a cluster available, you can first create one on AWS.
ParallelCluster on AWS
We will primarily focus on using AWS ParallelCluster. The detailed documentation for setting up ParallelCluster can be found here.
First, make sure you have awscli installed.
$ pip install --upgrade awscli
$ aws configure
AWS will ask for your access key and secret access key. Here we choose ca-central-1 as the default region.
AWS Access Key ID [None]: <aws_access_key>
AWS Secret Access Key [None]: <aws_secret_access_key>
Default region name []: ca-central-1
Default output format []: json
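To quickly check that the credentials work before moving on, you can run a simple identity lookup (not required by the tutorial, just a sanity check):
$ aws sts get-caller-identity
This should print your account ID, user ARN, and user ID as JSON.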
If you don’t know where to get the AWS Access Key ID:
- First, sign in to the AWS Management Console and open the IAM console at https://console.aws.amazon.com/iam/. In the navigation pane, choose Users.
- Choose the name of the user whose access keys you want to create, and then choose the Security credentials tab.
- In the Access keys section, choose Create access key.
Next, install aws-parallelcluster.
$ pip install aws-parallelcluster
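This tutorial assumes the ParallelCluster 2.x command-line interface (matching the pcluster create -c ... syntax used below). If you want to confirm which release you got, a quick version check should work:
$ pcluster version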
Now, let’s configure the cluster!
$ pcluster configure
We choose us-west here. Try to choose a region that is close to you; servers far away from your location occasionally cause strange issues.
Allowed values for AWS Region ID:
1. ap-northeast-1
2. ap-northeast-2
3. ap-south-1
4. ap-southeast-1
5. ap-southeast-2
6. ca-central-1
7. eu-central-1
8. eu-north-1
9. eu-west-1
10. eu-west-2
11. eu-west-3
12. sa-east-1
13. us-east-1
14. us-east-2
15. us-west-1
16. us-west-2
AWS Region ID [ca-central-1]:
Then you need to choose the key pair. The key pair can be created in advance in the EC2 console under Network & Security, or from the command line as sketched after the prompt below.
Allowed values for EC2 Key Pair Name:
1. xx
2. xx
3. xx
EC2 Key Pair Name [None]:
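If you prefer the command line, the key pair can also be created with the AWS CLI; a rough sketch (the key name my-cluster-key is just an example):
aws ec2 create-key-pair --key-name my-cluster-key \
    --query 'KeyMaterial' --output text > my-cluster-key.pem
chmod 400 my-cluster-key.pem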
Next, we choose to use slurm as mentioned.
Allowed values for Scheduler:
1. sge
2. torque
3. slurm
4. awsbatch
Scheduler [slurm]:
And choose ubuntu1804 here.
Allowed values for Operating System:
1. alinux
2. alinux2
3. centos7
4. centos8
5. ubuntu1604
6. ubuntu1804
For demonstration, we use 2 compute instances of type g4dn.12xlarge (4 T4 GPUs each). The hourly price is about $7.824.
- Minimum cluster size: 2
- Maximum cluster size: 2
- Master instance type: t2.micro
- Compute instance type: g4dn.12xlarge
Once cluster creation has started (next step), you can check its status on CloudFormation: https://aws.amazon.com/cloudformation/.
Create the Cluster with config
Now, you have configured ParallelCluster. You can change the configuration by editing /home/username/.parallelcluster/config.
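For reference, this file is a plain INI-style config. With the choices made above, a ParallelCluster 2.x config looks roughly like the following (section names and keys can differ slightly between versions; the region, key name, and VPC/subnet IDs are placeholders):

[aws]
aws_region_name = <your-region>

[cluster default]
key_name = <your-key-name>
base_os = ubuntu1804
scheduler = slurm
master_instance_type = t2.micro
compute_instance_type = g4dn.12xlarge
initial_queue_size = 2
max_queue_size = 2
maintain_initial_size = true
vpc_settings = default

[vpc default]
vpc_id = vpc-xxxxxxxx
master_subnet_id = subnet-xxxxxxxx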
To create the cluster with the configuration:
pcluster create -c /home/username/.parallelcluster/config cluster-name
It normally takes about 15 minutes to complete.
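Besides the CloudFormation console, you can also poll the creation status from the command line; with ParallelCluster 2.x something like this should work:
pcluster status cluster-name
The status moves from CREATE_IN_PROGRESS to CREATE_COMPLETE once the cluster is ready.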
Test Slurm
Now, your cluster is created. Let's first ssh into the master node. Remember that you need to specify the key used when creating the cluster. Also, if you encounter an UNPROTECTED PRIVATE KEY FILE error, change the permissions of the key with chmod 600 key.pem.
pcluster ssh cluster-name -i key.pem
You should be able to log in now. If not, check each previous step carefully; cluster creation occasionally fails, and redoing the affected steps usually resolves the problem.
Once you are logged in, try the following to see whether all the nodes are working.
srun -N2 hostname
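If everything is healthy, this prints one hostname per compute node. You can also inspect the node states directly with sinfo (the partition and node names below are illustrative):
sinfo
# PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
# compute*     up   infinite      2   idle ip-10-0-0-[11-12]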
Singularity
This part covers how to use Singularity.
Install
The installation requires Go. It is best to follow the steps on the Singularity GitHub repo to install the latest version; also check its Install.md for details.
Note that we are in a cluster now, so we need to install Go on all the nodes in the cluster.
# environment variables exported on the master node are
# propagated to the compute nodes by srun
export VERSION=1.14.12 OS=linux ARCH=amd64
export NUM_NODES=2
srun -N${NUM_NODES} wget -O /tmp/go${VERSION}.${OS}-${ARCH}.tar.gz \
https://dl.google.com/go/go${VERSION}.${OS}-${ARCH}.tar.gz
srun -N${NUM_NODES} sudo tar -C /usr/local -xzf /tmp/go${VERSION}.${OS}-${ARCH}.tar.gz
You should also install Go on the master node itself.
export VERSION=1.14.12 OS=linux ARCH=amd64
export NUM_NODES=2
wget -O /tmp/go${VERSION}.${OS}-${ARCH}.tar.gz \
https://dl.google.com/go/go${VERSION}.${OS}-${ARCH}.tar.gz
sudo tar -C /usr/local -xzf /tmp/go${VERSION}.${OS}-${ARCH}.tar.gz
Finally, set up the Go environment. Since the home directory is shared across nodes, appending to ~/.bashrc takes effect on all of them.
echo 'export GOPATH=${HOME}/go' >> ~/.bashrc
echo 'export PATH=/usr/local/go/bin:${PATH}:${GOPATH}/bin' >> ~/.bashrc
source ~/.bashrc
Then check if go is installed for all the nodes.
srun -N${NUM_NODES} go version
This should output something like:
go version go1.14.12 linux/amd64
If the user's filesystem is shared across nodes (normally it is), things are a little bit tricky, because the compilation should only run on a single node to avoid errors. Also, Singularity must be built under the GOPATH, which limits other possible workarounds. Therefore, we will first compile on the master node. Note that here the user folders under /home are shared across nodes!
# On Master Node
export VERSION=3.6.4  # adjust this as necessary
mkdir -p $GOPATH/src/github.com/sylabs
cd $GOPATH/src/github.com/sylabs
# Download files
wget https://github.com/sylabs/singularity/releases/download/v${VERSION}/singularity-${VERSION}.tar.gz
tar -xzf singularity-${VERSION}.tar.gz
# Compile
cd ./singularity
./mconfig
make -C ./builddir
sudo make -C ./builddir install
Now, ./builddir should have the compiled files. Assuming that all your compute nodes have the same architecture as the master node, you can install the compiled Singularity on all of them directly!
Let’s first create a shell script.
cd
cat > install_singularity.sh << EOF
#!/bin/bash
cd ~/go/src/github.com/sylabs/singularity
sudo make -C ./builddir install
EOF
Run it on all the nodes.
chmod 755 install_singularity.sh
srun -N${NUM_NODES} ./install_singularity.sh
However, srun sometimes still produces errors here. If the error persists, loop over the nodes with ssh instead.
for host in $(sinfo -N -ho "%N");
do
ssh ${host} "bash -ic ./install_singularity.sh";
done
Test if singularity is installed:
srun -N${NUM_NODES} singularity version
Create a Container
Singularity containers can be created from Docker images. Therefore, we can build a Docker image first and later convert it into a Singularity container.
Here, we show an example of how to create a custom PyTorch Docker image.
First, prepare a Dockerfile. We will use the pre-built Docker containers from NGC (NVIDIA GPU Cloud) as a starting point.
FROM nvcr.io/nvidia/pytorch:20.11-py3
# Install necessary packages
RUN apt-get update && \
apt-get install -y vim git sudo wget
# Setup new user and sudo permission
RUN adduser --disabled-password --gecos '' ubuntu && \
adduser ubuntu sudo && \
echo '%sudo ALL=(ALL) NOPASSWD:ALL' >> /etc/sudoers
# Setup colorful prompt
RUN sed -i 's/#force_color_prompt=yes/force_color_prompt=yes/g' /home/ubuntu/.bashrc
# Get into the user directory
USER ubuntu
WORKDIR /home/ubuntu
# Initialize anaconda
RUN /opt/conda/bin/conda init
After you have created the Dockerfile, build it into a new image named torch, and run a container based on the new image.
docker build -t torch .
docker run --gpus all --rm torch nvidia-smi
You should be able to see the following:
=============
== PyTorch ==
=============
NVIDIA Release 20.11 (build 17345815)
PyTorch Version 1.8.0a0+17f8c32
# some other outputs
...
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   30C    P0    19W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            Off  | 00000000:00:1F.0 Off |                    0 |
| N/A   41C    P0    20W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Next, we can convert the image we just built into a Singularity image.
sudo singularity build torch.simg docker-daemon://torch:latest
After about 10 minutes, you should see a file torch.simg in the current working directory. Let's run it.
singularity shell --nv torch.simg
Finally, you have set up a Singularity container that is ready for cluster training.
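Before moving on, it is worth a quick check that PyTorch inside the container actually sees the GPU(s); this one-liner is just a sanity check, not part of the original steps:
singularity exec --nv torch.simg python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
It should print True followed by the number of visible GPUs.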
PyTorch Distributed Training
This part shows how distributed training works in PyTorch. Here is example code for running the MNIST classification task.
import os

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.distributed as dist
from torch.distributed import get_rank
from torchvision import datasets, transforms

# pylint:disable=no-member


class Hyperparams:
    random_seed = 123
    batch_size = 32
    test_batch_size = 32
    lr = 1e-3
    epochs = 10
    save_model = False
    log_interval = 100
    num_gpus_per_node = 2
    num_nodes = 1


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout(0.25)
        self.dropout2 = nn.Dropout(0.5)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.dropout2(x)
        x = self.fc2(x)
        output = F.log_softmax(x, dim=1)
        return output


def train(args, model, device, train_loader, optimizer, epoch):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        if Hyperparams.rank == 0 and batch_idx % args.log_interval == 0:
            print(
                'Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                    epoch, batch_idx * len(data), len(train_loader.dataset),
                    100. * batch_idx / len(train_loader), loss.item()
                )
            )


def test(model, device, test_loader):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            test_loss += F.nll_loss(output, target, reduction='sum').item()  # sum up batch loss
            pred = output.argmax(dim=1, keepdim=True)  # get the index of the max log-probability
            correct += pred.eq(target.view_as(pred)).sum().item()
    test_loss /= len(test_loader.dataset)
    print(
        '\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
            test_loss, correct, len(test_loader.dataset), 100. * correct / len(test_loader.dataset)
        )
    )


def main():
    # torch.manual_seed(Hyperparams.random_seed)
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    rank = int(os.environ.get("RANK", 0))
    Hyperparams.rank = rank
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    num_gpus_per_node = max(rank - local_rank, world_size)
    num_nodes = world_size // num_gpus_per_node
    if world_size > 1:
        torch.cuda.set_device(local_rank)
        dist.init_process_group(backend="nccl", init_method='env://')
        node_rank = rank // num_gpus_per_node
        print(f"Initialized Rank:{dist.get_rank()} on Node Rank:{node_rank}")

    device = torch.device("cuda")
    transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.1307, ), (0.3081, ))])

    if rank == 0:
        dataset1 = datasets.MNIST('../data', train=True, download=True, transform=transform)
        if world_size > 1:
            dist.barrier()
    else:
        if world_size > 1:
            dist.barrier()
        dataset1 = datasets.MNIST('../data', train=True, download=False, transform=transform)
    dataset2 = datasets.MNIST('../data', train=False, transform=transform)
    train_loader = torch.utils.data.DataLoader(dataset1, batch_size=Hyperparams.batch_size, shuffle=True)
    test_loader = torch.utils.data.DataLoader(dataset2, batch_size=Hyperparams.test_batch_size)

    model = Net().to(device)
    if world_size > 1:
        model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank], output_device=local_rank)
    optimizer = optim.AdamW(model.parameters(), lr=Hyperparams.lr)

    for epoch in range(1, Hyperparams.epochs + 1):
        train(Hyperparams, model, device, train_loader, optimizer, epoch)
        if rank == 0:
            test(model, device, test_loader)

    if rank == 0 and Hyperparams.save_model:
        torch.save(model.state_dict(), "mnist_cnn.pt")


if __name__ == "__main__":
    main()
We will take advantage of the utility script torch.distributed.launch that ships with PyTorch.
Let's first run the MNIST task on a single node with 2 GPUs, inside the Singularity container we built:
singularity exec --nv torch.simg python -m torch.distributed.launch --nproc_per_node=2 mnist.py
This should print a confirmation for each GPU process as it initializes.
Initialized Rank:0 on Node Rank:0
Initialized Rank:1 on Node Rank:0
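For reference, launching the same script across two machines by hand would look roughly like this; --nnodes, --node_rank, --master_addr, and --master_port are standard torch.distributed.launch options, and the address and port below are only placeholders (the next section moves this workflow onto the Slurm cluster):
# on node 0 (replace 10.0.0.11 with node 0's IP address)
singularity exec --nv torch.simg python -m torch.distributed.launch \
    --nproc_per_node=2 --nnodes=2 --node_rank=0 \
    --master_addr=10.0.0.11 --master_port=29500 mnist.py
# on node 1
singularity exec --nv torch.simg python -m torch.distributed.launch \
    --nproc_per_node=2 --nnodes=2 --node_rank=1 \
    --master_addr=10.0.0.11 --master_port=29500 mnist.py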
Slurm
Now, let’s go back to our AWS cluster.
pcluster ssh cluster-name -i key.pem
Create the Singularity container
You can upload the Singularity container you created locally. Alternatively, you can create it on the master node, which is preferable if your local internet connection is much slower than AWS's. Here are the steps:
sudo snap install docker
# make sure you have your Dockerfile
sudo docker build -t torch .
# create singularity image
sudo singularity build torch.simg docker-daemon://torch:latest
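Since /home is shared across the cluster, the image built on the master node is immediately visible to the compute nodes. As a quick sanity check (assuming the NUM_NODES variable from earlier is still set), you can verify that each compute node can run the container with GPU support:
srun -N${NUM_NODES} singularity exec --nv torch.simg nvidia-smi -L
Each node should list its four T4 GPUs.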