Tutorial for Cluster Distributed Training using Slurm+Singularity
This tutorial covers how to set up a cluster of GPU instances on AWS and use Slurm to train neural networks with distributed data parallelism.
Create your own cluster
If you don’t have a cluster available, you can first create one on AWS.
ParallelCluster on AWS
We will primarily focus on using AWS ParallelCluster. The detailed documentation for setting up ParallelCluster can be found here.
First, make sure you have awscli installed.
$ pip install --upgrade awscli
$ aws configure
AWS will ask for your access key and secret access key. Here we choose ca-central-1 as the default region.
AWS Access Key ID [None]: <aws_access_key>
AWS Secret Access Key [None]: <aws_secret_access_key>
Default region name []: ca-central-1
Default output format []: json
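To quickly check that the credentials work before moving on, you can run a simple identity lookup (not required by the tutorial, just a sanity check):
$ aws sts get-caller-identity
This should print your account ID, user ARN, and user ID as JSON.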
If you don’t know where to get the AWS Access Key ID:
- First, sign in to the AWS Management Console and open the IAM console at https://console.aws.amazon.com/iam/. In the navigation pane, choose Users.
- Choose the name of the user whose access keys you want to create, and then choose the Security credentials tab.
- In the Access keys section, choose Create access key.
Next, install aws-parallelcluster.
$ pip install aws-parallelcluster
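This tutorial assumes the ParallelCluster 2.x command-line interface (matching the pcluster create -c ... syntax used below). If you want to confirm which release you got, a quick version check should work:
$ pcluster version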
Now, let’s configure the cluster!
$ pcluster configure
We choose us-west here. Try to choose a region that is close to you; servers far away from your location occasionally cause strange issues.
Allowed values for AWS Region ID:
1. ap-northeast-1
2. ap-northeast-2
3. ap-south-1
4. ap-southeast-1
5. ap-southeast-2
6. ca-central-1
7. eu-central-1
8. eu-north-1
9. eu-west-1
10. eu-west-2
11. eu-west-3
12. sa-east-1
13. us-east-1
14. us-east-2
15. us-west-1
16. us-west-2
AWS Region ID [ca-central-1]:
Then you need to choose the key pair. The key pair can be created in advance in the EC2 console under Network & Security, or from the command line as sketched after the prompt below.
Allowed values for EC2 Key Pair Name:
1. xx
2. xx
3. xx
EC2 Key Pair Name [None]:
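If you prefer the command line, the key pair can also be created with the AWS CLI; a rough sketch (the key name my-cluster-key is just an example):
aws ec2 create-key-pair --key-name my-cluster-key \
    --query 'KeyMaterial' --output text > my-cluster-key.pem
chmod 400 my-cluster-key.pem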
Next, we choose to use slurm as mentioned.
Allowed values for Scheduler:
1. sge
2. torque
3. slurm
4. awsbatch
Scheduler [slurm]:
And choose ubuntu1804 here.
Allowed values for Operating System:
1. alinux
2. alinux2
3. centos7
4. centos8
5. ubuntu1604
6. ubuntu1804
For demonstration, we use 2 compute instances of type g4dn.12xlarge (4 T4 GPUs each). The hourly price is about $7.824.
- Minimum cluster size: 2
- Maximum cluster size: 2
- Master instance type: t2.micro
- Compute instance type: g4dn.12xlarge
Once cluster creation has started (next step), you can check its status on CloudFormation: https://aws.amazon.com/cloudformation/.
Create the Cluster with config
Now, you have configured ParallelCluster. You can change the configuration by editing /home/username/.parallelcluster/config.
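For reference, this file is a plain INI-style config. With the choices made above, a ParallelCluster 2.x config looks roughly like the following (section names and keys can differ slightly between versions; the region, key name, and VPC/subnet IDs are placeholders):

[aws]
aws_region_name = <your-region>

[cluster default]
key_name = <your-key-name>
base_os = ubuntu1804
scheduler = slurm
master_instance_type = t2.micro
compute_instance_type = g4dn.12xlarge
initial_queue_size = 2
max_queue_size = 2
maintain_initial_size = true
vpc_settings = default

[vpc default]
vpc_id = vpc-xxxxxxxx
master_subnet_id = subnet-xxxxxxxx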
To create the cluster with the configuration:
pcluster create -c /home/username/.parallelcluster/config cluster-name
It normally takes about 15 minutes to complete.
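Besides the CloudFormation console, you can also poll the creation status from the command line; with ParallelCluster 2.x something like this should work:
pcluster status cluster-name
The status moves from CREATE_IN_PROGRESS to CREATE_COMPLETE once the cluster is ready.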
Test Slurm
Now, your cluster is created. Let's first ssh into the master node. Remember that you need to specify the key used when creating the cluster. Also, if you encounter an UNPROTECTED PRIVATE KEY FILE error, change the permissions of the key with chmod 600 key.pem.
pcluster ssh cluster-name -i key.pem
You should be able to log in now. If not, check each previous step carefully; cluster creation occasionally fails, and redoing the affected steps usually resolves the problem.
Once you are logged in, try the following to see whether all the nodes are working.
srun -N2 hostname
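If everything is healthy, this prints one hostname per compute node. You can also inspect the node states directly with sinfo (the partition and node names below are illustrative):
sinfo
# PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
# compute*     up   infinite      2   idle ip-10-0-0-[11-12]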
Singularity
This part covers how to use Singularity.
Install
The installation requires Go. It is best to follow the steps on the Singularity GitHub repo to install the latest version; also check its Install.md for details.
Note that we are in a cluster now, so we need to install Go on all the nodes in the cluster.
# environment variables exported on the master node are
# propagated to the compute nodes by srun
export VERSION=1.14.12 OS=linux ARCH=amd64
export NUM_NODES=2
srun -N${NUM_NODES} wget -O /tmp/go${VERSION}.${OS}-${ARCH}.tar.gz \
https://dl.google.com/go/go${VERSION}.${OS}-${ARCH}.tar.gz
srun -N${NUM_NODES} sudo tar -C /usr/local -xzf /tmp/go${VERSION}.${OS}-${ARCH}.tar.gz
You should also install Go on the master node itself.
export VERSION=1.14.12 OS=linux ARCH=amd64
export NUM_NODES=2
wget -O /tmp/go${VERSION}.${OS}-${ARCH}.tar.gz \
https://dl.google.com/go/go${VERSION}.${OS}-${ARCH}.tar.gz
sudo tar -C /usr/local -xzf /tmp/go${VERSION}.${OS}-${ARCH}.tar.gz
Finally, set up the Go environment. Since the home directory is shared across nodes, appending to ~/.bashrc takes effect on all of them.
echo 'export GOPATH=${HOME}/go' >> ~/.bashrc
echo 'export PATH=/usr/local/go/bin:${PATH}:${GOPATH}/bin' >> ~/.bashrc
source ~/.bashrc
Then check if go is installed for all the nodes.
srun -N${NUM_NODES} go version
This should output something like:
go version go1.14.12 linux/amd64
If the user's filesystem is shared across nodes (normally it is), things are a little bit tricky, because the compilation should only run on a single node to avoid errors. Also, Singularity must be built under the GOPATH, which limits other possible workarounds. Therefore, we will first compile on the master node. Note that here the user folders under /home are shared across nodes!
# On Master Node
export VERSION=3.6.4  # adjust this as necessary
mkdir -p $GOPATH/src/github.com/sylabs
cd $GOPATH/src/github.com/sylabs
# Download files
wget https://github.com/sylabs/singularity/releases/download/v${VERSION}/singularity-${VERSION}.tar.gz
tar -xzf singularity-${VERSION}.tar.gz
# Compile
cd ./singularity
./mconfig
make -C ./builddir
sudo make -C ./builddir install
Now, ./builddir should have the compiled files. Assuming that all your compute nodes have the same architecture as the master node, you can install the compiled Singularity on all of them directly!
Let’s first create a shell script.
cd
cat > install_singularity.sh << EOF
#!/bin/bash
cd ~/go/src/github.com/sylabs/singularity
sudo make -C ./builddir install
EOF
Run it on all the nodes.
chmod 755 install_singularity.sh
srun -N${NUM_NODES} ./install_singularity.sh
However, srun sometimes still produces errors here. If the error persists, loop over the nodes with ssh instead.
for host in $(sinfo -N -ho "%N");
do
ssh ${host} "bash -ic ./install_singularity.sh";
done
Test if singularity is installed:
srun -N${NUM_NODES} singularity version
Create a Container
Singularity containers can be created from Docker images. Therefore, we can build a Docker image first and later convert it into a Singularity container.
Here, we show an example of how to create a custom PyTorch Docker image.
First, prepare a Dockerfile. We will use the pre-built Docker containers from NGC (NVIDIA GPU Cloud) as a starting point.
FROM nvcr.io/nvidia/pytorch:20.11-py3
# Install necessary packages
RUN apt-get update && \
apt-get install -y vim git sudo wget
# Setup new user and sudo permission
RUN adduser --disabled-password --gecos '' ubuntu && \
adduser ubuntu sudo && \
echo '%sudo ALL=(ALL) NOPASSWD:ALL' >> /etc/sudoers
# Setup colorful prompt
RUN sed -i 's/#force_color_prompt=yes/force_color_prompt=yes/g' /home/ubuntu/.bashrc
# Get into the user directory
USER ubuntu
WORKDIR /home/ubuntu
# Initialize anaconda
RUN /opt/conda/bin/conda init
After you have created the Dockerfile, build it into a new image named torch, and run a container based on the new image.
docker build -t torch .
docker run --gpus all --rm torch nvidia-smi
You should be able to see the following:
=============
== PyTorch ==
=============
NVIDIA Release 20.11 (build 17345815)
PyTorch Version 1.8.0a0+17f8c32
# some other outputs
...
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   30C    P0    19W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            Off  | 00000000:00:1F.0 Off |                    0 |
| N/A   41C    P0    20W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Next, we can convert the image we just built into a Singularity image.
sudo singularity build torch.simg docker-daemon://torch:latest
After about 10 minutes, you should see a file torch.simg in the current working directory. Let's run it.
singularity shell --nv torch.simg
Finally, you have set up a Singularity container that is ready for cluster training.
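Before moving on, it is worth a quick check that PyTorch inside the container actually sees the GPU(s); this one-liner is just a sanity check, not part of the original steps:
singularity exec --nv torch.simg python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
It should print True followed by the number of visible GPUs.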
PyTorch Distributed Training
This part shows how distributed training works in PyTorch. Here is example code for running the MNIST classification task.
import os

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.distributed as dist
from torch.distributed import get_rank
from torchvision import datasets, transforms

# pylint:disable=no-member


class Hyperparams:
    random_seed = 123
    batch_size = 32
    test_batch_size = 32
    lr = 1e-3
    epochs = 10
    save_model = False
    log_interval = 100
    num_gpus_per_node = 2
    num_nodes = 1


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout(0.25)
        self.dropout2 = nn.Dropout(0.5)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.dropout2(x)
        x = self.fc2(x)
        output = F.log_softmax(x, dim=1)
        return output


def train(args, model, device, train_loader, optimizer, epoch):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        if Hyperparams.rank == 0 and batch_idx % args.log_interval == 0:
            print(
                'Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                    epoch, batch_idx * len(data), len(train_loader.dataset),
                    100. * batch_idx / len(train_loader), loss.item()
                )
            )


def test(model, device, test_loader):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            test_loss += F.nll_loss(output, target, reduction='sum').item()  # sum up batch loss
            pred = output.argmax(dim=1, keepdim=True)  # get the index of the max log-probability
            correct += pred.eq(target.view_as(pred)).sum().item()
    test_loss /= len(test_loader.dataset)
    print(
        '\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
            test_loss, correct, len(test_loader.dataset), 100. * correct / len(test_loader.dataset)
        )
    )


def main():
    # torch.manual_seed(Hyperparams.random_seed)
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    rank = int(os.environ.get("RANK", 0))
    Hyperparams.rank = rank
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    num_gpus_per_node = max(rank - local_rank, world_size)
    num_nodes = world_size // num_gpus_per_node
    if world_size > 1:
        torch.cuda.set_device(local_rank)
        dist.init_process_group(backend="nccl", init_method='env://')
        node_rank = rank // num_gpus_per_node
        print(f"Initialized Rank:{dist.get_rank()} on Node Rank:{node_rank}")

    device = torch.device("cuda")
    transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.1307, ), (0.3081, ))])

    if rank == 0:
        dataset1 = datasets.MNIST('../data', train=True, download=True, transform=transform)
        if world_size > 1:
            dist.barrier()
    else:
        if world_size > 1:
            dist.barrier()
        dataset1 = datasets.MNIST('../data', train=True, download=False, transform=transform)
    dataset2 = datasets.MNIST('../data', train=False, transform=transform)
    train_loader = torch.utils.data.DataLoader(dataset1, batch_size=Hyperparams.batch_size, shuffle=True)
    test_loader = torch.utils.data.DataLoader(dataset2, batch_size=Hyperparams.test_batch_size)

    model = Net().to(device)
    if world_size > 1:
        model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank], output_device=local_rank)
    optimizer = optim.AdamW(model.parameters(), lr=Hyperparams.lr)

    for epoch in range(1, Hyperparams.epochs + 1):
        train(Hyperparams, model, device, train_loader, optimizer, epoch)
        if rank == 0:
            test(model, device, test_loader)

    if rank == 0 and Hyperparams.save_model:
        torch.save(model.state_dict(), "mnist_cnn.pt")


if __name__ == "__main__":
    main()
We will take advantage of the utility script torch.distributed.launch that ships with PyTorch.
Let's first run the MNIST task on a single node with 2 GPUs, inside the Singularity container we built:
singularity exec --nv torch.simg python -m torch.distributed.launch --nproc_per_node=2 mnist.py
This should print a confirmation for each GPU process as it initializes.
Initialized Rank:0 on Node Rank:0
Initialized Rank:1 on Node Rank:0
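For reference, launching the same script across two machines by hand would look roughly like this; --nnodes, --node_rank, --master_addr, and --master_port are standard torch.distributed.launch options, and the address and port below are only placeholders (the next section moves this workflow onto the Slurm cluster):
# on node 0 (replace 10.0.0.11 with node 0's IP address)
singularity exec --nv torch.simg python -m torch.distributed.launch \
    --nproc_per_node=2 --nnodes=2 --node_rank=0 \
    --master_addr=10.0.0.11 --master_port=29500 mnist.py
# on node 1
singularity exec --nv torch.simg python -m torch.distributed.launch \
    --nproc_per_node=2 --nnodes=2 --node_rank=1 \
    --master_addr=10.0.0.11 --master_port=29500 mnist.py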
Slurm
Now, let’s go back to our AWS cluster.
pcluster ssh cluster-name -i key.pem
Create the Singularity container
You can upload the Singularity container you created locally. Alternatively, you can create it on the master node, which is preferable if your local internet connection is much slower than AWS's. Here are the steps:
sudo snap install docker
# make sure you have your Dockerfile
sudo docker build -t torch .
# create singularity image
sudo singularity build torch.simg docker-daemon://torch:latest
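Since /home is shared across the cluster, the image built on the master node is immediately visible to the compute nodes. As a quick sanity check (assuming the NUM_NODES variable from earlier is still set), you can verify that each compute node can run the container with GPU support:
srun -N${NUM_NODES} singularity exec --nv torch.simg nvidia-smi -L
Each node should list its four T4 GPUs.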