The goal of this project is to benchmark the performance of PyTorch and MXNet on a recurrent-network-based model using single- and multi-GPU configurations. Findings are reported in the accompanying blog post.

Hardware and Software Configuration

For easy reproducibility, results are reported on an EC2 instance using a community Deep Learning image. The benchmarks were also run on Borealis AI machines to see how performance varies across hardware.

Amazon EC2

| Spec | Value |
|------|-------|
| AMI Id | ami-0a9fac70 |
| EC2 instance type | p3.2xlarge |
| GPU type | Tesla V100 |

Borealis AI machines

| Spec | Value |
|------|-------|
| Architecture | x86_64 |
| GPUs | Nvidia GeForce GTX 1080 Ti |
| System memory | 30 GB |
| CPU model | Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz |
| OS | Ubuntu 16 |
| Python | 3.6.3 |
| PyTorch | 0.3.0.post4 |
| MXNet | mxnet-cu90==1.0.0.post1 |
| Nvidia GPU driver | v384.9 |
| CUDA | release 9.0, V9.0.176 |


Amazon EC2

Launch an instance of your choice using the Deep Learning AMI (ami-9ba7c4e1). The results below were reported using a p3.8xlarge instance since we wanted to benchmark on the Volta architecture with multiple GPUs.

ssh <ec2-instance-ip>

## PYTORCH ##
cd pytorch
# open and check parameters are what you need
source activate pytorch_p36
./ example.pkl

## MXNET ##
cd mxnet
# open and check parameters are what you need
source activate mxnet_p36
./ example.pkl

Custom Setup

Set up a virtual environment and install the following libraries. Note that the commands below install the versions of the frameworks used in this benchmark.

Install PyTorch:

pip install

Install MXNet:

pip install mxnet-cu90==1.0.0.post1
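
After installing, the snippet below can be used as a quick sanity check that both frameworks import and can see a GPU (a minimal sketch; it assumes only that the two packages installed successfully):

```python
# Print framework versions and confirm GPU visibility.
import torch
import mxnet as mx

print("PyTorch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("MXNet:", mx.__version__)

# MXNet executes lazily, so force a small computation on GPU 0 to verify it works.
try:
    mx.nd.ones((1,), ctx=mx.gpu(0)).asnumpy()
    print("MXNet can use GPU 0")
except Exception as e:  # an MXNetError is raised if no GPU context is available
    print("MXNet GPU check failed:", e)
```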


Navigate to either the pytorch or mxnet directory and run as follows.

cd <framework-name>
./ example.pkl is a bash script that wraps the Python run script. To modify the arguments passed, open the script and change the values accordingly. The full list of supported arguments can be viewed with:

python <framework>/ --help

Model and hyperparameters

See <framework>/ for details on the model.
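
The exact architecture is defined in the repo; purely for orientation, the kind of recurrent model being benchmarked might be sketched in PyTorch as below (an illustrative LSTM classifier with placeholder sizes, not the actual model):

```python
import torch.nn as nn

class RecurrentClassifier(nn.Module):
    """Illustrative LSTM-based model; the real model lives in the framework directory."""
    def __init__(self, input_size=128, hidden_size=256, num_layers=2, num_classes=10):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        # x: (batch, seq_len, input_size)
        out, _ = self.lstm(x)
        return self.fc(out[:, -1, :])  # classify from the last time step
```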

Hyperparameters are as follows:

  • Optimizer: Adam
  • Learning rate: 0.001, with a factored schedule that reduces the learning rate by 10% every 30 epochs (see the sketch after this list).
  • Weight initialization: Xavier normal with gain=2, using both N<sub>in</sub> and N<sub>out</sub> (fan-in and fan-out).
  • Batch size: 256, 512, and 1024 were used to compare convergence rates.
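
A minimal PyTorch sketch of this configuration, using a placeholder model (reading "reduce by 10%" as multiplying the learning rate by 0.9; use gamma=0.1 instead if the intent is a 10x reduction, and note that the PyTorch 0.3 release used here names the init function nn.init.xavier_normal rather than xavier_normal_):

```python
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR

# Placeholder model; the benchmark uses the recurrent model defined in the repo.
model = nn.Sequential(nn.Linear(128, 256), nn.Tanh(), nn.Linear(256, 10))

def init_weights(module):
    # Xavier (Glorot) normal init with gain=2; it uses both fan-in and fan-out by definition.
    if isinstance(module, nn.Linear):
        nn.init.xavier_normal_(module.weight, gain=2)

model.apply(init_weights)

optimizer = optim.Adam(model.parameters(), lr=0.001)
scheduler = StepLR(optimizer, step_size=30, gamma=0.9)  # multiply lr by 0.9 every 30 epochs

for epoch in range(90):
    # ... run one training epoch here, with a DataLoader of batch size 256, 512, or 1024 ...
    scheduler.step()
```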


Please see the blog post mentioned above for the results.