Parry

An implementation of our approach for fence placement in C/C++ programs using declarative execution orders.

Questions / Issues

If you are attempting to use Parry and at any time have questions or issues, please do not hesitate to email us.

If you run into a bug with Parry, please consider filing a detailed issue so that we can track progress on a fix.

Usage

The parry command takes as input a procedure name, a source file (which must reside in its project directory so that clang can compile it to IR), a set of orders, a target architecture, and optionally an output file. If no output is specified, the updated source is dumped to stdout.

# NOTE <arch> can be in {tso, pso}
# NOTE '<orders>' should be a sequence of line number pairs
#      e.g.: two orders `1:load-2:store,10:store-20:store`
./bin/parry <procedure name> <source.c> <orders> <arch> [--output=<output_path> --debug=graph]
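
To make the <orders> format concrete, here is a small sketch (not Parry's actual parser; the function name and representation are only for illustration) that splits an orders string into pairs of (line, access-kind) endpoints:

def parse_orders(spec):
    """Split e.g. '1:load-2:store,10:store-20:store' into pairs of (line, kind)."""
    orders = []
    for order in spec.split(","):
        first, second = order.split("-")
        pair = []
        for endpoint in (first, second):
            line, kind = endpoint.split(":")
            pair.append((int(line), kind))  # kind is 'load' or 'store'
        orders.append(tuple(pair))
    return orders

print(parse_orders("1:load-2:store,10:store-20:store"))
# [((1, 'load'), (2, 'store')), ((10, 'store'), (20, 'store'))]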

Getting Started

The following instructions should work on a recent version of OS X or Linux as the host operating system. They have been tested on the most recent version of OS X, Yosemite.

Environment Setup

To get up and running with Parry quickly, we have provided a Vagrantfile and a basic provisioning script for use with Vagrant and Virtualbox. Additionally, we have posted a fully provisioned virtual machine image for use with Vagrant.

To begin:

  1. Download and install Virtualbox v4.3.26
  2. Download and install Vagrant v1.7.2
  3. git clone https://bitbucket.org/ucla-pls/parry.git, to clone the repository OR unzip from the download to some directory, hereafter $PROJECT_DIR.
  4. cd $PROJECT_DIR && vagrant up, to provision the virtual machine.

Note if you are part of the AEC for OOPSLA 2015, you will have already downloaded the project as a zip file from the download link in step three, and $PROJECT_DIR will be the unzipped folder.

The vagrant up command will set up the virtual machine by downloading a pre-provisioned virtual machine image (which is quite large, at ~2 GB). This step will take some time.

Once complete, you can enter the virtual machine to finalize the install:

  1. cd $PROJECT_DIR && vagrant ssh, to enter the virtual machine.
  2. cd /vagrant, to enter the shared folder of the project directory.
  3. virtualenv --system-site-packages env, for local Python dependencies
  4. source env/bin/activate, to load the Python virtual environment
  5. python setup.py develop, to install Python dependencies

Host Setup

Alternatively, if you're interested in benchmarking the performance of the tool or running the algorithm performance benchmarks, you can install the necessary dependencies on an Ubuntu 14.04 host by running the bin/setup.sh script. You will need super-user privileges. So, on an appropriate Ubuntu host:

  1. cd $PROJECT_DIR, to enter the project directory
  2. bash bin/setup.sh, to install dependencies
  3. virtualenv --system-site-packages env, for local Python dependencies
  4. source env/bin/activate, to load the Python virtual environment
  5. python setup.py develop, to install Python dependencies

Please see the Development Setup section below for more information about what will be installed by bin/setup.sh (there's a fair amount).

Example Use

At this point the tool is ready to use. It can be invoked in the following way from the project directory:

# procedure:           TxLoad
# source file:         examples/tl2/src/tl2.c
# order definitions:   1991:load-1993:load,1993:load-1995:load
# target architecture: arm
# output file:         scratch/tl2-TxLoad-fenced.c (source with fences inserted)
# compiler flags:      -DTL2_EAGER (passed to clang)
# debug output:        graph (print debug output with graphs)
./bin/parry \
  TxLoad \
  examples/tl2/src/tl2.c \
  1991:load-1993:load,1993:load-1995:load \
  arm \
  --output scratch/tl2-TxLoad-fenced.c \
  --ccflags -DTL2_EAGER \
  --debug=graph

This invocation, expanded from the run_tl2_eager function in bin/fence-examples.sh, inserts fences in the TxLoad procedure of the TL2 transactional memory algorithm compiled with -DTL2_EAGER.

The following snippet is taken from the TxLoad procedure of the TL2 algorithm, lines 1989-1997 of examples/tl2/src/tl2.c.

1989:    ...
1990:     vwLock cv = LDLOCK(LockFor);
1991: *   vwLock rdv = cv & ~LOCKBIT;
1992: |   MEMBARLDLD();
1993: * * Valu = LDNF(Addr);
1994:   | MEMBARLDLD();
1995:   * if ((rdv <= Self->rv && LDLOCK(LockFor) == rdv) ||
1996:        ((cv & LOCKBIT) && (((AVPair*)rdv)->Owner == Self)))
1997:     ...

The orders defined in the example invocation of Parry, 1991:load-1993:load and 1993:load-1995:load, correspond with the two pairs of *-* in the code sample above.

Step by Step

Here we detail how to reproduce the outcomes in the evaluation section of the paper. There are three main outcomes in the evaluation: fence insertions, tool performance/execution time, and fenced algorithm performance benchmarks.

Each of the outcomes has its own sub-section and each section includes information on generating the data and how to compare it with the results presented in the paper.

Fence Placements

All of the fence placements described in the evaluation section of the paper can be generated using the bin/fence-examples.sh script.

# from the host
cd $PROJECT_DIR
vagrant ssh

# once inside the virtual machine
cd /vagrant
source env/bin/activate
bash bin/fence-examples.sh

Fence Placement Data

To reproduce the fence insertion results for each algorithm as described in the paper, run the bin/fence-examples.sh script and then inspect each of the following x86/arm files.

algorithm       figure     arch.     file
Dekker          Figure 7   arm       examples/classic/compiled/dekker-arm.c
                           x86       examples/classic/compiled/dekker-x86.c
                           original  examples/classic/src/dekker.c
Lamport         Figure 7   arm       examples/classic/compiled/lamport-arm.c
                           x86       examples/classic/compiled/lamport-x86.c
                           original  examples/classic/src/lamport.c
Parker          Figure 7   arm       examples/classic/compiled/parker-arm.c
                           x86       examples/classic/compiled/parker-x86.c
                           original  examples/classic/src/parker.c
Peterson        Figure 7   arm       examples/classic/compiled/peterson-arm.c
                           x86       examples/classic/compiled/peterson-x86.c
                           original  examples/classic/src/peterson.c
TL2             Figure 11  arm       examples/tl2/compiled/tl2-arm.c
                           x86       examples/tl2/compiled/tl2-x86.c
                           original  examples/tl2/src/tl2.c
TL2 Eager       Figure 11  arm       examples/tl2/compiled/tl2-eager-arm.c
                           x86       examples/tl2/compiled/tl2-eager-x86.c
                           original  examples/tl2/src/tl2.c
RSTM ByteEager  Figure 14  arm       examples/rstm/compiled/byteager-arm.cpp
                           x86       examples/rstm/compiled/byteager-x86.cpp
                           original  examples/rstm/src/libstm/algs/byteager.cpp

Comparing Results

Each table diagram in the paper has the defined orders in the left column and the resulting fences by architecture in the other columns. Each cell entry can be read as <line-number>:<fence instruction>, where the line number is the line in the original source after which the fence assembly has been inserted.

For example, the second fence in the thr1 method of examples/classic/compiled/dekker-arm.c appears after line 14, but the table in Figure 7 of the paper says line 13. That is because the fence is inserted after line 13 of the original source file examples/classic/src/dekker.c, but in the Parry output there is another inserted fence above line 13, offsetting the result by one line.

It may be helpful to have the original source for each algorithm open with the altered source when examining the output. The original source location is noted in the table above for convenience.

Note the current result set is checked into the repository (if you cloned it).

Note In the paper, the orders in TL2's TxCommit procedure (Figure 11) are 1555 (store), 1625 (store) and 1596 (store), 1625 (store). Lines 1555 and 1596 are macros that expand to include the platform-specific compare-and-swap implemented in examples/tl2/src/platform_armv7.h and examples/tl2/src/platform_x86.h. As a result, the bin/fence-examples.sh script references the line numbers of the store in the compare-and-swap implementation in those files (51 and 58, respectively).

Understanding the Examples Script

The fence-examples.sh script includes a function for each algorithm detailed in the evaluation section of the paper. Each function contains a number of invocations of Parry, one for each procedure of the corresponding algorithm.

Notes:

  1. Each function in fence-examples.sh runs Parry once for each method of each algorithm in a particular order: bottom up by line number in the file. That is, the first method that each function operates on has the highest line number. This is because Parry inserts fences into its output, which offsets the line numbers below them. By proceeding from the bottom up we can simply reference the line numbers of the statements as they exist in the original source file.

  2. The first Parry invocation in each fence-examples.sh function operates on the original file, but subsequent invocations operate on the intermediate output of the previous invocation (see the sketch after this list).
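
As an illustration of this pattern, the following sketch (hypothetical procedure names, line numbers, and file paths; the real invocations live in bin/fence-examples.sh) chains Parry invocations bottom up, feeding each output to the next:

import subprocess

# Hypothetical procedures of one algorithm, ordered bottom up by line number so
# that fences inserted for one procedure do not shift the line numbers used for
# the next.
invocations = [
    ("ProcLate",  "200:store-210:load"),   # highest line numbers first
    ("ProcEarly", "100:load-110:load"),
]

src = "examples/foo/src/foo.c"  # hypothetical original source file
for i, (proc, orders) in enumerate(invocations):
    out = "scratch/foo-step%d.c" % i
    subprocess.check_call([
        "./bin/parry", proc, src, orders, "arm",
        "--output", out,
    ])
    src = out  # subsequent invocations operate on the previous output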

Experimentation

You may wish to alter some of the examples to see the effect this has on the output of the tool. Take the following example from TL2's TxCommit:

TODO: lay out how to experiment with the examples to get different results during fence insertion
TODO: point readers to the debug output, including graphs
TODO: describe the code layout, including algorithms and points of interest

Run-time Performance Benchmarks

When Parry is run with --debug=true or --debug=graph, the debug output will include execution times for various operations. For example, running bin/fence-examples.sh will produce output similar to the following when piped through grep:

$ bash bin/fence-examples.sh | grep took
...
DEBUG:  procedure compile_src took 0.002835 seconds
DEBUG:  procedure clean_dot_cfg took 0.00619 seconds
DEBUG:  procedure create_po_graph took 0.02639 seconds
DEBUG:  procedure full_elim took 0.117095 seconds
DEBUG:  procedure lp took 0.009916 seconds
DEBUG:  procedure opt took 0.266256 seconds
DEBUG:  procedure run took 0.308365 seconds

real    0m7.102s
user    0m2.570s
sys     0m2.045s
...

The final three lines (real, user, sys) are the time command output for the Parry invocation. The remaining lines come from decorated Python procedures that report the difference between Python time.clock() values.
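
As a rough sketch of how such a decorator might look (hypothetical names; the actual implementation is in the Parry sources), assuming Python 2.7's time.clock():

import time
from functools import wraps

def timed(fn):
    """Print a DEBUG line reporting how long the decorated procedure took."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.clock()
        result = fn(*args, **kwargs)
        elapsed = time.clock() - start
        print("DEBUG:  procedure %s took %s seconds" % (fn.__name__, elapsed))
        return result
    return wrapper

@timed
def create_po_graph():
    pass  # placeholder body for illustration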

Timing Data

The bin/timing.sh script automates running the bin/fence-examples.sh script 100 times and recording each of the timing values so that the data can be aggregated. The output can be found in the /tmp/parry-fence-times/aggregate/ folder, with one file per algorithm, architecture, and algorithm procedure.

$ bash bin/timing.sh
...
$ ls /tmp/parry-fence-times/aggregate/
peterson-arm-thr1.txt
peterson-arm-thr2.txt
peterson-x86-thr1.txt
peterson-x86-thr2.txt
...

The data in each file is arranged in a csv format suitable for import into a spreadsheet:

$ cat /tmp/parry-fence-times/aggregate/peterson-arm-thr1.txt
algo-arch,method,compile_src,clean_dot_cfg,create_po_graph,full_elim,lp,opt,run,real
peterson-arm,thr1,0.002566,0.005348,0.031988,0.103403,0.013987,0.26351,0.310815,6.698
peterson-arm,thr1,0.003316,0.006939,0.028912,0.106081,0.012475,0.250796,0.297872,6.695
...

Note: each line is arranged with the Python procedure times first and then the time command results at the end.

Comparing Results

The results in Figures 9 and 10 of the paper can be approximated (accounting for the difference in the host system performance profiles) by aggregating the data for the TL2 and ByteEager methods.

More precisely, the results in the paper take each of the main components of Parry (compile_src, create_po_graph, full_elim, lp) averaged over 100 runs for each procedure of the target algorithms (RSTM and TL2). The averaged values for each individual procedure are then summed to give the fence insertion time for the whole algorithm (all of its procedures).

Note Each component's time is taken as a percentage of the run procedure time value in the bin/timing.sh output, which we treat as the total time. run is the entry Python procedure for the tool (__main__ calls run), and using its time as the total allows us to avoid including Python startup overhead in the summed totals for each benchmark algorithm.
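
As a rough sketch (assuming the CSV layout shown above, and using the Peterson files listed earlier as stand-ins), the per-component means and their per-algorithm sums could be reproduced along these lines:

import csv
from collections import defaultdict

COMPONENTS = ["compile_src", "create_po_graph", "full_elim", "lp"]

def mean_component_times(path):
    """Average each component column over the runs recorded in one aggregate file."""
    sums, rows = defaultdict(float), 0
    with open(path) as f:
        for row in csv.DictReader(f):
            rows += 1
            for c in COMPONENTS:
                sums[c] += float(row[c])
    return dict((c, sums[c] / rows) for c in COMPONENTS)

# Per-procedure aggregate files for one algorithm/architecture pair.
files = [
    "/tmp/parry-fence-times/aggregate/peterson-arm-thr1.txt",
    "/tmp/parry-fence-times/aggregate/peterson-arm-thr2.txt",
]

# Sum the per-procedure means to characterize the whole algorithm.
totals = defaultdict(float)
for path in files:
    for c, v in mean_component_times(path).items():
        totals[c] += v
print(dict(totals))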

The data used in the pie charts in Figures 9 and 10 can be found in the following spreadsheets for reference along with the means of aggregation.

Figure 9, results using exponential order elimination:

  1. RSTM for arm
  2. RSTM for x86
  3. TL2 for arm
  4. TL2 for x86

Figure 10, results using linear order elimination:

  1. RSTM for arm
  2. RSTM for x86
  3. TL2 for arm
  4. TL2 for x86

Algorithm Performance Benchmarks

Reproducing the algorithm performance benchmarks requires two host systems: one x86 and one ARMv7. Compiling the algorithms and benchmarks for each platform and then testing them is involved, but much of it has been automated.

Each algorithm has a benchmark script that handles compiling and benchmarking the original version of the algorithm and the version with fences inserted by Parry. The results of running each benchmark from the STAMP benchmark suite (ssca, labyrinth, kmeans, vac low, vac high, and intruder) 100 times are automatically collected.

RSTM

To run the STAMP benchmarks for the RSTM ByteEager algorithm take the following steps on a 32 bit ARMv7 host:

  1. cd $PROJECT_DIR/examples/rstm/
  2. bash bin/benchmarks.sh STM32 arm, to build and execute the benchmarks.

The bin/benchmarks.sh script will run the RSTM cmake, build the benchmark binaries, and run the STAMP benchmarks for both the original version of the algorithm and the version with fences inserted by Parry.

Note STM32 is passed to the benchmark script because the binaries will be compiled with this filename postfix on a 32-bit platform. On a 64-bit host, STM64 must be used.

After the very long benchmark run for both versions of the algorithm the results for each benchmark will be available in two folders:

$PROJECT_DIR/examples/rstm/build/stamp-0.9.10/notes/benchmarks/rstm/arm/unopt/
$PROJECT_DIR/examples/rstm/build/stamp-0.9.10/notes/benchmarks/rstm/arm/opt/

The opt version is the version where Parry has inserted the fences, the unopt version is the original.

To run the benchmarks for x86 on either 32-bit or 64-bit platforms, replace arm in the original command with x86 and use the appropriate postfix (STM32 or STM64).

Note We have only run the benchmarks on 32 bit ARMv7 and 64 bit x86 machines.

TL2

To run the STAMP benchmarks for the TL2 algorithm take the following steps on a 32 bit ARMv7 host:

  1. cd $PROJECT_DIR/examples/tl2/
  2. bash bin/benchmarks.sh arm, to build and execute the benchmarks.

The bin/benchmarks.sh script will copy the STAMP benchmark to the build directory, build the TL2 algorithm, build the benchmark binaries, and run the STAMP benchmarks for both the original version of the algorithm and the version with fences inserted by Parry.

After the very long benchmark run for both versions of the algorithm the results for each benchmark will be available in two folders:

$PROJECT_DIR/examples/tl2/build/stamp/notes/benchmarks/tl2/arm/unopt/
$PROJECT_DIR/examples/tl2/build/stamp/notes/benchmarks/tl2/arm/opt/

The opt version is the version where Parry has inserted the fences, the unopt version is the original with the memory fence macros appropriately defined for each platform.

To run the benchmarks for x86 on either 32-bit or 64-bit platforms, replace arm in the original command with x86.

Note We have only run the benchmarks on 32 bit ARMv7 and 64 bit x86 machines.

Comparing Results

The results in Figures 11 and 14 are derived from the arithmetic mean of each benchmark for each algorithm on each architecture. The data that produced the graphs can be found in the following spreadsheets and is taken from the output files in the project subdirectories listed above:

  1. TL2
  2. RSTM

Development Setup

Parry was developed using graph-tool, Clang 3.3, Python 2.7, and Sage.

For development on Ubuntu-based systems you'll need to install the libclang-3.3-dev and graph-tool packages. If you are working on Ubuntu 14.04 you can simply run bin/setup.sh to add the universe repository and install the necessary system packages. The full list of packages that will be installed is as follows:

llvm-3.3
libclang-3.3-dev
python-pip
graphviz
python-graph-tool

NOTE that the graph-tool package comes from an unofficial package repository, which requires the line deb http://downloads.skewed.de/apt/trusty trusty universe in your sources list.

After that, the setup script will install the distribute and virtualenv packages using pip. Once the script is finished, or once you've finished installing the dependencies manually, enter the project directory and issue the following:

# NOTE the --system-site-packages is to enable access to the graph_tool library
virtualenv --system-site-packages env

# enter the virtual environment
source env/bin/activate

# setup the other package dependencies
python setup.py develop

Vagrant Setup

Alternatively, you can use Vagrant to automate the entire process on Windows, Mac, or Linux. Once you've installed Vagrant and Virtualbox, simply run vagrant up from the project directory. The base image download and provisioning may take a few minutes. Once that is completed:

vagrant ssh
cd /vagrant
source env/bin/activate
python setup.py develop

Sage

The provisioning script bin/setup.sh downloads and installs Sage. Unfortunately, the Sage mirrors don't keep old versions available for download, so the download in the provisioning script will eventually fail. The easiest fix is to visit:

http://files.sagemath.org/linux/64bit/index.html

Check the latest version there, and then edit bin/setup.sh, changing 6.6 to the latest version in the following place:

dir=sage-6.6-x86_64-Linux

Tests

To run the test suite, set up the virtual environment as above and, from the Parry project directory, run:

python test.py

Linear Programming

The lp subdirectory contains some scratch work for finding counterexamples and verifying properties of constraint matrices for the characterization of multicut as a linear programming problem.

In particular, lp/if.py performs a brute-force search for matrices that are not totally unimodular, starting from permutations of a simple cascading set of orders (defined as a graph in the file) with if "jumps" inserted. Columns are considered pairwise, and each permutation of rows is divided between replacing that row's pair of columns with ones or zeros. The goal is to find a matrix that is not TU by simulating the addition of if statements (not if/then/else statements) at a given position with respect to the orders.
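
For reference, a matrix is totally unimodular (TU) when every square submatrix has determinant in {-1, 0, 1}. A minimal brute-force check along those lines (a sketch only, not the actual lp/if.py code) could look like:

from itertools import combinations
from sympy import Matrix

def is_totally_unimodular(M):
    """Brute force: every square submatrix must have determinant in {-1, 0, 1}."""
    rows, cols = M.shape
    for k in range(1, min(rows, cols) + 1):
        for r in combinations(range(rows), k):
            for c in combinations(range(cols), k):
                if M.extract(list(r), list(c)).det() not in (-1, 0, 1):
                    return False
    return True

# The classic odd-cycle example below is not TU (its determinant is 2).
M = Matrix([[1, 1, 0],
            [0, 1, 1],
            [1, 0, 1]])
print(is_totally_unimodular(M))  # False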

NOTE that while lp/if.py relies solely on SymPy's built-in matrix library for simplicity's sake, lp/tu.py relies on Sage's mixed integer linear programming library.

Resources

  1. libclang docs
  2. Intro blog post - note the TranslationUnit object.
  3. graph tool
  4. llvmpy

TODO

  1. Remove Sage dependency. There are ILP solvers that can be used as libraries. The current implementation relies on the Sage runtime.