Profiling PaRSEC Applications with PAPI

While running your PaRSEC-based application, you may be interested in measuring hardware performance information to identify potential improvements to your code or to gather some benchmarks to assess your program's performance. Within PaRSEC, there is a PINS module that utilizes the Performance Application Programming Interface (PAPI) to do just that. PAPI is a great tool for getting hardware counter information during a program's execution and incurs minimal overhead.

In order to use this feature you must install PAPI (available from the PAPI website), and you need to set the following CMake options (a sample invocation is shown after the list):

  • PARSEC_PROF_TRACE=ON
  • PINS_ENABLE=ON
  • PAPI_INCLUDE_DIR=[path/to/papi/include]
    • Note: The brackets are not included in the actual input. A sample path would look like: /opt/include/
  • PAPI_LIBRARY=[/path/to/papi/lib/libpapi.a]
    • Note: The brackets are not included in the actual input. A sample path would look like: /opt/lib/libpapi.a or /opt/lib/libpapi.so depending on whether you want to link papi statically or dynamically.
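
For reference, a CMake invocation with these options might look like the following; the PaRSEC source path is a placeholder, and the /opt prefix simply matches the sample paths above:

cmake /path/to/parsec/source \
  -DPARSEC_PROF_TRACE=ON \
  -DPINS_ENABLE=ON \
  -DPAPI_INCLUDE_DIR=/opt/include/ \
  -DPAPI_LIBRARY=/opt/lib/libpapi.so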

Once all of these things have been set, you must specify which PAPI events you would like to monitor and under what scheme you would like them to be monitored. To do this, go to your home directory and create a .parsec directory like so:

cd ~
mkdir .parsec

Now, we need to navigate to the .parsec directory and create an MCA parameters config file called 'mca-params.conf' in which we will be putting our specifications. So, use your favorite editor (mine's emacs) to create and edit this file:

cd ~/.parsec
emacs mca-params.conf

For this demo, we will name our profile file 'demo', and we will be monitoring L1 instruction cache misses on all cores and sockets after each task and L1 data cache misses on Socket 0's first core every .01 seconds. So, our mca-params.conf file will look like:

profile_filename = demo
mca_pins = papi
pins_papi = 1
pins_papi_event = S*:C*:F1:PAPI_L1_ICM,S0:C0:F.01s:PAPI_L1_DCM

The profile_filename field sets the base name for the profiling file(s) that are output, the mca_pins field specifies that we are using the papi PINS module, the pins_papi field enables the papi PINS component, and the pins_papi_event field lists the PAPI events to monitor along with their socket, core, and frequency. The pins_papi_event field is of the form:

S[socket number]:C[core number]:F[frequency]:[PAPI event name],

This essentially means that the parts of an event's configuration are delimited by a ':' character and the events themselves are delimited by a ',' character. Note that the socket, core, and frequency specifications must be preceded by 'S', 'C', and 'F' characters respectively. The socket and core numbers are the same as the numbers returned by hwloc for the socket or core you are interested in. The frequency specification can take two different forms: task-based and time-based. For a task-based frequency, you simply specify the integer number of tasks after which you would like measurements to take place, so specifying "F1" yields a measurement after every task. For a time-based frequency, you specify a number of time units (possibly a decimal number) followed by the short name for that time unit. Accepted short names are 's' for seconds, 'ms' for milliseconds, 'us' for microseconds, and 'ns' for nanoseconds. Note that no matter what units you specify for the time, the number you enter is converted to the default time units for your system (in my case nanoseconds).
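
For instance, assuming the PAPI_TOT_INS and PAPI_L2_TCM presets are available on your machine (they are standard PAPI preset names, but not every platform supports them), the following line would count total instructions on every socket and core after every 10 tasks, and L2 total cache misses on socket 1, core 8, every 5 milliseconds:

pins_papi_event = S*:C*:F10:PAPI_TOT_INS,S1:C8:F5ms:PAPI_L2_TCM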

To figure out what PAPI events are supported on your system along with their names and descriptions, you can navigate to your PAPI installation's bin directory and run the papi_native_avail utility. You can get a less detailed listing with the papi_avail utility.
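
For example, assuming PAPI was installed under the /opt prefix used in the sample paths above, you could check whether the two presets used in this demo are available like so:

cd /opt/bin
./papi_avail | grep -e PAPI_L1_ICM -e PAPI_L1_DCM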

Now that we've set up our config file, we can run our application to see how it works. For this demo, we'll use the testing_dpotrf (Cholesky factorization) executable found in your build directory under dplasma/testing/. So, after navigating to this directory we use the following command:

./testing_dpotrf -t 100 -N 2000

This command will run a Cholesky factorization using all available cores on the current compute node, with a tile size of 100 and a random 2000x2000 matrix. The output will look something like this:

[dancer.icl.utk.edu:30386] Using PAPI version 5040000
[dancer.icl.utk.edu:30386] Could not find a proposed time unit equivalent for nanosecond. Fall back to nanosecond
[dancer.icl.utk.edu:30386] No units found.  Assuming task-based frequency: 1
[dancer.icl.utk.edu:30386] Valid PAPI event PAPI_L1_ICM on socket -1 (-1 for all), core -1 (-1 for all) with frequency 1 tasks
[dancer.icl.utk.edu:30386] Valid PAPI event PAPI_L1_DCM on socket 0 (-1 for all), core 0 (-1 for all) with frequency 10000000.000000 nanosecond
Found 1 components, activated 1
[dancer.icl.utk.edu:30386] PAPI event PAPI_L1_ICM{int64_t}; core -1 socket -1 frequency 1 tasks enabled
[dancer.icl.utk.edu:30386] PAPI event PAPI_L1_ICM{int64_t}; core -1 socket -1 frequency 1 tasks enabled
[dancer.icl.utk.edu:30386] PAPI event PAPI_L1_DCM{int64_t}; core 0 socket 0 frequency 10000000.000000 ns enabled
[dancer.icl.utk.edu:30386] PAPI event PAPI_L1_ICM{int64_t}; core -1 socket -1 frequency 1 tasks enabled
[dancer.icl.utk.edu:30386] PAPI event PAPI_L1_ICM{int64_t}; core -1 socket -1 frequency 1 tasks enabled
[dancer.icl.utk.edu:30386] PAPI event PAPI_L1_ICM{int64_t}; core -1 socket -1 frequency 1 tasks enabled
[dancer.icl.utk.edu:30386] PAPI event PAPI_L1_ICM{int64_t}; core -1 socket -1 frequency 1 tasks enabled
[dancer.icl.utk.edu:30386] PAPI event PAPI_L1_ICM{int64_t}; core -1 socket -1 frequency 1 tasks enabled
[dancer.icl.utk.edu:30386] PAPI event PAPI_L1_ICM{int64_t}; core -1 socket -1 frequency 1 tasks enabled
[****] TIME(s)      0.06466 : dpotrf    PxQ=   1 1   NB=  100 N=    2000 :      41.271038 gflops

Now that you have completed your run, there will be a file in your current directory with a name that looks like: demo-0.prof-[random characters]. You can write a python module to extract the information from this file by following the PaRSEC profiling instructions. There is also a section below that describes a simple python module you can use to get some basic plots of the PAPI counter values over time.

PAPI and SDE (Software Defined Events)

PaRSEC exposes a series of internal counters as PAPI events using the Software Defined Events architecture.

To access these counters, the final program needs to be linked with a recent version of libpapi that features the SDE interface.

These PAPI events appear with the prefix sde:::PARSEC.

If papi_native_avail is linked with libparsec, it will show all available events exposed by PaRSEC. Alternatively, an application that links both PaRSEC and PAPI (with SDE enabled) can list the events by iterating over them with PAPI_get_event_info.
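
For example, once papi_native_avail is able to load libparsec (preloading the library with LD_PRELOAD is one way to do this without relinking; the library path below is an assumption and should be adjusted to your installation), you can filter its output for the PaRSEC events:

LD_PRELOAD=/path/to/libparsec.so ./papi_native_avail | grep "sde:::PARSEC"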

To record these events in a PINS trace, one just needs to list the event in the pins_papi_event MCA parameter. For example:

$shell > cat ~/.parsec/mca-params.conf
profile_filename = /tmp/parsec-pins-profile
pins_papi_event = sde:::PARSEC::SCHEDULER::PENDING_TASKS
mca_pins = papi
pins_papi = 1
$shell >

This will trace the number of pending tasks (tasks that are ready to execute but are waiting in one of the queues defined by the scheduler for this node) at the entry and termination of each task's execution.

Virtual Process Mapping

In some cases you may want to perform your execution on fewer cores than the maximum available on a node. To do this, you simply need to use the -c option for your run to specify the number of cores, so if you wanted to perform the same example as above, but with 4 cores, you would use the following command line:

./testing_dpotrf -c 4 -t 100 -N 2000

When you specify 'n' cores, the first 'n' cores on the current compute node (as numbered by hwloc) will be used. In some instances you may want to choose the specific cores used for the run, in which case you will need to define a virtual process map file. This file specifies a series of virtual processes, each of which can contain multiple cores, and these cores will be used for the execution instead of the first 'n' cores. To create one of these files, define each virtual process on its own line, with the line specifying the number of cores in the virtual process followed by the core ids (as reported by hwloc; see the note at the end of this section on how to check your node's numbering) of the cores you want to use, separated by commas. The format looks like this:

:[number_of_cores]:[first_core_id],[second_core_id],[etc...]
:[number_of_cores]:[first_core_id],[second_core_id],[etc...]
[etc...]

For our example with 4 cores, let's assume we have a compute node with two sockets, each with an 8-core processor. Let's also assume that the cores are assigned ids 0 to 7 on the first processor and 8 to 15 on the second. Say we want to use the two lowest-id cores on each processor for our sample run, which would be cores 0, 1, 8, and 9. If we group all of these cores into one virtual process, our virtual process file, called pmap for this example, would look like this:

:4:0,1,8,9

If we decided that we want the cores from each processor to be in their own virtual processes, we could change our pmap file to look like this:

:2:0,1
:2:8,9

Now that we have specified our virtual process map file, we can change our command line and run our test on the 4 cores we selected. Assuming our pmap file is in the same directory as our test executable, the command line would look like this:

./testing_dpotrf -c 4 -t 100 -N 2000 -- -V file:pmap
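
If you are unsure how hwloc numbers the sockets and cores on your node, you can print the machine topology with the hwloc command-line tools (this assumes they are installed and in your PATH):

hwloc-ls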

plot_scatter_papi.py

This python module can be found in [path_to_PaRSEC_home_directory]/tools/profiling/python/examples. It takes as input a binary profile file (or multiple binary profile files from the same run, in the case of a distributed run) or a converted HDF5 file, and it outputs a combined graph of all of the PAPI counter values for each counter over time, plus a separate graph for each counter broken down by the socket, core, and frequency that counter was using. All of these graphs are .png images made with the matplotlib python library.

Usage of this module is extremely simple: in addition to the list of file names, there are only two optional command line flags, -t and -u. The -t option specifies the time units used by PaRSEC on your system, which can be nanoseconds, microseconds, or cycles, given as 'ns', 'us', or 'c' respectively. The -u option specifies the time units you wish to use for the output plots, which can be nanoseconds, microseconds, milliseconds, or seconds, given as 'ns', 'us', 'ms', or 's' respectively.

So, if you want to use this module for the example above, you could rename your binary file from demo-0.prof-[random characters] to something easier to type, like demo.prof, though this isn't required. To make your life easier, you can also copy the binary file into the same directory as the python module. Let's assume that the system we are working on uses nanoseconds as its time units (as seen in the sample output above) and that we want our plots to use seconds. In this case, we would use the following command line:

python plot_scatter_papi.py -t ns -u s demo.prof

This would produce output like the following on the command line:

Found units: 'ns'
Graphs will display time in seconds.
Conversion Factor: 1e-09

Importing matplotlib took 0.34679889679 seconds.

Converting binary trace to the ParSEC Trace Tables format...
Converting ['demo.prof']
Event TASK_MEMORY conv <size{int64_t}> length 8

Event Device delegate conv <> length 0

Event PINS_PAPI_S0_C1_F1t_0 conv <PAPI_L1_ICM{int64_t};> length 8

Event PINS_PAPI_S1_C8_F1t_0 conv <PAPI_L1_ICM{int64_t};> length 8

Event PINS_PAPI_S1_C9_F1t_0 conv <PAPI_L1_ICM{int64_t};> length 8

Event PINS_PAPI_S0_C0_F1t_0 conv <PAPI_L1_ICM{int64_t};> length 8

Event PINS_PAPI_S0_C0_F10000000.000000ns_0 conv <PAPI_L1_DCM{int64_t};> length 8

Event MPI_ACTIVATE conv <src{int32_t};dst{int32_t};tid{int64_t};hid{int32_t};fid{int8_t};pad{char[3]}> length 24

Event MPI_DATA_CTL conv <src{int32_t};dst{int32_t};tid{int64_t};hid{int32_t};fid{int8_t};pad{char[3]}> length 24

Event MPI_DATA_PLD_SND conv <src{int32_t};dst{int32_t};tid{int64_t};hid{int32_t};fid{int8_t};pad{char[3]}> length 24

Event MPI_DATA_PLD_RCV conv <src{int32_t};dst{int32_t};tid{int64_t};hid{int32_t};fid{int8_t};pad{char[3]}> length 24

Event potrf_dgemm conv <ddesc_unique_key{uint64_t};ddesc_data_id{uint32_t};ddessc_padding{uint32_t};k{int32_t};m{int32_t};n{int32_t}> length 28

Event potrf_dsyrk conv <ddesc_unique_key{uint64_t};ddesc_data_id{uint32_t};ddessc_padding{uint32_t};k{int32_t};n{int32_t}> length 24

Event potrf_dtrsm conv <ddesc_unique_key{uint64_t};ddesc_data_id{uint32_t};ddessc_padding{uint32_t};k{int32_t};n{int32_t}> length 24

Event potrf_dpotrf conv <ddesc_unique_key{uint64_t};ddesc_data_id{uint32_t};ddessc_padding{uint32_t};k{int32_t}> length 20

....<<<<
Parsing the PBT files took 0.122670888901 seconds, which is 0.0306677222252 seconds per thread.
Then we concatenate the event DataFrames....
   events DataFrame concatenation time: 0.00572299957275
Constructed additional structures in 0.00748085975647 seconds.
Generate trace to HDF5 format in 1.19245505333 seconds.
Loading the HDFed trace...
The load took 2.32339406013 seconds.

Populating user-defined lists...
Populating the lists took 0.00188398361206 seconds.

Colors Needed: 5
Plotting all PAPI counters together...
Saving plot as all_papi_counters.png...
Plotting and saving took 0.246990919113 seconds.
Plotting data for: PAPI_L1_DCM...
Saving plot as PAPI_L1_DCM.png...
Plotting and saving took 0.307183027267 seconds.
Plotting data for: PAPI_L1_ICM...
Saving plot as PAPI_L1_ICM.png...
Plotting and saving took 0.438541173935 seconds.
Total Time: 3.93275809288 seconds

The python module will also produce three image files of the plots, named all_papi_counters.png, PAPI_L1_DCM.png, and PAPI_L1_ICM.png. These plots can be seen below.

all_papi_counters.png

PAPI_L1_DCM.png

PAPI_L1_ICM.png

You may notice that the PAPI_L1_DCM counter is not sampled exactly every .01 seconds as we specified, and that the PAPI_L1_ICM counters for some of the cores have a break period in them even though they are supposed to be sampled after every task. These are artefacts of the scheduling of the tasks and the length of the execution: this is an extremely short test, so scheduling effects become much more pronounced. You may also notice that the gap for core 0 is much larger than for the other cores, because the control flow is handled by that core; this also explains why the PAPI_L1_DCM measurements don't happen every .01 seconds, since there aren't always tasks scheduled on that core.
