This page provides a general overview of the PAPI library with a discussion of all major features and functionality.
- Intended Audience
- C and Fortran Calling Interfaces
- Components
- Example Code
- Events
- PAPI Counter Interfaces
- PAPI Timers
- PAPI System Information
- Advanced PAPI Features
- PAPI Error Handling
- PAPI Utilities
Intended Audience
Welcome to PAPI, the Performance API. This overview will provide you with a discussion of how to use the different components and functions of PAPI. The intended audience includes application developers, performance tool writers, and curious students of performance who wish to access performance data to tune and model application performance. You should have some level of familiarity with C and Fortran, and have a basic knowledge of computer architecture and programming.
C and Fortran Calling Interfaces
PAPI is written in C. The function calls in the C interface are declared in the header file papi.h and take the following form:
<returned data type> PAPI_function_name(arg1, arg2, …)
The function calls in the Fortran interface are declared in the header file fpapi.h and take the following form:
PAPIF_function_name(arg1, arg2, …, check)
As you can see, the C function calls have equivalent Fortran function calls (PAPI_<call> becomes PAPIF_<call>). This is generally true for most function calls, except for the functions that return C pointers to structures, such as PAPI_get_opt and PAPI_get_executable_info, which are either not implemented in the Fortran interface, or implemented with different calling semantics. In the function calls of the Fortran interface, the return code of the corresponding C routine is returned in the argument, check.
For most architectures, the following relation holds between the pseudo-types listed and Fortran variable types:
Pseudo-type | Fortran type | Description |
---|---|---|
C_INT | INTEGER | Default Integer type |
C_FLOAT | REAL | Default Real type |
C_LONG_LONG | INTEGER*8 | Extended size integer |
C_STRING | CHARACTER*(PAPI_MAX_STR_LEN) | Fortran string |
C_INT FUNCTION | EXTERNAL INTEGER FUNCTION | Fortran function returning integer result |
Array arguments must be of sufficient size to hold the input/output from/to the subroutine for predictable behavior. The array length is indicated either by the accompanying argument or by internal PAPI definitions.
Subroutines accepting C_STRING as an argument are on most implementations capable of reading the character string length as provided by Fortran. In these implementations, the string is truncated or space padded as necessary. For other implementations, the length of the character array is assumed to be of sufficient size. No character string longer than PAPI_MAX_STR_LEN is returned by the PAPIF interface.
Components
PAPI provides several components that allow you to monitor system information for CPUs, network cards, graphics accelerator cards, parallel file systems, and more. The CPU components (perf_event and perf_event_uncore) and the sysdetect component are enabled by default; all other components must be specified during the installation process.
Components Types
In PAPI there are two types of components:
- Standard: fully initialized after a call to PAPI_library_init (e.g. perf_event and perf_event_uncore);
- Delay Init: fully initialized after a call to any of the PAPI functions that access the component, like PAPI_enum_cmp_event or PAPI_add_event (e.g. cuda and rocm).
After calling PAPI_library_init, Delay Init components are in an intermediate initialization state. This is conveyed to the user by setting the disabled flag of the component info structure to PAPI_EDELAY_INIT. Users who do not need to access the component info structure do not need to be concerned with delayed initialization. Initialization of a Delay Init component is completed by, for example, a call to PAPI_enum_cmp_event. If the call returns PAPI_OK, the component's disabled flag is updated from PAPI_EDELAY_INIT to PAPI_OK.
The reason PAPI has Delay Init components is to minimize overhead. Some components, like GPU components, may have hundreds of thousands of events that require several minutes to be accessed. If a component with hundreds of thousands of events is configured in PAPI but the user does not need it, it would be unreasonable for the user to wait several minutes for the component to be initialized.
Installation of Components
To install PAPI with additional components, you have to specify them during configure.
For example, to install PAPI with the CUDA component enabled:
./configure --with-components="cuda"
If you want to install multiple components, you must specify them as a space-separated list.
Example:
./configure --with-components="appio coretemp cuda nvml"
List of Components
The following table lists all available components.
Before installing a component, please read further instructions by clicking on the desired component name.
Note: The name of the components in the table is shown as it must be used in configure.
Component Name | Description | |
---|---|---|
CPU | ||
perf_event | Linux perf_event CPU counters (default) | |
perf_event_uncore | Linux perf_event CPU uncore and northbridge (default) | |
perfctr | Linux perfctr CPU counters (only used for Linux before 2.6.31) | |
perfctr_ppc | Linux perfctr CPU counters for IBM PowerPC (9) architecture (only used for Linux before 2.6.31) | |
perfmon_ia64 | Linux perfmon2 CPU counters for Itanium architecture (only used for Linux before 2.6.31) | |
perfmon2 | Linux perfmon2 CPU counters (only used for Linux before 2.6.31) | |
GPU | ||
cuda | CUDA events and metrics via NVIDIA CuPTI interfaces | |
nvml | NVIDIA hardware counters (usage, power, temperature, fan speed, etc) | |
rocm | GPU events and metrics via AMD ROCm-PL API | |
rocm_smi | AMD GPU hardware counters (usage, power, temperature, fan speed, etc) | |
Power | ||
host_micpower | Host-side power usage on MIC guest cards | |
libmsr | Measuring and capping power usage on recent Intel architectures using the RAPL interface | |
micpower | Power usage on Intel Xeon Phi (MIC) | |
nvml | NVIDIA hardware counters (usage, power, temperature, fan speed, etc) | |
powercap | Linux powercap energy measurements | |
powercap_ppc | Linux powercap energy measurements for IBM PowerPC (9) architecture | |
rapl | Linux RAPL energy measurements | |
rocm_smi | AMD GPU hardware counters (usage, power, temperature, fan speed, etc) | |
sensors_ppc | Linux sensors_ppc energy measurements | |
Network | ||
infiniband | Linux Infiniband statistics using the sysfs interface | |
net | Linux network driver statistics | |
I/O | ||
appio | Linux I/O system calls | |
io | Linux I/O statistics from /proc/self/io | |
lustre | Lustre filesystem statistics | |
stealtime | Stealtime filesystem statistics | |
Other | ||
bgpm | Hardware counters for Blue Gene/Q | |
coretemp | Linux hwmon temperature and other info | |
coretemp_freebsd | FreeBSD hwmon temperature and other info | |
emon | EMON counters for Blue Gene/Q | |
example | Simple example component | |
lmsensors | Linux LMsensor statistics | |
mx | Myricom MX (Myrinet Express) statistics | |
pcp | Performance Co-Pilot | |
sde | Software defined events | |
vmware | Support for VMware (vmguest and pseudo counters) | |
sysdetect | Support for system detection information | |
Example Code
Throughout this overview are a number of blocks of example code. It is our intention that this code will be executable by simply copying it into a file, compiling it, and linking it to the PAPI library. Many code blocks will reference an external error handling function called handle_error(). One implementation of such a function is shown below:
#include <stdlib.h>
#include <stdio.h>
#include <papi.h>
void handle_error (int retval)
{
    printf("PAPI error %d: %s\n", retval, PAPI_strerror(retval));
    exit(1);
}
We have developed and tested these examples assuming a Linux and gcc toolchain. Your environment may differ and require appropriate adaptation. To compile the above error handler, assuming that the file containing it is in the same directory as papi.h, use a command line similar to:
gcc -I. -c handle_error.c
To compile a test program under the same conditions, use a command line like:
gcc -I. example.c handle_error.o libpapi.a -o example
If you encounter example code that will not compile and run, please let us know. Keeping our examples up to date is an ongoing process.
Events
PAPI counts events that occur on a CPU or other subsystem. There are usually more events to be measured than counter registers to count them in, so PAPI also provides the means to map events to counters. To learn more about events, click here, or on the title above.
In addition to the events that are native to each component, PAPI defines a set of preset events that are standardized across all CPU components. To facilitate the discovery of supported events, PAPI provides query functions to inquire about the availability of specified events. Events are often referred to by name, but internally PAPI uses an opaque code to specify an event. Translation functions are provided to convert between names and codes. For convenience, event codes for a specific component can be collected into event sets. A variety of functions are available to manage event sets. Additionally, a number of options can be set, either for the behavior of the whole library, or for an individual event set.
All of these features are described in greater detail below.
Native Events
Native events comprise the set of all events that are available for a specific component. For CPUs, there are generally far more native events available than can be mapped onto PAPI preset events. For other components, native events are generally the only option available. Click here, or on the title above for more information on native events and examples of their use.
Preset Events
Preset events, also known as predefined events, are a common set of CPU events deemed relevant and useful for application performance tuning. PAPI defines a set of about 100 preset events for CPUs, which can be found here. A given CPU will implement a subset of those, often no more than several dozen. Although the names and calling semantics of preset events are standardized across platforms, the exact definitions are determined by the underlying hardware. Caveat emptor. For more details on preset events and examples of their use, click here, or on the title above.
Event Query
Several low-level functions can be called to learn more about preset or native events.
PAPI_query_event indicates whether a given event is implemented on a given platform, returning PAPI_OK if it is and an error code otherwise;
PAPI_get_event_info returns a structure containing information about a specific event; and
PAPI_enum_event returns the next event in a sequence given the event code of a specific event. This function is useful for enumerating over a list of events.
For more details on these functions and examples of their use, click here, or on the title above.
Event Translation
A preset or native event can be referenced by name or by event code. Most PAPI functions require an event code, while most user input and output is in terms of names. Two low-level functions are provided to translate between these formats. They are discussed with usage examples here or by clicking on the title above.
Event Sets
Event Sets are user-defined collections of hardware events (preset or native), which are measured together to provide meaningful information. Events in an Event Set must all belong to a single component. Multiple Event Sets can be defined at the same time, but only one per component can be active. For details on managing Event Sets, including function calls and example code, click here or on the title above.
Getting and Setting Options
There are a number of options that can globally affect the operation of the entire PAPI library or locally affect a specific event set. These options can be reviewed and set by calling a pair of low-level functions, as described in more detail here and via the title above.
PAPI Counter Interfaces
High Level API
The high-level API (Application Programming Interface) provides the ability to record performance events inside instrumented regions of serial, multi-processing (MPI, SHMEM) and thread (OpenMP, Pthreads) parallel applications. It is designed for simplicity, not flexibility. For more details click here or on the title above.
Simplified Rate Functions
PAPI provides four simplified functions to get Mflops/s (floating point operation rate), Mflips/s (floating point instruction rate), IPC (instructions per cycle), and EPC (arbitrary events per cycle). For more details click here or on the title above.
Low Level API
The low-level API (Application Programming Interface) manages hardware events in user-defined groups called Event Sets. It is meant for experienced application programmers and tool developers wanting fine-grained measurement and control of the PAPI interface. It provides access to both PAPI preset and native events, and supports all installed components. For more details on the Low Level API, click here or on the title above.
PAPI Timers
PAPI provides four functions to measure time in microseconds or cycles for either real (wall clock) time or virtual (process) time. These timers use the most accurate timers available on the platform in use. More information on these routines can be found here or by clicking the title above.
PAPI System Information
This section explains the PAPI functions associated with obtaining hardware and executable information. Code examples along with the corresponding output are included as well.
Advanced PAPI Features
PAPI supports a number of advanced features beyond simple event counting. You can learn more about these advanced topics by following the title links below.
Multiplexing
Hardware performance counters are generally a scarce resource. There are often many more events of interest than counters to count them on. Multiplexing is one way around this dilemma. It doesn't come without trade-offs. Click here or the title above to learn more.
Parallel Programming
PAPI can be used with parallel as well as serial programs. For a discussion of issues that come up in threaded or multiprocess environments, click here or the title above.
Overflow
Most processors can generate an interrupt when a performance counter exceeds a threshold value. PAPI allows you to attach an interrupt handler to that occurrence so you can perform periodic activities where the period is determined by an event other than time. Learn more by clicking here or the title above.
Statistical Profiling
By using the overflow capabilities of PAPI, it is possible to create profiles of the distribution of various performance events across a selected address space. Learn more by clicking here or on the title above.
PAPI Error Handling
Sometimes things don't go as planned. Most PAPI routines will tell you when that happens. It's always a good idea to check if things worked and let someone know if they didn't. To learn more about the return codes that PAPI provides, and how to turn them into meaningful messages, click here or the title above.
Many of the code snippets in this Overview and in the PAPI man pages refer to a routine called handle_error. One possible implementation of this routine is shown here.
PAPI Utilities
A collection of simple utility commands is available in the src/utils directory. See individual utilities for details on usage.
Utility Name | Description |
---|---|
papi_avail | provides availability and detail information for PAPI preset events |
papi_clockres | prints clock latency and resolution |
papi_cost | provides costs of execution for PAPI start/stop, read and accum |
papi_command_line | executes PAPI preset or native events from the command line |
papi_decode | decodes PAPI preset events into a csv format suitable for PAPI_encode_events |
papi_event_chooser | given a list of named events, lists other events that can be counted with them |
papi_mem_info | provides information on the memory architecture of the current processor |
papi_native_avail | provides detailed information for PAPI native events |