Performance Improvements for PAPI_read on arm64 Processors

Issue #104 on hold
Masahiko Yamada created an issue

On x86 processors, Linux kernels since linux-4.13.0 support direct PMU register
access from user space: the kernel's perf_event area is mmapped into user space
using the file descriptor (fd) opened by perf_event_open, and the PMU registers
are then read from user space with the rdpmc assembler instruction.
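
As a rough illustration of the setup described above, the following C sketch opens a counter for the calling thread with perf_event_open and maps its metadata page. The helper names open_self_counter and can_read_from_user_space are hypothetical, and the CPU-cycles event is an assumption chosen only for the example. The kernel may refuse the call depending on its perf_event_paranoid setting, in which case the sketch returns NULL.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: open a CPU-cycles counter for the calling thread
 * and map the first (metadata) page of the perf_event buffer.
 * Returns the mapped page and stores the fd, or returns NULL on failure
 * (in which case *fd_out is left untouched). */
struct perf_event_mmap_page *open_self_counter(int *fd_out)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof attr;
    attr.config = PERF_COUNT_HW_CPU_CYCLES;  /* assumption: any event works */
    attr.exclude_kernel = 1;

    /* glibc has no wrapper for perf_event_open, so use syscall(2). */
    pid_t pid = 0;          /* calling thread */
    int cpu = -1;           /* any CPU */
    int group_fd = -1;      /* no event group */
    unsigned long flags = 0;
    int fd = (int)syscall(SYS_perf_event_open, &attr, pid, cpu, group_fd, flags);
    if (fd < 0)
        return NULL;

    void *page = mmap(NULL, (size_t)sysconf(_SC_PAGESIZE),
                      PROT_READ, MAP_SHARED, fd, 0);
    if (page == MAP_FAILED) {
        close(fd);
        return NULL;
    }
    *fd_out = fd;
    return (struct perf_event_mmap_page *)page;
}

/* Hypothetical helper: the kernel sets cap_user_rdpmc and a non-zero
 * index when the counter may be read directly from user space. */
int can_read_from_user_space(const struct perf_event_mmap_page *pc)
{
    return pc->cap_user_rdpmc && pc->index != 0;
}
```

A caller would check can_read_from_user_space() before attempting a direct register read and otherwise fall back to read() on the fd.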

The kernel community is working on implementing the same functionality
on arm64 processors.

Comments (4)

  1. Masahiko Yamada reporter

    Dear PAPI library development team,

    In the kernel community, to enable direct PMU register access from user space on arm64 processors,
    the code that issues the assembler instruction for reading the PMU register from user space was standardized in a perf_event-related library (libperf.so),
    and the implementation was reviewed so that users do not have to write the assembler instruction themselves.
    On x86 processors, libperf.so likewise removes the need for users to invoke the assembler instruction (rdpmc) directly.

    For this reason, to improve PAPI_read performance on arm64 processors,
    the papi library needs to use the API provided by the libperf library
    rather than reuse the existing x86 implementation of PAPI_read as is.

    The papi library should be implemented so that PAPI_read performs better when libperf.so exists.
    In other words, if libperf.so does not exist, PAPI_read should still work by falling back to a system call.
    To do this, the papi library must load libperf.so dynamically with dlopen.
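
The optional-library approach described above could be sketched as follows. This is only an illustration: the names select_read_path and read_counter_syscall are hypothetical, and probing for the libperf symbol perf_evsel__read is an assumption about which entry point would be used. The symbol is only looked up here, never called. On older glibc versions the program may need to be linked with -ldl.

```c
#include <dlfcn.h>
#include <stdint.h>
#include <unistd.h>

typedef enum { READ_PATH_SYSCALL, READ_PATH_LIBPERF } read_path_t;

static void *libperf_handle;    /* NULL when libperf.so is unavailable */
static void *libperf_read_sym;  /* address of the probed symbol, if any */

/* Hypothetical helper: try to load libperf.so at run time and decide
 * which read path to use. Falls back to the system call path whenever
 * the library or the probed symbol is missing. */
read_path_t select_read_path(void)
{
    libperf_handle = dlopen("libperf.so", RTLD_LAZY | RTLD_LOCAL);
    if (libperf_handle != NULL) {
        /* Assumption: perf_evsel__read is the function of interest. */
        libperf_read_sym = dlsym(libperf_handle, "perf_evsel__read");
        if (libperf_read_sym != NULL)
            return READ_PATH_LIBPERF;
        dlclose(libperf_handle);
        libperf_handle = NULL;
    }
    return READ_PATH_SYSCALL;
}

/* Fallback path: read one 64-bit counter value with read(2).
 * Returns 0 on success, -1 on a short read or error. */
int read_counter_syscall(int fd, uint64_t *value)
{
    ssize_t n = read(fd, value, sizeof *value);
    return n == (ssize_t)sizeof *value ? 0 : -1;
}
```

Selecting the path once at initialization keeps the per-call cost of PAPI_read limited to a single branch.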

    Please reply if you have any suggestions on how to implement PAPI_read performance improvements with the libperf library.

    Also, PAPI_read is slower on Fugaku (A64FX) than on Intel processors,
    and Fujitsu wants to use the libperf library to improve PAPI_read performance on arm64 processors.

    We at Fujitsu will implement and test the PAPI_read performance improvements with the libperf library on arm64 processors,
    so we would like the papi community to review the source and evaluate the performance on Intel processors.

    Best regards,
    Masahiko Yamada

  2. Daniel Barry

    > Please reply if you have any suggestions on how to implement PAPI_read performance improvements with the libperf library.
    >
    > Also, on Fugaku(a64fx) PAPI_read performance is slower than on Intel processors,
    > and Fujitsu wants to make use of the libperf library to improve PAPI_read performance on arm64 processors.

    We have confirmed the issue and are looking into this.

    Thank you,

    Daniel Barry

  3. Masahiko Yamada reporter

    Dear papi library development team,

    We considered the following three plans to improve the performance of PAPI_read on arm64 processors:

    Plan 1: Use libperf throughout perf_event processing
    Plan 2: Use libperf only for direct PMU register access from user space
    Plan 3: Port direct user space PMU register access from libperf into the papi library, without using libperf

    Currently, libperf is still lacking: it cannot be used to implement the perf_event-related operations of the papi library
    because it provides no way to call ioctl() or fcntl() on the perf_event_open file descriptor.

    For this reason, we decided to implement plan 3 and test a prototype to see
    whether direct PMU register access from user space on arm64 works without libperf on the latest kernel.

    I have posted the prototyped modifications below, so please review them.

    RFC: PAPI_read performance improvement for the arm64 processor
    (icl / papi, Pull Request #347)

    The change builds on the mmap_read_self() function of the x86 version and implements an mmap_read_self() function for the arm64 version,
    with reference to the perf_mmap__read_self() function in libperf's tools/lib/perf/mmap.c.
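
For readers unfamiliar with this kind of function, the general shape of a self-monitoring read can be sketched as below. This is not the code from the pull request or from libperf; it is a simplified illustration under stated assumptions. The hw_read callback and the fake_counter function are hypothetical stand-ins for the architecture-specific register read (rdpmc on x86, a PMU counter register read on arm64), so that the sketch compiles on any Linux system.

```c
#include <linux/perf_event.h>
#include <stdint.h>

/* Hypothetical callback standing in for the architecture-specific
 * hardware counter read. */
typedef uint64_t (*hw_read_fn)(uint32_t counter);

/* Demo stand-in only: pretends hardware counter N holds the value 5 + N. */
static uint64_t fake_counter(uint32_t counter)
{
    return 5u + counter;
}

/* Sketch of a seqlock-protected read of the mmapped perf_event page.
 * Returns 0 and stores the count, or -1 when direct user-space access
 * is not available and the caller should fall back to read(2). */
int mmap_read_self_sketch(volatile struct perf_event_mmap_page *pc,
                          hw_read_fn hw_read, uint64_t *value)
{
    uint32_t seq, idx;
    uint64_t count;

    do {
        seq = pc->lock;            /* sequence counter bumped by the kernel */
        __sync_synchronize();

        idx = pc->index;           /* 0 means the event is not on a counter */
        uint16_t width = pc->pmc_width;
        if (!pc->cap_user_rdpmc || idx == 0 || width == 0 || width > 64)
            return -1;

        /* Sign-extend the raw value from the counter width to 64 bits
         * (relies on arithmetic right shift, as provided by gcc and clang),
         * then add the kernel-maintained offset. */
        int shift = 64 - width;
        int64_t raw = (int64_t)(hw_read(idx - 1) << shift);
        count = (uint64_t)pc->offset + (uint64_t)(raw >> shift);

        __sync_synchronize();
    } while (pc->lock != seq);     /* retry if the kernel updated the page */

    *value = count;
    return 0;
}
```

The retry loop is what makes the read safe without a system call: if the kernel reschedules the event while the page is being read, the sequence counter changes and the values are read again.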

    The x86 version of PAPI_read does not require libperf to be used.

    If my prototype shows that PAPI_read can achieve the performance improvement on arm64 without libperf,
    I would like to ask whether the final version should use libperf or not.

    Best regards,
    Masahiko Yamada
