parsec / high level

The Parallel Runtime Scheduling and Execution Controller (PaRSEC) environment provides a runtime component capable of dynamically executing on heterogeneous distributed systems along with a productivity toolbox. These features comprise a development framework that supports multiple domain-specific languages and extensions and includes tools for debugging, trace collection, and analysis. The Distributed Tasking for Exascale (DTE) project plans to further develop these PaRSEC features for the Exascale Computing Project (ECP)---in terms of scalability, interoperability, and productivity---to line up with the critical needs of ECP application communities.

WHY?

Exascale computing will require many system architecture changes, including a drastic increase in hardware parallelism. It is widely agreed that this unprecedented increase in concurrency---at least three orders of magnitude---is among the most formidable challenges of extreme-scale computing. Programming models that cannot expose and exploit such vast amounts of parallelism, even as it varies dynamically in response to ongoing system conditions, will struggle to reach exascale or even near-exascale performance. Moreover, the fact that the amount of available parallelism will vary dynamically means that it is problematic to rely on known techniques, such as MPI+X, to consistently exploit all of the potential concurrency. Our solution, implemented in production form in the PaRSEC environment, focuses on developing a programming paradigm that can both expose and dynamically manage massive levels of parallelism while delivering performance portability across heterogeneous systems.

The limitations of current programming paradigms---relative to the new concurrency landscape---are well known, and the search for alternative approaches has demonstrated proof-of-concept impacts on application scalability and efficiency. These limitations require application developers to make a significant development shift to transition their applications toward modern programming paradigms. To aggregate a collection of computing elements that is growing exponentially, new system architectures rely on hierarchical composition: the memory hierarchy deepens, thereby increasing non-uniform memory access (NUMA) effects; network interconnects use high-dimension tori or other scalable structures; and manycore processors are embedded in the computing nodes, either as separate or as self-hosted accelerators. Extracting performance from such hardware, where the level of parallelism is increased 1,000-fold, requires exploiting an even larger degree of parallelism from the application. Applications that are ready for exascale feature such potential parallelism, but once it is exposed, decisions about the efficient placement of work must be made at runtime. Synchronous programming models make such decisions exceedingly difficult, if not impossible.

Consequently, a programming paradigm built around a far less synchronous approach is needed. Task-based programming has proven to be both efficient and productive in this regard. In task-based programming, the algorithm is expressed as a set of interdependent, fine-grain tasks (each task being a set of instructions that accesses and modifies an explicit and bounded amount of data), and a runtime system is responsible for scheduling these tasks while satisfying the data dependencies between them, in order to maximize resource utilization and absorb system noise and communication delays (delays caused by system noise introduce significant slack in large-scale synchronous applications). With the PaRSEC framework, we propose a solid implementation of this paradigm---a runtime system capable of efficiently executing large, task-based applications on distributed heterogeneous resources.

HOW?

With PaRSEC, we focus our work on two sides of this challenge: (1) the management of extreme-scale parallelism and (2) the reasonableness of the programming model (both of these topics are part of the Top Ten Technical Challenges for Exascale defined in the Advanced Scientific Computing Advisory Committee's findings). Specifically, we deliver the PaRSEC framework, which is capable of handling heterogeneous resources at scales ranging from a single node all the way to exascale supercomputers, while providing an API to facilitate application programmability and increase scientific productivity. We define programmability as the capacity to extract performance from complex architectures, and we define scientific productivity as the capacity to create efficient executables for hybrid exascale computers from source code that is easily maintainable by a domain scientist.

To reach these goals of programmability, scientific productivity, and scalability, the PaRSEC task environment simultaneously addresses the technical challenges associated with domain-specific languages, scalable system software, evolving technology, energy efficiency, and resilience. More details about how PaRSEC addresses each challenge are provided below.

Domain-Specific Languages:

A major effort will extend PaRSEC support for compiler integration, enabling domain-specific languages (DSLs) to be easily designed and integrated with the runtime. No programming model will be mandated beyond the task-based approach, and the delivered runtime will provide interoperability and execution support for a combination of OpenMP, dataflow, sequential task flow, and other---yet-to-be-discovered---task-based programming approaches.

Scalable System Software:

The design of the PaRSEC runtime system focuses on scalability. Task knowledge should be limited to the processes that are responsible for their execution, scheduling decisions must be entirely distributed, algorithmic correctness should not require control synchronization, and collective synchronization should be avoided. From a technical perspective, particular emphasis was placed on reducing the memory consumption of the runtime itself and on providing a scalable version of all distributed constructs. Despite these design and implementation efforts, larger environments will pose new challenges as they become available; the implementation will need to be continuously tested at larger scales, and bottlenecks will need to be identified and removed as they appear.

Evolving Technology:

For portability, PaRSEC comes with a default MPI driver, but access to lower-level network features (e.g., active messages, direct memory access engines, and hardware-supported asynchronous communications) will have a positive impact on performance. A wider variety of accelerators, self-hosted accelerators, and relaxed memory models are also disrupting traditional technologies, and the PaRSEC runtime provides a flexible framework capable of integrating these challenging systems into a manageable and efficient abstraction.

Energy Efficiency:

The hardware trend toward exposing finer-grain control over power consumption (e.g., Haswell, Knights Landing, Knights Hill) provides the capacity to turn cores on and off as work and demand dictate. A task runtime system that decides dynamically where tasks execute can take advantage of these hardware capabilities and adapt to a dynamically evolving set of computing resources. With its knowledge of the tasks to be executed, its understanding of the hardware capabilities, and its flexible, modular scheduling capabilities, PaRSEC is ideally suited to address an application's dynamic energy needs.

Resilience:

Because the data that tasks require and produce are explicit to the runtime system, general strategies such as data replication or incremental checkpoint/restart can be embedded in the runtime to provide automatic support for soft error detection and recovery from process failures.

OVERVIEW

The PaRSEC environment simplifies and accelerates the design and execution of high-performance applications on leadership-class systems. PaRSEC helps the application programmer express parallelism without requiring the intricacies of explicit parallelism in the programming models. In PaRSEC, algorithms are expressed as a set of elementary tasks, which can then execute efficiently at a massively parallel scale and be automatically delegated to accelerators. These tasks operate on explicit input data and produce (possibly in-place) output data, and the PaRSEC runtime uses that dataflow (1) to ensure that tasks using the same data execute in the correct order and (2) to automate data movement across nodes and between accelerators and hosts. The PaRSEC runtime was designed from the beginning to handle extremely fine-grain tasks (tens of microseconds) and to support large-scale, highly heterogeneous systems.

As a low-level infrastructure, the PaRSEC runtime can serve to spotlight sensitive operating system (OS)/hardware features that have a notable impact on the efficiency of asynchronous applications. The PaRSEC runtime can also be used to test which OS/hardware innovations lead to technological dead ends or are too challenging to exploit effectively in practice. Moreover, it can be used to validate key technical alternatives (e.g., which relaxed memory-coherency hypotheses, accelerator technologies, and network capabilities are effective), while hiding the additional complexity in a middleware layer so that applications continue to experience a comprehensible machine abstraction.

MACHINE ABSTRACTION

The PaRSEC runtime presents the hardware resources through an abstract machine composed of multiple nodes, with each node divided into multiple virtual processes. Each virtual process contains multiple flows of execution (streams) that execute tasks serially. Streams can access any data in the node-local memory, so the scheduler may select the most appropriate stream to execute tasks while simultaneously optimizing execution time, data reuse, NUMA access cost, and load balance between streams. The end user can specify how many streams a PaRSEC machine uses and how those streams are placed on the physical cores, and can allow a stream to span multiple cores (e.g., when the task kernel creates an OpenMP region).
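
As a minimal sketch of the node-level setup (assuming the parsec.h header and the parsec_init/parsec_fini entry points; check the header installed with your PaRSEC release for the exact prototypes), requesting a given number of compute streams looks roughly like this:

    #include <parsec.h>

    int main(int argc, char *argv[])
    {
        /* Request 8 compute streams (threads of execution); a negative value
         * typically lets the runtime use every core it detects.  Placement on
         * physical cores can then be controlled through PaRSEC's binding options. */
        parsec_context_t *context = parsec_init(8, &argc, &argv);
        if (NULL == context) return 1;

        /* ... create taskpools, enqueue them, and run them here ... */

        parsec_fini(&context);
        return 0;
    }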

A node may also feature accelerator resources. An accelerator is represented by a supplementary set of execution streams. For each task class, the programmer provides the CPU stream operator and can optionally provide an accelerator operator (e.g., a CUDA function) as well as hints regarding the computational load of the operator. The PaRSEC runtime automatically copies the data between the host memory and the accelerators. Versioned data copies remain cached in the accelerator memory; coherency and version management of copies---as well as their transfer back-and-forth to the accelerators---is handled by the PaRSEC runtime. Similarly, the selection of a CPU execution stream or a GPU execution stream is delegated to the scheduler.
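
Schematically, in the C-like PTG notation (a simplified fragment; the exact grammar and the device annotation may differ between PaRSEC versions), a task class can carry one body per device type, and the scheduler picks an incarnation when it dispatches the task:

    GEMM(m, n, k)
    /* ... execution space, affinity, and dataflow omitted for brevity ... */

    BODY [type=CUDA]
    {
        /* accelerator incarnation: flows A, B, C are bound to device copies */
        my_cublas_gemm(A, B, C);   /* hypothetical user-provided GPU kernel  */
    }
    END

    BODY
    {
        /* default CPU incarnation, bound to host copies of the same flows   */
        my_cblas_gemm(A, B, C);    /* hypothetical user-provided CPU kernel  */
    }
    END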

The data collection describes the distribution of data blocks across multiple nodes. The end user is in charge of selecting both the data distribution of the collections and the affinity of tasks with nodes. Provided with this information and the dataflow unfolding from the node-local scheduler, the runtime schedules, asynchronously, the necessary data transfer to move inputs to task execution sites. All communication operations are implicit in the program, yet the communication volume can be finely controlled by an expert programmer by setting the data collection distribution and task affinity.
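
For intuition only (this is not the PaRSEC data collection API), the owner of a block in the widely used 2-D block-cyclic distribution over a P x Q process grid can be computed as below; PaRSEC's bundled collections answer exactly this kind of ownership query, and task affinity then pins each task to the node owning one of its blocks:

    /* Illustrative owner computation for a 2-D block-cyclic distribution over
     * a P x Q process grid: block (i, j) lives on the rank returned below. */
    static inline int block_owner(int i, int j, int P, int Q)
    {
        return (i % P) * Q + (j % Q);
    }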

PROGRAMMING PARSEC

The PaRSEC infrastructure revolves around the notion of a task, an atomic computational entity defined as "a pure function dependent only on its inputs and without any synchronization point." It is assumed that the application developer will provide implementations of these tasks, or kernels, for different architectures, ensuring that each incarnation achieves practical peak performance on the given hardware. The runtime is then responsible for asynchronously orchestrating the data movement between the different computational entities in order to keep all available computational units highly occupied. This separation of concerns ensures (1) user productivity, by allowing the development of hardware-agnostic algorithms, and (2) performance portability, by relying on highly optimized libraries provided by hardware manufacturers.

A PaRSEC routine is a set of task classes that represent elementary operations applied to blocks of data. Each task class describes the inputs, the outputs, and the operator applied to the input data to produce the output data. Tasks are pure: they only read their input data and only modify their output data. During execution, all task instances are scheduled on the available resources in parallel.

The PaRSEC environment provides multiple front ends, or DSLs, tailored to particular use cases (e.g., chemistry, computational fluid dynamics, linear algebra) for writing these task classes. High-level DSLs take C, C++, or Fortran as input---depending on the customs and preferences of a given user community---and can automatically derive the dependencies between tasks. The low-level DSLs allow task classes and their data dependencies to be expressed explicitly in a C-like language. PaRSEC tasks operate on data blocks, which can be sparse or irregular, or can carry a datatype that represents a complex memory layout. Data blocks are assembled into distributed collections that represent the distribution of the dataset on the machine. PaRSEC provides a rich toolbox of data collections to distribute data according to common or fully customized patterns, as well as common parallel operators (e.g., maps, broadcasts, reductions) that operate on such collections.

Interoperability with Legacy Applications

PaRSEC is interoperable with legacy technologies (e.g., MPI, CUDA), which permits an incremental upgrade of existing applications. Typically, a programmer starts from an existing single program, multiple data (SPMD) application and rewrites a particularly costly computational routine with PaRSEC, while leaving the rest of the application unchanged. The PaRSEC-accelerated routine typically looks and behaves like an SPMD routine when viewed from the outside, while internally it unleashes the full parallelism permitted by the dataflow.
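
A minimal sketch of that upgrade path, assuming an existing MPI application and a hypothetical PaRSEC-based routine named accelerated_solver (the legacy code around it is untouched):

    #include <mpi.h>
    #include <parsec.h>

    int main(int argc, char *argv[])
    {
        int provided;
        /* A threaded MPI level is typically required because PaRSEC drives
         * communications from a dedicated thread. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_SERIALIZED, &provided);

        /* The PaRSEC context coexists with the communicator already used by
         * the legacy SPMD code. */
        parsec_context_t *context = parsec_init(-1, &argc, &argv);

        /* ... legacy MPI setup and data distribution ... */

        /* From the caller's perspective this looks like any other SPMD routine;
         * internally it unfolds the full task dataflow on CPUs and accelerators. */
        accelerated_solver(context);   /* hypothetical PaRSEC-accelerated routine */

        /* ... legacy MPI post-processing ... */

        parsec_fini(&context);
        MPI_Finalize();
        return 0;
    }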

SOFTWARE INFRASTRUCTURE

Modular Architecture

PaRSEC is designed following a modular component architecture that enables users and application developers to dynamically select specific features and capabilities at run time, either to adapt the task system to specific hardware environments or to enforce distinct behaviors. Most of these components are generic and provide portable functionality that can be used by any application. However, more specialized behaviors can be enforced by an application by delivering specialized components that implement the corresponding standardized application programming interfaces (APIs). As an example, the node-level task scheduler is provided as a component, which enables users to dynamically select which scheduler implementation is used for the current set of tasks. The runtime system comes with a predefined set of schedulers (favoring cache reuse on NUMA machines, a user-defined priority, or a mixture of the two), but users may provide their own scheduler that follows the module's API and load it through the dynamic library loader.
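
As an illustration of the component idea only (the struct below is hypothetical and not PaRSEC's actual module API), a pluggable scheduler boils down to a named table of entry points that the runtime invokes to enqueue ready tasks and to pick the next task for an execution stream:

    /* Hypothetical sketch of a pluggable scheduler component; the real PaRSEC
     * module API differs, but the shape is similar: a named set of callbacks
     * selected at run time. */
    typedef struct sched_component_s {
        const char *name;                               /* e.g., "my_sched"       */
        int   (*install)(void *context);                /* one-time setup         */
        int   (*push)(void *stream, void *ready_task);  /* hand a ready task over */
        void *(*pop)(void *stream);                     /* pick the next task     */
        void  (*uninstall)(void *context);              /* cleanup at shutdown    */
    } sched_component_t;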

Communication Engine

In PaRSEC, the runtime system manages communication transparently for the application developer: following a dataflow approach, the algorithm is expressed as a set of tasks applied to data. Data elements are described to the runtime system by the application developer, and---starting from that point---the runtime system moves the data where the tasks require them, without the application needing to call MPI routines explicitly. Communications happen in the background so that they overlap with one another and with computation, thereby hiding the cost of message passing from both a programmatic and a performance point of view. Communications are scheduled following an eager strategy that aims to prepare the data as soon as possible without hindering the computation's progress.

Programming Interfaces

PaRSEC features multiple programming interfaces and is designed to be extended to interfaces suited for a given application developer's scientific domain. The two general-purpose programming interfaces distributed with the main PaRSEC library are (1) the sequential task flow (STF) interface and (2) the parameterized task graph (PTG) interface.

STF is pervasive in modern task systems: a thread inspects the set of tasks to execute, discovering them as it unrolls a sequential algorithm. Instead of executing each task as it is discovered, the tasks are saved in an internal representation of the directed acyclic graph (DAG) of tasks, where each node of the graph represents a discovered task and each incoming edge represents a dependency on a previously discovered task. The STF API enables the application developer to express which data each task accesses and in which mode; the dependencies are then inferred from the Bernstein conditions over the data accessed by each task. As the DAG is discovered, ready tasks are scheduled on cores or accelerators by one of the available PaRSEC schedulers. The execution, although parallel, is guaranteed to be semantically equivalent to the execution of the serial code, providing an elegant interface for automatic parallelization.
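
The following sketch conveys the general shape of an STF program; the names (stf_insert_task, READ, READWRITE, block) are illustrative placeholders rather than the exact PaRSEC interface. The loop is sequential and discovers tasks in program order, and the declared access modes are what the runtime feeds into the Bernstein conditions:

    /* Hypothetical STF-style insertion loop (placeholder names, not the exact
     * PaRSEC API).  The runtime builds the DAG from the declared access modes
     * and executes ready tasks in parallel. */
    for (int k = 0; k < N; k++) {
        stf_insert_task(taskpool, factor_kernel,
                        block(A, k, k), READWRITE,       /* writes the diagonal block       */
                        END_OF_ARGS);
        for (int m = k + 1; m < N; m++)
            stf_insert_task(taskpool, update_kernel,
                            block(A, k, k), READ,        /* reads what factor_kernel wrote  */
                            block(A, m, k), READWRITE,   /* updates the block below         */
                            END_OF_ARGS);
    }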

PTG is less common. A PTG is a synthetic and abstract representation of the entire DAG of tasks. Thanks to the parameterization, the representation is compact and does not depend on the problem size. Although this interface requires a deeper understanding of the dataflow of a given algorithm, the benefits of improved scalability and reduced management overheads make PTG the interface of choice for the Distributed Parallel Linear Algebra Software for Multicore Architectures (DPLASMA) library that is built on top of PaRSEC.
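
As a schematic example (simplified from the C-like notation used by the PTG interface; the real grammar has more features), a parameterized task class describes its execution space, its affinity, and its dataflow once, independently of the problem size:

    STEP(k)                                         /* one task instance per value of k   */
      k = 0 .. N-1                                  /* execution space                    */
      : A(k)                                        /* affinity: run where A(k) resides   */

      RW   X <- A(k)                                /* this task's block, from memory     */
             -> (k < N-1) ? P STEP(k+1)             /* forwarded to the next task         */
             -> A(k)                                /* and written back to memory         */
      READ P <- (k == 0) ? A(0) : X STEP(k-1)       /* the previous task's block          */

    BODY
    {
        update_block(X, P);   /* user-provided kernel operating on the bound blocks */
    }
    END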

The Template Task Graph (TTG) is an ongoing effort to define a set of flexible C++ templates for describing data-dependent algorithms, with a focus on programmer productivity and portable performance. A production-ready version of this effort is still in development, but this DSL will soon join the others as part of the PaRSEC software releases.

Data Management, Heterogeneous Architectures, and Interface with Other Paradigms

As a dataflow engine, PaRSEC manages the data that is passed to the task system, from the inputs of the DAG of tasks to its outputs. Multiple versions of the same data may coexist at a given time, replicated on multiple nodes or memory devices, or different versions of the same data may be created to increase parallelism and the capacity for overlap. All of the data movement is implicit in the algorithm description, which takes a declarative approach: tasks announce what data they need and what data they produce (or modify), and the runtime system is responsible for moving or copying the required data to a memory that is accessible to the device on which the task is scheduled.

This naturally enables execution over heterogeneous architectures at a very small coding cost. If a version of the task suitable to run on an accelerator is present, the scheduler may decide to execute that task on the accelerator, and the data management system will be responsible for acquiring the appropriate copies of the input and output data to enable that execution. From a programming point of view, the data are located in the accelerator memory and passed to the user-provided functions as pointers, without the developer needing to implement the data movement. Data are also moved transparently between devices while cores are busy computing ready tasks.

Because not all parts of an application need to be written using the PaRSEC interfaces, different levels of composition (coarse or fine) are available, either to feed the data consumed by a PaRSEC algorithm as they are produced by another paradigm, or to extract the data produced by a PaRSEC algorithm and provide them to another programming paradigm. This approach, which was tested on complex applications like NWChem, enables a step-by-step transition from an imperative parallel programming approach to a dynamic task system approach, without requiring a substantial rewrite of the application.

Profiling and Instrumentation

Another example of a modular component used in PaRSEC is the PaRSEC Instrumentation Module, which is built on top of a generic and highly efficient profiling system. PaRSEC enables end users to obtain detailed information on the execution of tasks, either for post-mortem analysis or during the execution to tune parameters. As tasks are scheduled on computing resources, enabling further tasks and progressing the computation, they follow a finite state machine that describes their life cycle. In its simplest form, a task is instantiated by the release of its input flows, the data corresponding to each flow are pulled from their current repositories, the task is executed by a core or an accelerator---releasing new data---and the task is then released. PaRSEC's state machine enables the application developer and the runtime system to create additional transitions, which allows a task, for example, to return to the data-pulling state if its code requires moving data to a different device. At each transition of this finite state machine, a tracing system records the time of the operation and can, at the user's request, decorate these records with user-selected hardware counters or user-defined parameters. To simplify the selection of these events, a set of modules that can be loaded at runtime is made available. As with the scheduling capabilities, users may define their own modules, following the API, and provide them at runtime.
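
A toy rendering of that life cycle follows (the state names are illustrative, not PaRSEC's internal identifiers); the instrumentation system records a timestamped event at each transition between these states:

    /* Illustrative task life cycle, matching the description above; not the
     * runtime's actual internal state names.  Profiling hooks fire on each
     * transition and can attach hardware counters or user-defined values. */
    typedef enum {
        TASK_INSTANTIATED,   /* all input flows have been released           */
        TASK_DATA_PULLED,    /* input data fetched from their repositories   */
        TASK_EXECUTED,       /* body ran on a core or an accelerator         */
        TASK_RELEASED        /* outputs released to successors, task retired */
    } task_lifecycle_state_t;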

SCIENTIFIC OUTCOME

PaRSEC has achieved promising results on a variety of applications, including dense and sparse linear algebra, highly irregular domain science codes, domain decomposition, and many more. In our publications, we have shown that PaRSEC greatly simplifies the writing of highly efficient linear algebra operations, where the flexibility of PaRSEC permits expressing algorithm-tailored behaviors (e.g., a reduction tree that reduces the communication delay in the performance-critical panel) while operating on a flexible distribution of the data, thereby adapting to a variety of architectures and problem sizes. In addition to the performance gain, this example highlights one of the most important features of the dataflow programming approach exposed in PaRSEC: the depiction of the algorithm is independent of the data distribution and resource mapping, which improves the performance portability of optimized codes.
