Wiki

How to Write a PaRSEC-Enabled Operation by editing the JDF file directly

See Understanding the Compilation process to get a big picture.

Input Format

The syntax is presented with a simple example that is commented below:

extern "C" %{
#include <core_blas.h>

#include "parsec.h"
#include "data_dist/matrix/matrix.h"
%}

A            [type = "tiled_matrix_desc_t*"]
uplo         [type = PLASMA_enum]
INFO         [type = "int*"]

/**************************************************
 *                      POTRF                     *
 **************************************************/
POTRF(k)

// Execution space
k = 0..A->mt-1

// Parallel partitioning
: A(k, k)

// Parameters
RW T <- (k == 0) ? A(k, k) : T HERK(k-1, k)
     -> T TRSM(k, k+1..A->mt-1)
     -> A(k, k)

BODY
        CORE_zpotrf(uplo, A->mt, T, A->mb, INFO );
END

/**************************************************
 *                      TRSM                      *
 **************************************************/
TRSM(k, n)
// Execution space
k = 0..A->mt-1
n = k+1..A->mt-1

// Parallel partitioning
: A(n, k)

// Parameters
READ  T <- T POTRF(k)
RW    C <- (k == 0) ? A(n, k) : C GEMM(k-1, n, k)
        -> A HERK(k, n)
        -> A GEMM(k, n+1..A->mt-1, n)
        -> B GEMM(k, n, k+1..n-1)
        -> A(n, k)

BODY
   if( uplo == PlasmaLower ) {
           CORE_ztrsm(
                PlasmaRight, PlasmaLower, PlasmaTrans, PlasmaNonUnit,
                A->mb, A->mb, (Dague_Complex64_t)1.0, T, A->mb, C, A->mb);
    } else {
           CORE_ztrsm(
                PlasmaLeft, PlasmaUpper, PlasmaTrans, PlasmaNonUnit,
                A->mb, A->mb, (Dague_Complex64_t)1.0, T, A->mb, C, A->mb );
   }
END


/**************************************************
 *                      HERK                      *
 **************************************************/
HERK(k, n)
// Execution space
k = 0..A->mt-1
n = k+1..A->mt-1

// Parallel partitioning
: A(n, n)

//Parameters
READ  A <- C TRSM(k, n)
RW    T <- (k == 0) ? A(n,n) : T HERK(k-1, n)
        -> (n == k+1) ? T POTRF(k+1) : T HERK(k+1,n)

BODY
   if( uplo == PlasmaLower ) {
           CORE_zherk(
                PlasmaLower, PlasmaNoTrans, A->mb, A->mb,
                (double)-1.0, A, A->mb, (double) 1.0, T, A->mb );
   } else {
           CORE_zherk(
                PlasmaUpper, PlasmaTrans, A->mb, A->mb,
                (double)-1.0, A, A->mb, (double) 1.0, T, A->mb);
   }
END

/**************************************************
 *                      GEMM                      *
 **************************************************/
GEMM(k, m, n)
// Execution space
k = 0..A->mt-1
m = k+2..A->mt-1
n = k+1..m-1

// Parallel partitioning
: A(m, n)

// Parameters
READ  A <- C TRSM(k, n)
READ  B <- C TRSM(k, m)
RW    C <- (k == 0) ? A(m, n) : C GEMM(k-1, m, n)
        -> (n == k+1) ? C TRSM(k+1, m) : C GEMM(k+1, m, n)

BODY
    if( uplo == PlasmaLower ) {
           CORE_zgemm(
                PlasmaNoTrans, PlasmaTrans, A->mb, A->mb, A->mb,
                (Dague_Complex64_t)-1.0, B, A->mb, A, A->mb,
                (Dague_Complex64_t) 1.0, C, A->mb);
   } else {
           CORE_zgemm(
                PlasmaTrans, PlasmaNoTrans, A->mb, A->mb, A->mb,
                (Dague_Complex64_t)-1.0, B, A->mb, A, A->mb,
                (Dague_Complex64_t) 1.0, C, A->mb);
   }
END

It is a simplified version of the Cholesky Factorization in PaRSEC. Simplifications consisted in removing GPU-related code, hints on scheduling, and benchmarking options, to measure the overheads of scheduling of the engine.

First, the JDF begins with a preamble in C:

extern "C" %{
#include <core_blas.h>

#include "parsec.h"
#include "data_dist/matrix/matrix.h"
%}

In this preamble, we add the includes that define the prototype of the functions that each task will call, when executed. We also include the parsec.h and the generic include file for tiled matrix distributions. Functions definitions, or global variables, can be added here, but caution should be used, since this code will be dumped as is at the beginning of the generated C file.

Then, we define a set of objects that are used in the tasks below. These objects are recognized by the JDF translator, and thus they must follow a non-C syntax:

A            [type = "tiled_matrix_desc_t*"]
uplo         [type = PLASMA_enum]
INFO         [type = "int*"]

Here, we define for example the general type of the object A as a C structure of type tiled_matrix_desc_t* (more later about this on the data distribution page). uplo and INFO are LAPACK parameters, used by the kernels. The generator function that is going to be created by the JDF compiler will take all these objects as parameters, each one of the specified type (hence, the generator function prototype will be parsec_object_t *function(tiled_matrix_desc_t *A, PLASMA_enum uplo, int *INFO) See more explanations on How to Write a program that uses a PaRSEC-Enabled Operation).

A Cholesky factorization contains four kernels, called POTRF, HERK, TRSM, and GEMM. For each of these kernels, the JDF holds a parameterized task. Let us look closely at the first one, POTRF:

POTRF(k)

// Execution space
k = 0..A->mt-1

// Parallel partitioning
: A(k, k)

// Parameters
RW T <- (k == 0) ? A(k, k) : T HERK(k-1, k)
     -> T TRSM(k, k+1..A->mt-1)
     -> A(k, k)

BODY
        CORE_zpotrf(uplo, A->mb, T, A->mb, INFO );
END

The task is defined with a single parameter, k. This parameter must have an execution space that is entirely defined by the list of global objects that were defined in the beginning of the JDF. In this case, k is between 0 and A->mt-1 (inclusive), and since A->mt is a parameter of the generator function, k is entirely defined. As you can see with GEMM, for example, multiple parameters can be used, and parameters can depend on the value of preceding parameters (in GEMM, k is defined by A->mt, m is defined by the possible values of k and by A->mt, n is defined by the possible values of k and m). This definition must be restrictive: if the task POTRF(0) exists by its parameters (i.e. k can be equal to 0), it will be scheduled by the runtime system.

Then, the task defines its data affinity: task POTRF(k) will run on the node that holds the part of the matrix pointed by A(k, k). Each task must define a single data as its affinity. The entry is a reference on a parsec_ddesc_t object, with the appropriate number of parameters. See more on Write a program that uses a PaRSEC-Enabled Operation about data distributions.

After the data affinity, comes the data flow of the task.

// Parameters
RW T <- (k == 0) ? A(k, k) : T HERK(k-1, k)
     -> T TRSM(k, k+1..A->mt-1)
     -> A(k, k)

This data flow defines a single data element, T, that flows inside POTRF(k) with the first line, and after T has been processed by the task, it flows out of it, towards tasks or memory, as specified by the two last lines. Because the data element is accessed in read and write by the task, it is specified so using the keyword RW in front of it. Each data element that is so defined must have a single input flow (although multiple input flows are acceptable syntactically, they must be mutually exclusive, to ensure the data element is defined by a single task). The first line reads as "T comes from the memory if k is equal to 0, and then it is accessed directly from memory, as provided by A(k, k); otherwise, T comes from the data element also called T in the task HERK(k-1, k)". The second line reads as "T flows to the data element also called T in the task TRSM(k, k+1), and to the data element also called T in the task TRSM(k, k+2), and so on until the data element called T in the task TRSM(k, A->mt-1)". If this set reduces to the empty set, T will not flow out of this task. The last line reads as "one this data element has been processed by this task, no other task of the same operation will modify its value, and it should be copied back into memory, at the location A(k, k)".

The PaRSEC runtime engine does not guarantee that T will always be located in A(k, k) during the whole execution: it can be copied in a temporary memory, for communications or optimization purposes. Nor does it guarantees that T will not point to A(k, k) if this is at all possible. That is the reason why the developer should always specify when the data should be safely written back in memory, and never access the memory directly, but only through the data flow.

Data flows in the JDF must be symmetric: if POTRF writes that it receives data from HERK, HERK must write that it sends data to POTRF. The inversion of parameters must match.

Tasks can define up to MAX_PARAM_COUNT data elements. If more data elements are necessary, it may be necessary to create pseudo tasks to aggregate them.

Additionally to ternary conditionals, the grammar accepts binary conditionals, and most of the classical operations on integers.

The last element of the task definition is its body:

BODY
        CORE_zpotrf(uplo, A->mb, T, A->mb, INFO);
END

The body is a piece of C code that can use the parameters of the task, the data elements defined in the task, and all the global objects that have been defined (although it should not use data references on data elements that flow in one of the tasks). For example, the body of POTRF uses INFO and uplo, which are a memory reference and an integer defined by the global objects, but do not flow in the data flow. And it uses T, which is a data element defined by the data flow of this task. In this case, the C-code to call is minimal: it is just a simple call to the BLAS routine potrf with the appropriate parameters.

The JDF translator will create a piece of code that defines T as a function of the data flow, then dump this C-code, which is not parsed or interpreted, and then a piece of code to handle T (potentially copying it back into memory, if needed), and/or pass it on to other tasks.

Calling the JDF Translator

A JDF file can, and must, be converted into a source, aka. compilable, version using the PaRSEC source-to-source precompiler, parsec_ptgpp. This translator, parsec_ptgpp, is located in the PaRSEC source parsec/interfaces/ptg/ptg-compiler sub-directory, and is installed in the bin directory upon successful compilation.

The outcome of the precompiler is to generate a *.c source file, and the associated .h* header file that correspond to the JDF dataflow, files containing the concise representation of the DAG, and the events that triggers the progress of the entire algorithm.

parsec_ptgpp takes the following options:

--debug|-d Enable bison debug output
--input|-i Input File (JDF) (default '-')
--output|-o Set the BASE name for .c, .h and function name (no default). Changing this value has precedence over the defaults of --output-c, --output-h, and --function-name
--output-c|-C Set the name of the .c output file (default 'a.c' or BASE.c)
--output-h|-H Set the name of the .h output file (default 'a.h' or BASE.h)
--function-name|-f Set the unique identifier of the generated function The generated function will be called PaRSEC_<ID>_new (default a)
--dep-management|-M Select how dependencies tracking is managed. Possible choices are 'index-array' or 'dynamic-hash-table' (default 'dynamic-hash-table')
--noline Do not dump the JDF line number in the .c output file
--line Force dumping the JDF line number in the .c output file Default: --line
--preproc|-E Stop after the preprocessing stage. The output is generated in the form of preprocessed source code, but they are not compiled.
--showme Print the flags used to compile for preprocessed files

And some warning-related options (by default, all warnings are printed). You can disable the following:

--Werror Exit with non zero value if at least one warning is encountered
--Wmasked Do NOT print warnings for masked variables
--Wmutexin Do NOT print warnings for non-obvious mutual exclusion of input flows
--Wremoteref Do NOT print warnings for potential remote memory references

All warnings should be carefully investigated by the PaRSEC-Enabled Operation developer who wants to get a working JDF, because most of them are fatal to a correct behavior in a distributed runtime.

Now What?

Your PaRSEC-Enabled operation is ready! To test it, you should Write a program that uses a PaRSEC-Enabled Operation.