UFLACS preintegration emits inefficient code

Issue #173 resolved
Jan Blechta created an issue
from dolfin import *

import dijitso; dijitso.set_log_level("debug")

fcp = parameters['form_compiler']
#fcp.add('enable_preintegration', False)  # Fixes the problem

mesh = UnitCubeMesh(1, 1, 1)
V = FunctionSpace(mesh, 'Nedelec 1st kind H(curl)', 3)  # 45 dofs per cell
u, v = TrialFunction(V), TestFunction(V)
a = (inner(curl(u), curl(v)) - Constant(1)*inner(u, v))*dx
assemble(a)

The C++ compiler eats over 10 GB of memory when trying to compile the code generated for the form above. The problem is present in releases 2017.1.0 and 2017.2.0.

Comments (12)

  1. Jan Blechta reporter

    No, tsfc keeps the loops over the rows and columns of the cell tensor for both its spectral mode (the default in FFC) and its tensor mode. To summarize, with

    fcp['representation'] = 'uflacs'
    

    g++ eats more than 10 GB of memory (I don't know exactly how much; I killed it) because the loops over i in range(45), j in range(45) get fully unrolled.
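    For scale, a rough self-contained sketch (not FFC code; the sp[...]/PI names are illustrative) of what that unrolling amounts to: preintegration emits one flat statement per index combination instead of a loop nest, so a single 45x45 block alone becomes 2025 statements for g++ to chew through.

    import itertools
    
    # One flat accumulation statement per (i, j) pair of a 45x45 block;
    # the array names stand in for the generated factor and table names.
    statements = [
        "A[%d] += sp[409] * PI44[0][%d][%d];" % (45 * i + j, i, j)
        for i, j in itertools.product(range(45), range(45))
    ]
    print(len(statements))  # 2025 unrolled statements for this block alone
    

    By contrast,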

    fcp['representation'] = 'uflacs'
    fcp.add('enable_preintegration', False)
    

    works fine, as does

    fcp['representation'] = 'tsfc'
    fcp.add('mode', 'spectral')
    
    COFFEE finished in 0.0179336 seconds (flops: 1350658 -> 1350658)
    

    and

    fcp['representation'] = 'tsfc'
    fcp.add('mode', 'tensor')
    
    COFFEE finished in 0.0176239 seconds (flops: 365223 -> 365223)
    
  2. Fabian Löschner

    EDIT: The code presented below produces wrong results; I present it here only to demonstrate that the issue vanishes if the assignments are not unrolled.

    I prevented the preintegrated tables from being inlined and modified the generate_preintegrated_dofblock_partition() method so that it generates loops, e.g. in your example

    for (int i = 0; i < 44; ++i)
        for (int j = 0; j < 44; ++j)
            A[i * j] += sp[409] * PI44[0][i][j];
    

    and so on. With these modifications, gcc compiles the example without problems. So maybe it would be sensible to define an upper size limit above which the preintegration tables are not inlined?

    Regarding my modification of the method for testing: I replaced the loop

    for ii in itertools.product(*blockranges):
        ...
    

    with

    # Build the flat index into A as a product of the argument indices
    # (this matches the A[i * j] indexing shown above)
    A_idx = 1
    for index_symbol in arg_indices:
        A_idx *= index_symbol
    
    # Swap the index order when the block is transposed
    P_arg_indices = arg_indices
    P_arg_indices_rev = tuple(reversed(arg_indices))
    if blockdata.transposed:
        P_arg_indices, P_arg_indices_rev = P_arg_indices_rev, P_arg_indices
    
    # A single accumulation statement reading from the preintegrated table
    P_ii = P_entity_indices + P_arg_indices
    A_rhs = f*PI[P_ii]
    loop_code = L.AssignAdd(A[A_idx], A_rhs)
    
    # Wrap the statement in a loop nest, innermost index first
    for idx, blockrange in enumerate(blockranges):
        loop_code = L.ForRange(P_arg_indices_rev[idx], blockrange.start, blockrange.stop, loop_code)
    
    parts += [loop_code]
    

    and directly returned the parts list without calling generate_tensor_value_initialization() afterwards. Obviously it's just a hack, but it demonstrates that the example compiles without inlining and unrolling. Additionally, it would make sense in this particular example to join all tables into a single array in order to fuse all of the table loops.
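    A rough sketch of that fusing idea, emitting plain strings instead of FFC's CNodes (the factor/table pairs are made up for illustration):

    # Hypothetical sketch: emit one loop nest whose body accumulates every
    # preintegrated table with matching loop ranges, instead of one loop
    # nest per table. The sp[...]/PI names are invented.
    blocks = [("sp[409]", "PI44"), ("sp[410]", "PI45")]
    
    body = "\n".join(
        "        A[45 * i + j] += %s * %s[0][i][j];" % (factor, table)
        for factor, table in blocks
    )
    print("for (int i = 0; i < 45; ++i)\n"
          "    for (int j = 0; j < 45; ++j)\n"
          "    {\n%s\n    }" % body)
    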

    If you think that this is the right direction, I can flesh out my solution a little bit more, run the tests, etc.

  3. Fabian Löschner

    I tried an alternative: splitting each A[...] = assignment into multiple assignments so that there are only 8 terms in each sum instead of the previous ~90. Unsurprisingly, this did not help; GCC still eats memory when compiling the generated code. Apparently this transformation does not prevent the optimization that is responsible for the problem.
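    Roughly, that splitting turns one long accumulation into several shorter statements, along these lines (the term names are stand-ins for the generated subexpressions):

    # Stand-in sketch of the attempted splitting: break one ~90-term sum
    # into chunks of 8 terms each. t0, t1, ... are invented placeholders.
    terms = ["t%d" % i for i in range(90)]
    
    chunks = [terms[i:i + 8] for i in range(0, len(terms), 8)]
    lines = ["A[k] = %s;" % " + ".join(chunks[0])]
    lines += ["A[k] += %s;" % " + ".join(c) for c in chunks[1:]]
    print("\n".join(lines))  # 12 short statements instead of one long one
    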

    Furthermore, I continued with the approach mentioned above and improved it to actually produce correct code. I ran the regression tests on my changes; they obviously fail the code diff checks, but all result validation checks pass. The full code of my changes can be found in this commit. The generated code looks pretty ugly, e.g.:

    int k = 0;
    int l = 0;
    for (int i = 8; i < 45; ++i)
    {
        for (int j = 0; j < 7; ++j)
        {
            A[45 * i + j] += sp[409] * PI44[0][k][l];
            ++l;
        }
        for (int j = 8; j < 45; ++j)
        {
            A[45 * i + j] += sp[409] * PI44[0][k][l];
            ++l;
        }
        l = 0;
        ++k;
    }
    k = 0;
    

    but the code generation by FFC/UFLACS for this example is much faster than before, and GCC compiles it without any problems. As a heuristic for switching to this behavior, I currently check whether any of the preintegration tables is larger than 1024 values (the example that triggered this issue has 44*44 = 1936 entries in its largest table). For the regression tests I used a limit of 10 entries to ensure that the loop code path is exercised by all test cases.
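    In sketch form, the heuristic is just a size check per preintegrated table; the function and constant names below are invented, not FFC API:

    # Fall back to loop generation when a table is too large to unroll.
    MAX_UNROLL_TABLE_SIZE = 1024  # illustrative cutoff, as described above
    
    def should_generate_loops(table_shape):
        """True if a preintegrated table of this shape should get loops."""
        size = 1
        for dim in table_shape:
            size *= dim
        return size > MAX_UNROLL_TABLE_SIZE
    
    print(should_generate_loops((44, 44)))  # True: 1936 > 1024
    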
