UFLACS preintegration emits inefficient code

Issue #173 resolved
Jan Blechta created an issue
from dolfin import *

import dijitso; dijitso.set_log_level("debug")

fcp = parameters['form_compiler']
#fcp.add('enable_preintegration', False)  # Fixes the problem

mesh = UnitCubeMesh(1, 1, 1)
V = FunctionSpace(mesh, 'Nedelec 1st kind H(curl)', 3)  # 45 dofs per cell
u, v = TrialFunction(V), TestFunction(V)
a = (inner(curl(u), curl(v)) - Constant(1)*inner(u, v))*dx
assemble(a)

The C++ compiler eats over 10 GB of memory when trying to compile the code generated for the form above. The problem is present in releases 2017.1.0 and 2017.2.0.

Comments (12)

  1. Jan Blechta reporter

    No, tsfc keeps the loops over the rows and columns of the cell tensor for both its spectral mode (the default in FFC) and its tensor mode. To summarize, with

    fcp['representation'] = 'uflacs'
    

    g++ eats more than 10 GB of memory (I don't know exactly how much; I killed it) because the loops over i in range(45), j in range(45) get fully unrolled.
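    For scale, a rough self-contained sketch (not FFC code; the sp[...]/PI names are illustrative) of what that unrolling amounts to: preintegration emits one flat statement per index combination instead of a loop nest, so a single 45x45 block alone becomes 2025 statements for g++ to chew through.

    import itertools
    
    # One flat accumulation statement per (i, j) pair of a 45x45 block;
    # the array names stand in for the generated factor and table names.
    statements = [
        "A[%d] += sp[409] * PI44[0][%d][%d];" % (45 * i + j, i, j)
        for i, j in itertools.product(range(45), range(45))
    ]
    print(len(statements))  # 2025 unrolled statements for this block alone
    

    By contrast,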

    fcp['representation'] = 'uflacs'
    fcp.add('enable_preintegration', False)
    

    works fine, as does

    fcp['representation'] = 'tsfc'
    fcp.add('mode', 'spectral')
    
    COFFEE finished in 0.0179336 seconds (flops: 1350658 -> 1350658)
    

    and

    fcp['representation'] = 'tsfc'
    fcp.add('mode', 'tensor')
    
    COFFEE finished in 0.0176239 seconds (flops: 365223 -> 365223)
    
  2. Fabian Löschner

    EDIT: The code presented below produces wrong results; I present it here only to demonstrate that the issue vanishes if the assignments are not unrolled.

    I prevented the preintegrated tables from being inlined and modified the generate_preintegrated_dofblock_partition() method so that it generates loops, e.g. in your example

    for (int i = 0; i < 44; ++i)
        for (int j = 0; j < 44; ++j)
            A[i * j] += sp[409] * PI44[0][i][j];
    

    and so on. With these modifications, gcc compiles the example without problems. So maybe it would be sensible to define an upper size limit above which the preintegration tables are not inlined?

    Regarding my modification of the method for testing: I replaced the loop

    for ii in itertools.product(*blockranges):
        ...
    

    with

    # Build the flat index into A as a product of the argument indices
    # (this matches the A[i * j] indexing shown above)
    A_idx = 1
    for index_symbol in arg_indices:
        A_idx *= index_symbol
    
    # Swap the index order when the block is transposed
    P_arg_indices = arg_indices
    P_arg_indices_rev = tuple(reversed(arg_indices))
    if blockdata.transposed:
        P_arg_indices, P_arg_indices_rev = P_arg_indices_rev, P_arg_indices
    
    # A single accumulation statement reading from the preintegrated table
    P_ii = P_entity_indices + P_arg_indices
    A_rhs = f*PI[P_ii]
    loop_code = L.AssignAdd(A[A_idx], A_rhs)
    
    # Wrap the statement in a loop nest, innermost index first
    for idx, blockrange in enumerate(blockranges):
        loop_code = L.ForRange(P_arg_indices_rev[idx], blockrange.start, blockrange.stop, loop_code)
    
    parts += [loop_code]
    

    and directly returned the parts list without calling generate_tensor_value_initialization() afterwards. Obviously it's just a hack, but it demonstrates that the example compiles without inlining and unrolling. Additionally, it would make sense in this particular example to join all tables into a single array in order to fuse all of the table loops.
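    A rough sketch of that fusing idea, emitting plain strings instead of FFC's CNodes (the factor/table pairs are made up for illustration):

    # Hypothetical sketch: emit one loop nest whose body accumulates every
    # preintegrated table with matching loop ranges, instead of one loop
    # nest per table. The sp[...]/PI names are invented.
    blocks = [("sp[409]", "PI44"), ("sp[410]", "PI45")]
    
    body = "\n".join(
        "        A[45 * i + j] += %s * %s[0][i][j];" % (factor, table)
        for factor, table in blocks
    )
    print("for (int i = 0; i < 45; ++i)\n"
          "    for (int j = 0; j < 45; ++j)\n"
          "    {\n%s\n    }" % body)
    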

    If you think that this is the right direction, I can flesh out my solution a little bit more, run the tests, etc.

  3. Fabian Löschner

    I tried an alternative: splitting each A[...] = assignment into multiple assignments so that there are only 8 terms in each sum instead of the previous ~90. Unsurprisingly, this did not help; GCC still eats memory when compiling the generated code. Apparently this transformation does not prevent the optimization that is responsible for the problem.
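    Roughly, that splitting turns one long accumulation into several shorter statements, along these lines (the term names are stand-ins for the generated subexpressions):

    # Stand-in sketch of the attempted splitting: break one ~90-term sum
    # into chunks of 8 terms each. t0, t1, ... are invented placeholders.
    terms = ["t%d" % i for i in range(90)]
    
    chunks = [terms[i:i + 8] for i in range(0, len(terms), 8)]
    lines = ["A[k] = %s;" % " + ".join(chunks[0])]
    lines += ["A[k] += %s;" % " + ".join(c) for c in chunks[1:]]
    print("\n".join(lines))  # 12 short statements instead of one long one
    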

    Furthermore, I continued with the approach mentioned above and improved it to actually produce correct code. I ran the regression tests on my changes; they obviously fail the code diff checks, but all result validation checks pass. The full code of my changes can be found in this commit. The generated code looks pretty ugly, e.g.:

    int k = 0;
    int l = 0;
    for (int i = 8; i < 45; ++i)
    {
        for (int j = 0; j < 7; ++j)
        {
            A[45 * i + j] += sp[409] * PI44[0][k][l];
            ++l;
        }
        for (int j = 8; j < 45; ++j)
        {
            A[45 * i + j] += sp[409] * PI44[0][k][l];
            ++l;
        }
        l = 0;
        ++k;
    }
    k = 0;
    

    but the code generation by FFC/UFLACS for this example is much faster than before, and GCC compiles it without any problems. As a heuristic for switching to this behavior, I currently check whether any of the preintegration tables is larger than 1024 values (the example that triggered this issue has 44*44 = 1936 entries in its largest table). For the regression tests I used a limit of 10 entries to ensure that the loop code path is exercised by all test cases.
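    In sketch form, the heuristic is just a size check per preintegrated table; the function and constant names below are invented, not FFC API:

    # Fall back to loop generation when a table is too large to unroll.
    MAX_UNROLL_TABLE_SIZE = 1024  # illustrative cutoff, as described above
    
    def should_generate_loops(table_shape):
        """True if a preintegrated table of this shape should get loops."""
        size = 1
        for dim in table_shape:
            size *= dim
        return size > MAX_UNROLL_TABLE_SIZE
    
    print(should_generate_loops((44, 44)))  # True: 1936 > 1024
    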
