- changed component to uflacs-repr
UFLACS preintegration emits inefficient code
from dolfin import *
import dijitso; dijitso.set_log_level("debug")
fcp = parameters['form_compiler']
#fcp.add('enable_preintegration', False) # Fixes the problem
mesh = UnitCubeMesh(1, 1, 1)
V = FunctionSpace(mesh, 'Nedelec 1st kind H(curl)', 3)
u, v = TrialFunction(V), TestFunction(V)
a = (inner(curl(u), curl(v)) - Constant(1)*inner(u, v))*dx
assemble(a)
The C++ compiler eats over 10 GB of memory when trying to compile the form above. The problem is present in releases 2017.1.0 and 2017.2.0.
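To see why unrolling blows up here: third-degree Nédélec edge elements on a tetrahedron have 45 dofs per cell, so a fully unrolled preintegrated dof block expands into 45 × 45 separate A[...] += ... statements, repeated for every block in the form. A back-of-the-envelope sketch (pure Python; the block counts beyond the 45 dofs are made-up illustrations, not measured from this form):

```python
# Rough size comparison between unrolled and looped code generation
# for preintegrated dof blocks. DOFS = 45 comes from the issue
# (Nedelec 1st kind H(curl), degree 3, tetrahedron); the block counts
# passed in below are purely illustrative.
DOFS = 45

def unrolled_statements(n_blocks):
    # One "A[...] += sp[...] * PI[...][i][j];" statement per (i, j) pair.
    return n_blocks * DOFS * DOFS

def looped_statements(n_blocks):
    # Two for-loop headers plus one assignment statement per block.
    return n_blocks * 3

for blocks in (1, 10, 100):
    print(blocks, unrolled_statements(blocks), looped_statements(blocks))
```

A single block already yields 2025 assignment statements when unrolled, versus three lines when looped, which is why the generated translation unit overwhelms g++.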
Comments (12)
-
reporter -
Is it because of too much unrolling? Do you have the same problem with TSFC tensor mode?
-
reporter No, tsfc keeps loops over rows and columns of the cell tensor for both spectral (the default in FFC) and tensor mode. To summarize, with

fcp['representation'] = 'uflacs'

g++ eats more than 10 GB of memory (I don't know how much exactly, I killed it), because the loops over i in range(45), j in range(45) get unrolled.

fcp['representation'] = 'uflacs'
fcp.add('enable_preintegration', False)

works fine, as does

fcp['representation'] = 'tsfc'
fcp.add('mode', 'spectral')
# COFFEE finished in 0.0179336 seconds (flops: 1350658 -> 1350658)

and

fcp['representation'] = 'tsfc'
fcp.add('mode', 'tensor')
# COFFEE finished in 0.0176239 seconds (flops: 365223 -> 365223)
-
Does UFLACS not have an option that leaves preintegration on but turns off unrolling?
-
reporter This logic pretty much documents what modes are possible.
-
reporter And this method runs when

block_mode = "preintegrated"

and seems pretty much responsible for the unrolling over tensor indices and the quadrature loop.
-
EDIT: The code presented below produces wrong results - I just presented it here to demonstrate that the issue vanishes if the assignments don't get unrolled.
I enforced that preintegrated tables don't get inlined and modified the
generate_preintegrated_dofblock_partition()
method such that it creates loops, e.g. in your example

for (int i = 0; i < 44; ++i)
    for (int j = 0; j < 44; ++j)
        A[45 * i + j] += sp[409] * PI44[0][i][j];

and so on. With these modifications, gcc compiles the example without problems. So maybe it is sensible to define an upper size limit for inlining of the preintegration tables?
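The proposed size limit could take the shape of a simple switch in the code generator: unroll small tables, emit loops for large ones. A pure-Python sketch of the idea (all names are hypothetical, this is not FFC's actual API, and for simplicity the block is assumed to span the whole element tensor so the row stride equals the block's column count):

```python
def generate_block(table_name, shape, threshold=1024):
    """Emit C-like source for 'A[...] += w * table[0][i][j];'.

    Unrolls the assignments when the table is small, emits loops
    otherwise. Function name, threshold, and output format are
    illustrative sketches, not FFC's real code generator.
    """
    rows, cols = shape
    if rows * cols <= threshold:
        # Small table: one flat statement per (i, j) pair.
        return "\n".join(
            f"A[{i * cols + j}] += w * {table_name}[0][{i}][{j}];"
            for i in range(rows)
            for j in range(cols)
        )
    # Large table: a single nested loop instead of rows*cols statements.
    return (
        f"for (int i = 0; i < {rows}; ++i)\n"
        f"  for (int j = 0; j < {cols}; ++j)\n"
        f"    A[i * {cols} + j] += w * {table_name}[0][i][j];"
    )
```

With a 1024-entry threshold, the 44 × 44 = 1936-entry table from this issue would take the looped path while small tables keep the unrolled form the compiler can constant-fold.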
Regarding my modification of the method for testing: I replaced the loop

for ii in itertools.product(*blockranges):
    ...

with

A_idx = 1
for index_symbol in arg_indices:
    A_idx *= index_symbol
P_arg_indices = arg_indices
P_arg_indices_rev = tuple(reversed(arg_indices))
if blockdata.transposed:
    P_arg_indices, P_arg_indices_rev = P_arg_indices_rev, P_arg_indices
P_ii = P_entity_indices + P_arg_indices
A_rhs = f * PI[P_ii]
loop_code = L.AssignAdd(A[A_idx], A_rhs)
for idx, blockrange in enumerate(blockranges):
    loop_code = L.ForRange(P_arg_indices_rev[idx], blockrange.start, blockrange.stop, loop_code)
parts += [loop_code]

and directly returned the
parts
list without calling
generate_tensor_value_initialization()
afterwards. Obviously it's just a hack, but it demonstrates that the result compiles without inlining and unrolling. Additionally, it would make sense in this particular example to join all tables into a single array in order to fuse all table loops.

If you think that this is the right direction, I can flesh out my solution a little bit more, run the tests, etc.
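The key property of that replacement can be checked in plain Python: wrapping the assignment in nested loops (innermost wrapped first, hence the reversed index tuple in the snippet above) visits exactly the same index combinations as the original itertools.product loop, just without materializing one statement per combination. A standalone sketch, not FFC code:

```python
import itertools

def product_order(blockranges):
    # Original approach: one fully unrolled iteration per combination.
    return list(itertools.product(*blockranges))

def nested_loop_order(blockranges):
    # Replacement approach: nested loops, outermost range first.
    # Mirrors wrapping the innermost ForRange first in the hack above.
    out = []

    def loop(depth, prefix):
        if depth == len(blockranges):
            out.append(tuple(prefix))
            return
        for idx in blockranges[depth]:
            loop(depth + 1, prefix + [idx])

    loop(0, [])
    return out

# Same visiting order for a block shaped like the ones in this issue.
ranges = (range(8, 45), range(0, 8))
assert product_order(ranges) == nested_loop_order(ranges)
```

Since both traversals agree element by element, emitting loops instead of unrolled statements cannot change which entries of the element tensor get accumulated, only the size of the generated source.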
-
- attached tsfc_spectral_works_ffc_form.cpp
- attached tsfc_tensor_works_ffc_form.cpp
- attached uflacs_broken_ffc_form.cpp
For reference: the generated code that breaks gcc (uflacs_broken_ffc_form.cpp), the working code generated by tsfc in spectral mode (tsfc_spectral_works_ffc_form.cpp), and the working code generated in tensor mode (tsfc_tensor_works_ffc_form.cpp).
-
I tried an alternative by splitting each

A[...] = ...

assignment into multiple assignments, e.g. such that there are only 8 terms in each sum instead of the previous ~90. Unsurprisingly, this did not help: GCC still eats memory when compiling the generated code. Apparently this transformation does not prevent the optimization that is responsible for the problem.

Furthermore, I continued to follow the approach I mentioned above and improved it to actually produce correct code. I ran the regression tests on my changes; it obviously failed the code diff checks, but all result validation checks were successful. The full code of my changes can be found in this commit. The generated code looks pretty ugly, e.g.:

int k = 0;
int l = 0;
for (int i = 8; i < 45; ++i)
{
    for (int j = 0; j < 7; ++j)
    {
        A[45 * i + j] += sp[409] * PI44[0][k][l];
        ++l;
    }
    for (int j = 8; j < 45; ++j)
    {
        A[45 * i + j] += sp[409] * PI44[0][k][l];
        ++l;
    }
    l = 0;
    ++k;
}
k = 0;

but code generation by FFC/UFLACS for this example is much faster than before, and GCC compiles the result without any problems. As a heuristic for switching to this behavior I currently check whether one of the preintegration tables is larger than 1024 values (the example that triggered this issue has 44*44 = 1936 entries in its largest table). For the regression tests I used a limit of 10 entries to ensure that the loop-based path is exercised by all test cases.
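The manual k/l counters in the generated code exist because the dof block is non-contiguous in A (the j range skips some columns) while the preintegrated table PI44 is dense. A pure-Python model of the pattern (not FFC code; ranges mirror the generated snippet above) shows the two index spaces line up:

```python
# Model of the generated pattern: A is addressed by global dof indices
# (i, j) with row stride n_cols, while the table is addressed by running
# block-local counters (k, l) that advance densely, row-major.
def block_updates(i_range, j_ranges, n_cols=45):
    updates = []
    k = 0
    for i in i_range:
        l = 0
        for j_range in j_ranges:
            for j in j_range:
                updates.append((n_cols * i + j, (k, l)))
                l += 1
        k += 1
    return updates

# Ranges as emitted in the snippet: rows 8..44, columns 0..6 and 8..44.
updates = block_updates(range(8, 45), [range(0, 7), range(8, 45)])

# The table-side counters enumerate the block densely, row-major,
# even though the A-side indices have a gap in the column range.
assert [t for _, t in updates] == [
    (k, l) for k in range(37) for l in range(44)
]
```

This is why the counters cannot simply be replaced by expressions in i and j: the gap in the j range would leave unused holes in the table indexing.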
-
- attached issue_173.cpp
Example code without dependencies to reproduce the issue with external compilers
-
Reported as GCC bug https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85072
-
reporter - changed status to resolved
Fixed by a9f9111. Great job, @w1th0utnam3!