Improve Compilation Time

Issue #111 new
Nils Deppe created an issue

Hi Klaus,

It would be great to have Blaze compile several times faster than it currently does. Adding Blaze to our code doubled compilation time, which is a big regression to accept. It might be worth looking at Templight to profile the compilation and see what's taking the most time to generate. Typically SFINAE and type creation (such as structs) are by far the most expensive. I noticed a lot of Blaze classes have several member structs that only contain an enum : bool. This might be a place to start looking to decrease compilation time.
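To make that concrete, here is a rough sketch of the kind of pattern I mean and a flatter alternative (the names are made up for illustration and are not actual Blaze code):

// Illustrative names only, not actual Blaze classes.
// Pattern I keep seeing: a nested struct whose only job is to hold an enum : bool.
template <typename T>
struct SomeExpr {
   struct IsAligned { enum : bool { value = true }; };  // one extra type per instantiation
};

// Flatter alternative: the same information without instantiating a nested type.
template <typename T>
struct SomeExprFlat {
   static constexpr bool isAligned = true;
};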

As a side note, if the compilation time remains as is we will probably end up not using Blaze.

Best,

Nils

Comments (12)

  1. Klaus Iglberger

    Hi Nils!

    Thanks for raising this issue. That is indeed a very important task that we will definitely look into to see what we can do to improve compile times. However, this is of course an open ended task since we will always have to take care of this. As a short term solution, please try only to include the features that you need. For instance, in case you are only interested in CustomVector, it is sufficient to include <blaze/math/CustomVector.h> instead of <blaze/Blaze.h>, which includes the entire functionality. This already significantly decreases compile time (approx. 30%).
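    For illustration, the difference is simply which header you include (which headers you actually need depends on the features you use):

    // Includes the entire library:
    #include <blaze/Blaze.h>

    // Includes only what is needed for CustomVector:
    #include <blaze/math/CustomVector.h>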

    Please keep in mind that one of the primary ideas of Blaze is to do as much at compile time as possible. In comparison to other libraries, Blaze has many more decision paths, for instance to restructure the expressions within a statement. This adds to the compile time. Also, Blaze in many cases does not generalize in order to maximise performance even in corner cases, which results in more code. Third, Blaze is the only library that is not limited to fundamental and complex element types but that enables you to use arbitrary element types. This also adds complexity (more kernels, more code, ...). Fourth, Blaze is the only C++ math library that is fully parallel (yet another reason for additional code and complexity). And finally, in Blaze we also try to cover the special cases. For instance, only Blaze provides optimisations for all possible use cases of an identity matrix. This also adds code and complexity. In summary, Blaze will unfortunately never compile as fast as other libraries. However, we are confident that it will result in faster runtime code than any other library (especially when parallelization is used) and will also cover all possible special cases.

    We unfortunately cannot promise that compilation times will significantly decrease in the near future. The only thing that we can promise is that we will devote effort to improve compile times as much as possible.

    Best regards,

    Klaus!

  2. Nils Deppe reporter

    Hi Klaus,

    I will try reducing what is included; hopefully that will help enough. I don't think the argument that template-heavy C++ simply takes long to compile is a good one. See Odin Holmes's recent work on the subject: people understand which aspects of TMP are slow and how to improve compile time, so it is mostly a matter of effort. However, I sympathize with the difficulty of the task (which is why I linked to Templight before). We manipulate many large typelists in our own code and make extensive use of TMP, yet have very acceptable compile times. Just like runtime performance, it is a matter of profiling and then optimizing the code. I do fear that truly bringing down compile time would require a very substantial rethinking of the way Blaze is implemented, e.g. all the member structs of class templates should very likely be replaced with type aliases.
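    As a sketch of what I mean by that last point (illustrative names only, not actual Blaze code), a nested member struct that merely wraps a type can usually become an alias:

    // Nested struct: an additional class that the compiler has to create.
    template <typename VT>
    struct SomeVector {
       struct Result { using Type = VT; };
    };

    // Type alias instead: no extra class needs to be instantiated.
    template <typename VT>
    struct SomeVectorFlat {
       using ResultType = VT;
    };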

    Best,

    Nils

  3. Nils Deppe reporter

    Hi Klaus,

    I ran an empty executable that does nothing but include Blaze through Templight to see what's so slow... First, just including Blaze took the compile time to 3 seconds, even though I'm not using any of it. That seems like a pretty steep price to pay for literally doing nothing. Interestingly, a lot of time seems to be spent in EnableIf_ (you can probably reduce this by calling ::value instead of having a type alias that takes the trait as a template parameter). I'm still rather confused why anything is happening at all just from including the library; I was expecting/hoping for effectively zero overhead. I even made sure it wasn't the preprocessor that's taking up the time. I think the solution will ultimately be quite close to a complete rewrite of the way Blaze deals with templates and SFINAE. Even just cleaning up all the EnableIf stuff will be quite tedious and time consuming, and I suspect there are many other things causing such a crazy slowdown (maybe too many classes that are not templates and are therefore created just by including the headers, or just excessive use of SFINAE). Calling the max function also seems to be taking up a lot of time...
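    To be clear about the EnableIf_ remark, here is the generic shape of the two alternatives (a sketch of the pattern only, with made-up names like EnableIfTrait, EnableIfBool and IsFoo, not the actual Blaze implementation):

    #include <type_traits>

    // Alias that takes the condition as a *type* and digs out ::value internally.
    template <typename Condition, typename T = void>
    using EnableIfTrait = typename std::enable_if<Condition::value, T>::type;

    // Alias that takes a plain bool; the caller writes ::value once at the call site.
    template <bool Condition, typename T = void>
    using EnableIfBool = typename std::enable_if<Condition, T>::type;

    // Usage difference, with some trait IsFoo:
    //   EnableIfTrait< IsFoo<T> >     vs.     EnableIfBool< IsFoo<T>::value >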

    Anyway, hopefully you'll find some time to work on this. Regardless I'll be looking to replace Blaze with something that compiles faster and provides the same runtime performance for the manipulations we're interested in. For the size of matrices we're interested in LIBXSMM destroys Blaze and the MKL in terms of performance anyway, so we only care about pointwise operations using expression templates.

    Best,

    Nils

  4. Ray Kim

    I also suggest that, though it is not common, the library's constructs could be precompiled for basic element types such as float and double using template specialization. I think this could be considered too.
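    Roughly something along the lines of explicit instantiation declarations (VectorType is only a placeholder here, not an actual Blaze class):

    // Placeholder class template standing in for a library class.
    template <typename T>
    class VectorType { /* ... */ };

    // In a header: tell every translation unit that these common specializations
    // are instantiated elsewhere, so they are not re-instantiated everywhere.
    extern template class VectorType<float>;
    extern template class VectorType<double>;

    // In exactly one (precompiled) source file: provide the instantiations.
    template class VectorType<float>;
    template class VectorType<double>;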

  5. Daniel Baker

    I will say that as a heavy user of Blaze, I find the many using declarations essential to usage. I'm extremely happy with its runtime performance, and I don't think that reducing compilation time (a relatively small price to pay, especially one that's already parallelized across all source files) is worth more effort than expanding functionality. Given limited resources and a large number of important applications, I would emphasize the latter.

  6. Nils Deppe reporter

    @dnbh what using declarations are you talking about? Also, using declarations are not the part that makes compilation slow, it is the creation of many intermediate types that is slow. For example, the type traits are all implemented in a very strange manner that sometimes creates three or more types when only one is necessary. Then, when they are used it often ends up being 10 or more types instead of 2. Another slow factor seems to be parse time, but I haven't had time to fully understand that yet.

    Regarding the point that compilation is parallelized: you are paying for the Blaze compilation time in every source file that includes it, so no, parallelization is not buying you performance here; if anything it makes things worse. The other factor is that the creation of many intermediate types uses a lot of memory, which sometimes makes parallel builds impossible because of memory constraints. I shouldn't need 64GB of RAM to compile my code because I use Blaze...

  7. Klaus Iglberger

    Hi Nils!

    Could you please give an example for a type trait that is implemented in a strange manner and due to type bloat causes a provable compile time decrease? Could you also please provide an example for how the type trait should be implemented instead?

    Best regards,

    Klaus!

  8. Nils Deppe reporter

    Hi Klaus!

    One example is here: https://bitbucket.org/blaze-lib/blaze/src/bedc9169532622139e00055fd49f10c13f059b8a/blaze/math/typetraits/HasSIMDAtan2.h?at=master&fileviewer=file-view-default

    which could be implemented roughly as:

    #include <type_traits>

    // Primary template: no vectorized atan2 by default.
    template <class T1, class T2>
    struct HasSIMDAtan2_impl : std::false_type {};

    // Full specialization for double/double.
    template <>
    struct HasSIMDAtan2_impl<double, double>
        : std::integral_constant<bool, bool( BLAZE_SSE_MODE     ) ||
                                       bool( BLAZE_AVX_MODE     ) ||
                                       bool( BLAZE_MIC_MODE     ) ||
                                       bool( BLAZE_AVX512F_MODE )> {};

    // Same specialization for float, float

    // Strip cv/reference qualifiers once, then dispatch to the implementation.
    template <class T1, class T2>
    using HasSIMDAtan2 = typename HasSIMDAtan2_impl<std::decay_t<T1>, std::decay_t<T2>>::type;

    // If variable templates were supported:
    template <class T1, class T2>
    constexpr bool HasSIMDAtan2_v = HasSIMDAtan2<T1, T2>::value;
    

    Now the other part that's not entirely clear is why there are different traits for Atan2, Cos, etc. other than that the BLAZE_blah_MODEs may be different. If that is the only reason then this could be further decreased to:

    template <class T1, class T2>
    struct are_float_or_double : std::false_type {};
    template <>
    struct are_float_or_double<double, double> : std::true_type {};
    template <>
    struct are_float_or_double<float, float> : std::true_type {};

    template <class T1, class T2>
    using HasSIMDAtan2 = std::integral_constant<bool,
        are_float_or_double<std::decay_t<T1>, std::decay_t<T2>>::value &&
        ( bool( BLAZE_SSE_MODE     ) ||
          bool( BLAZE_AVX_MODE     ) ||
          bool( BLAZE_MIC_MODE     ) ||
          bool( BLAZE_AVX512F_MODE ) )>;
    

    which would significantly reduce the number of traits and templates, and cut the number of lines by somewhere around 1000 here, or maybe even 4000+ if applied to other places in the code. The example I showed here creates four types: std::integral_constant<bool, true> (or false), the two std::decay<T>, and the are_float_or_double. The type alias lookups are also really fast (see Odin Holmes's talk I linked to previously). The current implementation in Blaze creates 8 types and won't benefit from memoization to the same extent as my implementation.

    Now, I don't expect any single type trait change to alter compilation time by a noticeable amount. It would require basically propagating the change throughout the entire code base and then measuring. It's really difficult to measure a single small change in class instantiation in isolation, because you're profiling the compiler while it's doing so many other things.

    Another unnecessary set of traits that Blaze uses a lot is Or and And, instead of calling ::value and then just using the or and and operators at the value level. I think the above suggested changes would be good even just to improve code maintainability and to simplify the type traits. Right now this is all much more complicated than it needs to be.
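    To show the shape of that suggestion too (again just a sketch with made-up trait names, not the actual Blaze definitions):

    #include <type_traits>

    // Two placeholder traits.
    template <typename T> struct IsFoo : std::false_type {};
    template <typename T> struct IsBar : std::false_type {};

    // Trait-level conjunction: every use instantiates an extra And<...> type.
    template <typename T1, typename T2>
    struct And : std::integral_constant<bool, T1::value && T2::value> {};

    template <typename T>
    using WithTraitAnd = std::enable_if_t< And< IsFoo<T>, IsBar<T> >::value >;

    // Value-level conjunction: the same condition without the intermediate trait.
    template <typename T>
    using WithValueAnd = std::enable_if_t< IsFoo<T>::value && IsBar<T>::value >;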

    It would be good to use Michal Dominiak's benchmarking method: https://www.youtube.com/watch?v=OVJzn93FcAk&t=1s to figure out what is happening a bit more.

    If you're open to pull requests for this, it is something that I'd be willing to help work on. However, I don't want to take the time to change dozens to hundreds of type traits only to have the PR rejected. Hopefully a reduction in the number of traits by a factor of 2 or more, together with a reduction in the number of types created, is a convincing argument that this is a win for the code base regardless of any change in compilation performance, though it should help compile times anyway.

    Hopefully that helps a bit!

    Cheers,

    Nils

  9. Klaus Iglberger

    Hi Nils!

    Thanks for the feedback. We will try and see what can be gained by this.

    Best regards,

    Klaus!

  10. Nils Deppe reporter

    Hi Klaus,

    You're welcome! Please let me know if it helps. If you would like, I can try it as well and see what happens. How does that sound?

    Cheers,

    Nils

  11. Klaus Iglberger

    Hi Nils!

    Here is some feedback on the compilation time issue. We have tested several options, among others your proposed ideas. Whereas reworking the type traits gives some improvement (the expectation is in the low single-digit range), removing And, Or, and Not and reworking EnableIf, DisableIf and If has a significant impact on compilation time. Therefore we now consider this a reasonable and required modification. We expect the necessary changes to take at least a week, though, as there are for instance approx. 2500 instances of EnableIf in the math module of Blaze.

    At this point in time, we unfortunately would have to do without variable templates. We are required to support the Intel 16 compiler, which does not support this C++14 feature (see the official Intel feature list). However, without variable templates the refactoring effort is even bigger and would essentially have to be done twice, because by the time we introduce variable templates we would have to change everything again. In order to save time and effort, we will therefore combine the introduction of variable templates with the mentioned modifications. Our plan is to finish Blaze 3.3 first and focus on compilation time at the beginning of Blaze 3.4.

    I hope this helps to give you the certainty that compilation times will improve in the foreseeable future.

    Best regards,

    Klaus!

  12. Nils Deppe reporter

    Hi Klaus!

    Thank you for looking into this, I really appreciate that! It sounds like you have a good plan moving forward and I think the timeline is very reasonable. I knew the changes needed to improve compile times would be quite significant :) I'm also not surprised by And, Or and Not being candidates for the biggest improvements. Once you're making the changes, I'd be happy to do the type traits to squeeze out any last bit of compilation time reductions :)

    I understand the issue with Intel compilers not supporting C++ features. We have the luxury of being able to abandon the Intel compiler and rely on Blaze to get us the vectorization, rather than a compiler that is tuned to optimizing loops ;) Of course, as an HPC targeted library that's not something Blaze can realistically do.

    Thank you again for all the hard work :)

    Best,

    Nils
