Enable Vectorisation in McLachlan
Kranc-generated thorns now support optimisation through vectorisation. The option UseVectors -> True just needs to be set when creating the thorn. The attached patches enable this for the BSSN thorns in McLachlan.
I have tested that this gives a significant performance increase (close to 2x in the right-hand sides in the cases I tried), and that the results of a BBH simulation agree with those obtained with vectorisation disabled to within the expected roundoff differences. Additionally, the testsuites still pass with this patch applied.
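As an illustration of what "agreeing to within roundoff" can mean in practice, here is a minimal sketch of a maximum-relative-difference check between two outputs. The function name and any tolerance applied to its result are assumptions made for this example; this is not the comparison actually used for the BBH runs or by the testsuites.

```c
#include <stddef.h>

/* Hypothetical helper: largest relative difference between two arrays,
   e.g. one grid function from a vectorised and a non-vectorised run. */
static double max_rel_diff(const double *a, const double *b, size_t n)
{
  double worst = 0.0;
  for (size_t i = 0; i < n; ++i) {
    double abs_a = a[i] < 0.0 ? -a[i] : a[i];
    double abs_b = b[i] < 0.0 ? -b[i] : b[i];
    double scale = abs_a > abs_b ? abs_a : abs_b;
    double diff = a[i] - b[i];
    if (diff < 0.0)
      diff = -diff;
    /* Two exact zeros count as agreeing; avoid dividing by zero. */
    double rel = scale > 0.0 ? diff / scale : 0.0;
    if (rel > worst)
      worst = rel;
  }
  return worst;
}
```

A result a small multiple of machine epsilon above zero is what one would expect when only the order of floating-point operations changes; how much growth is acceptable over a long evolution is a judgment call.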
Comments (12)
-
reporter -
- removed comment
For such a drastic change, perhaps we should run the McLachlan tests on several of the more important and exotic machines. See
http://einsteintoolkit.org/release-info/parse_testsuite_results.php
for the machines that we run on before a release.
-
- removed comment
(We don't need explicit patches for autogenerating code; e.g. regenerating configure should not be in a patch either.)
Instead of testing various machines I would test several architectures. In particular we should test:
- SSE 4.1 (modern Intel)
- SSE 4a (modern AMD)
- SSE 2 (old Intel or AMD)
- [VSX (Power 7)]
- [Double Hummer (Blue Gene/P)]
I'm not sure about the last two architectures. Without Blue Waters, Power 7 has become much less interesting, although we still have access to such a machine at LSU. We don't use BG/P in production, and probably won't because the architecture is dated; BG/Q will be interesting.
Having said this, testing on Datura, Damiana, and Kraken (with the Intel compiler) should do the trick. We may want to throw in a system with the PGI compiler as well since that compiler needs some special casing in some regions of the code.
-
reporter - removed comment
I agree that this should certainly be well tested before being applied.
Replying to [comment:3 eschnett]:
> Instead of testing various machines I would test several architectures. In particular we should test:
> - SSE 4.1 (modern Intel)
> - SSE 4a (modern AMD)
> - SSE 2 (old Intel or AMD)
> - [VSX (Power 7)]
> - [Double Hummer (Blue Gene/P)]
I have verified that the tests pass on SSE 4.1, SSE 4a and SSE 2 machines with vectorisation enabled immediately after commit 4c04a8bc35cf7706e144fe771ba5d6c907f5a455, which was just before the recent schedule changes.
> I'm not sure about the last two architectures. Without Blue Waters, Power 7 has become much less interesting, although we still have access to such a machine at LSU. We don't use BG/P in production, and probably won't because the architecture is dated; BG/Q will be interesting.
Unfortunately, I don't have access to any machine with these architectures.
> Having said this, testing on Datura, Damiana, and Kraken (with the Intel compiler) should do the trick. We may want to throw in a system with the PGI compiler as well since that compiler needs some special casing in some regions of the code.
I have verified that the McLachlan tests pass with vectorisation enabled on these three machines with the Intel compiler. I haven't yet tried with the PGI compiler.
-
- removed comment
The only machine where we use the PGI compiler by default is Hopper at NERSC, and even there the Intel compiler is now available.
-
- removed comment
What I wanted to say is that I think we are ready to apply this patch. Do we agree?
-
reporter - removed comment
Yes, I agree anyway.
-
reporter - removed comment
Replying to [comment:4 barry.wardell]:
> I have verified that the tests pass on SSE 4.1, SSE 4a and SSE 2 machines with vectorisation enabled immediately after commit 4c04a8bc35cf7706e144fe771ba5d6c907f5a455, which was just before the recent schedule changes.
Correction: I have verified this only for SSE4.1 and SSE2. Although I tested on Kraken, which has CPUs supporting SSE4a, it turns out that the Vectors thorn only used SSE2. I guess the SimFactory optionlist for Kraken must not enable SSE4a? Is there a machine which does compile using SSE4a? Can Kraken be modified to do so?
-
- removed comment
How do you know that SSE 4a was not used? This is autodetected in vectors-8-SSE2.h. It may be that this autodetection is faulty, of course, if e.g. the Intel and GNU compilers use different conventions here.
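For reference, compile-time detection of this kind usually keys off macros that the compiler predefines for the selected target. Below is a minimal sketch, assuming the GCC-style macros `__SSE2__`, `__SSE4_1__` and `__SSE4A__`; the function and flag names are made up for illustration, and the actual logic in vectors-8-SSE2.h may differ.

```c
/* Hypothetical feature flags, for illustration only. */
enum { HAVE_SSE2 = 1, HAVE_SSE4_1 = 2, HAVE_SSE4A = 4 };

/* Report which SSE levels the compiler was told to target, based on the
   macros it predefines. Other compilers may use different conventions. */
static int sse_features(void)
{
  int features = 0;
#ifdef __SSE2__
  features |= HAVE_SSE2;
#endif
#ifdef __SSE4_1__
  features |= HAVE_SSE4_1;   /* Intel extension */
#endif
#ifdef __SSE4A__
  features |= HAVE_SSE4A;    /* AMD extension */
#endif
  return features;
}
```

If a compiler does not define one of these macros for a CPU that supports the extension, a check like this silently falls back to the lower level, which is one way the autodetection could report only SSE 2.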
Most of the vector instructions that we are using are defined in SSE 2. SSE 4.1 defines an instruction that allows a more efficient IfThen implementation; SSE 4a provides a more efficient implementation of a streaming partial store. Since you probably don't use streaming stores (Ian found them slower), it should make no difference whether SSE 4a is present or not.
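To make the IfThen point concrete, here is a minimal sketch of a lane-wise select on two doubles. With SSE 4.1 it is a single blend instruction; with plain SSE 2 it takes three bitwise operations. The helper names `select_pd` and `abs_ifthen` are hypothetical and are not the Vectors thorn API; the sketch also assumes the mask comes from a comparison, so each lane is all ones or all zeros.

```c
#include <emmintrin.h>     /* SSE2 intrinsics */
#ifdef __SSE4_1__
#  include <smmintrin.h>   /* SSE4.1: _mm_blendv_pd */
#endif

/* Hypothetical helper: per lane, return x where mask is set, else y.
   Assumes mask lanes are all-ones or all-zeros (a comparison result). */
static inline __m128d select_pd(__m128d mask, __m128d x, __m128d y)
{
#ifdef __SSE4_1__
  /* One instruction: take x where the mask's sign bit is set, else y. */
  return _mm_blendv_pd(y, x, mask);
#else
  /* SSE2 fallback: (mask & x) | (~mask & y). */
  return _mm_or_pd(_mm_and_pd(mask, x), _mm_andnot_pd(mask, y));
#endif
}

/* Example use: out[i] = (in[i] < 0) ? -in[i] : in[i], for two doubles. */
static inline void abs_ifthen(const double *in, double *out)
{
  __m128d v = _mm_loadu_pd(in);
  __m128d zero = _mm_setzero_pd();
  __m128d is_negative = _mm_cmplt_pd(v, zero);
  __m128d negated = _mm_sub_pd(zero, v);
  _mm_storeu_pd(out, select_pd(is_negative, negated, v));
}
```

Both branches give identical results for comparison masks, so the choice between them affects speed only, not the values computed.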
-
- removed comment
I checked on Kraken, an AMD machine with the Intel compiler. It seems the Intel compiler does not support AMD's SSE extensions, i.e. does not support SSE 4a. (The corresponding include file ammintrin.h is not present.) Therefore, and for the reasons I gave above, let's ignore SSE 4a, and just apply the patch.
-
reporter - changed status to resolved
- removed comment
I have now committed this.
-
- changed status to closed
- edited description
I have a second patch which regenerates the code with vectorisation enabled, but it is larger than the maximum attachment size allowed by Trac, so I can't attach it. It is just the result of running make in the 'm' directory.