vectorization tuning on NEC vector engine

I’m trying to port GS2 to the new Plasma Simulator at NIFS, where the main computation runs on NEC’s vector engine (SX-Aurora TSUBASA). Working with an NEC engineer, I found that improving vectorization requires a major change in loop structure, as shown in the branch optimize-raijin-by-nec. I’d like to discuss whether this change is acceptable and, if it is not, to ask for advice on other options.

Because the NEC compiler vectorizes the innermost loop, the innermost loop should be the iglo loop, so the itgrid and iglo loops need to be swapped. The part that dominates performance is invert_rhs. The memory layout should also be changed, but that would be another big change and I haven’t tried it. It would be nice to choose which loop gets vectorized with a directive, but unfortunately no such directive exists so far.
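To make the loop swap concrete, here is a minimal sketch in C. It is not GS2 code: the function names, the arrays g, a and r, and the simple first-order recurrence are all assumptions chosen for illustration. It only assumes that the work along the grid index depends on the previous grid point while different iglo values are independent, and that the grid index is the fastest-varying one in memory.

```c
#include <stddef.h>

/* Hypothetical recurrence along the grid index `it`, independent
   across `iglo`. Element (it, iglo) is stored at iglo*nt + it,
   i.e. `it` is the fastest-varying index (an assumed layout). */

/* Original ordering: iglo outer, it inner. The inner loop carries a
   dependence (element it needs element it-1), so a compiler that
   vectorizes the innermost loop cannot vectorize it. */
void sweep_iglo_outer(double *g, const double *a, const double *r,
                      size_t nt, size_t nglo)
{
    for (size_t iglo = 0; iglo < nglo; ++iglo) {
        for (size_t it = 1; it < nt; ++it) {
            size_t k = iglo * nt + it;
            g[k] = a[k] * g[k - 1] + r[k];
        }
    }
}

/* Swapped ordering: it outer, iglo inner. Iterations of the inner
   loop are independent, so it is a candidate for vectorization.
   With this layout the inner loop accesses memory with stride nt,
   which is why a layout change would be the natural next step. */
void sweep_iglo_inner(double *g, const double *a, const double *r,
                      size_t nt, size_t nglo)
{
    for (size_t it = 1; it < nt; ++it) {
        for (size_t iglo = 0; iglo < nglo; ++iglo) {
            size_t k = iglo * nt + it;
            g[k] = a[k] * g[k - 1] + r[k];
        }
    }
}
```

Both functions perform the same operations in the same order for each iglo, so they should give identical results. Whether a given compiler actually vectorizes the second version depends on its aliasing analysis and options, so the compiler's vectorization report is worth checking.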

The attached figure shows strong scaling of benchmarks/timestep on Plasma Simulator and on JFRS-1 with the Cray and Intel compilers. Without this change, we apparently cannot take advantage of the vector engine. Using the same code, I also checked whether the change affects other systems, e.g. JFRS-1: performance is almost the same with the Cray compiler but degrades slightly with Intel. Note that I haven’t thoroughly worked through the code yet; I have only tested the idea.
