vectorization tuning on NEC vector engine

Issue #134 new
Ryusuke Numata created an issue

I’m trying to port GS2 to the new Plasma Simulator at NIFS, where the main computation is performed on NEC’s vector engine (SX-Aurora TSUBASA). With the help of an NEC engineer, I found that, to improve vectorization, we need a major change in loop structure, as shown in the branch optimize-raijin-by-nec. I’d like to discuss whether this change is acceptable and, if not, ask for advice on other options.

Because the NEC compiler vectorizes the innermost loop, the innermost loop should be the iglo loop, so the itgrid and iglo loops need to be swapped. The routine that dominates performance is invert_rhs. The memory layout should also be changed, but that would be another big change, and I haven’t tried it. It would be nice to be able to choose which loop is vectorized with a directive, but unfortunately no such directive exists so far.
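To make the loop swap concrete, here is a minimal C sketch of the idea. It is not the GS2 code. It assumes a simple first-order recurrence along the grid index, and the names `NGRID`, `NGLO`, `idx`, `sweep_grid_inner`, `sweep_iglo_inner` and `orders_agree` are all hypothetical. With the grid loop innermost, each iteration depends on the previous one, so that loop cannot be vectorized. With the iglo loop innermost, the iterations are independent, but because the grid index is the fastest-varying one in this layout, the accesses are strided, which is why the memory layout matters as well.

```c
#include <stddef.h>

/* Hypothetical sizes for illustration only; not taken from GS2. */
enum { NGRID = 8, NGLO = 16 };

/* Assumed layout: grid index fastest, so (ig, iglo) lives at iglo*NGRID + ig. */
static size_t idx(size_t ig, size_t iglo) { return iglo * NGRID + ig; }

/* Grid loop innermost: the inner loop carries a dependence on ig-1,
 * so a compiler that vectorizes the innermost loop cannot vectorize it. */
void sweep_grid_inner(double *g, const double *a, const double *s)
{
    for (size_t iglo = 0; iglo < NGLO; ++iglo)
        for (size_t ig = 1; ig < NGRID; ++ig)
            g[idx(ig, iglo)] = a[idx(ig, iglo)] * g[idx(ig - 1, iglo)]
                             + s[idx(ig, iglo)];
}

/* Loops swapped: the inner iterations over iglo are independent,
 * so the inner loop is vectorizable, but with stride NGRID in this layout. */
void sweep_iglo_inner(double *g, const double *a, const double *s)
{
    for (size_t ig = 1; ig < NGRID; ++ig)
        for (size_t iglo = 0; iglo < NGLO; ++iglo)
            g[idx(ig, iglo)] = a[idx(ig, iglo)] * g[idx(ig - 1, iglo)]
                             + s[idx(ig, iglo)];
}

/* Returns 1 if both loop orders give the same result on made-up test data. */
int orders_agree(void)
{
    double a[NGRID * NGLO], s[NGRID * NGLO];
    double g1[NGRID * NGLO], g2[NGRID * NGLO];
    const double tol = 1e-12;

    for (size_t k = 0; k < (size_t)(NGRID * NGLO); ++k) {
        a[k] = 0.5 + 0.001 * (double)k;   /* arbitrary coefficients */
        s[k] = 1.0 / (1.0 + (double)k);   /* arbitrary source term */
        g1[k] = g2[k] = (double)(k % 7);  /* arbitrary initial values */
    }

    sweep_grid_inner(g1, a, s);
    sweep_iglo_inner(g2, a, s);

    for (size_t k = 0; k < (size_t)(NGRID * NGLO); ++k) {
        double d = g1[k] - g2[k];
        if (d > tol || d < -tol)
            return 0;
    }
    return 1;
}
```

Calling `orders_agree()` checks that the swap does not change the result for this recurrence. The swap is valid here only because different iglo values never depend on each other, and whether that holds everywhere in the real routine would need to be checked against the actual code.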

The attached figure shows strong scaling of benchmarks/timestep on Plasma Simulator and on JFRS-1 with the Cray and Intel compilers. Without this change, we apparently cannot take advantage of the vector engine. Using the same code, I’ve checked whether the change affects performance on other systems, e.g. JFRS-1: it is almost the same with the Cray compiler but degrades slightly with Intel. Note that I haven’t worked through the code thoroughly; I have only tested the idea.

Comments (2)

  1. David Dickinson

    This is interesting. I think we can certainly do more to help vectorisation across the code, and invert_rhs and the source calculation would be good places to improve. Changing the memory layout is a big commitment, and I’d be worried that it may lead to worse performance in other areas of the code. It would be nice to be able to switch between approaches, but I can’t see an easy way to do this currently. One approach could be to transpose the data before entry to vectorised routines, but I expect the cost of this is likely to wipe out any gains elsewhere, as transposes can be expensive.

    Flagging to @Peter Hill , @Joseph Parker and @Colin Malcolm Roach
