
Vectorization for Riemann solvers for the single-layer shallow water equations #291

Open
wants to merge 3 commits into master

Conversation

@chaulio commented Apr 10, 2018

These changes introduce new vectorized implementations for the single-layer shallow water equations (chile2010 example). The user can easily switch between the serial and the vectorized code: to use vectorization, one needs to define GEOCLAW_VEC=1 before compiling. More info in the chile2010 Makefile.

@mandli (Member) commented Apr 10, 2018

This looks good to me. @rjleveque do you have any thoughts as to whether this should be a separate example? Do we want to add a test for this version as well?

@rjleveque (Member)

Do I understand correctly that the main differences are

  • Removing some explicit loops, using : notation more,
  • Reordering the indices in some variables. Mostly local variables private to subroutines, but also all the arrays passed in and out of Riemann solvers.
  • Adding directives !DIR$ SIMD to some loops (a schematic before/after sketch follows this list).
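
A schematic before/after sketch of these changes (illustrative code, not lines taken from the actual diff):

c     before: explicit loops, with the cell index i as the last
c     (slowest-varying) dimension of the arrays
      do i = 2-mbc, mx+mbc
         do m = 1, meqn
            amdq(m,i) = dmin1(s(1,i), 0.d0) * wave(m,1,i)
         enddo
      enddo

c     after: the cell index comes first (contiguous in memory), the
c     inner equation loop becomes colon notation, and the loop over
c     cells carries a SIMD directive
!DIR$ SIMD
      do i = 2-mbc, mx+mbc
         amdq(i,:) = dmin1(s(i,1), 0.d0) * wave(i,:,1)
      enddo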

Questions:

  • Anything else?
  • What compilers does this directive work with? A bit of Googling seems to indicate this is an Intel directive, but maybe there's a !$OMP SIMD that is equivalent? If so, why not use that?
  • We should be clear somewhere about what compilers it works with and more generally document it in the docs at some point.
  • Are some of these changes worth making in the main version of the code rather than having a separate set of routines that need to be maintained? I guess the problem is many changes would affect all the Riemann solvers?

@chaulio (Author) commented Apr 11, 2018

@rjleveque:

  • This code was tested with Intel Compilers 16, 17, and 18. You are right, the !DIR$ SIMD directive is Intel-specific (I didn't know that). So I replaced those directives with !$OMP SIMD, which is more portable (see the sketch after this list).
  • The !$OMP SIMD directives should work with any compiler that supports OpenMP 4.0 or later. Unfortunately, it seems that they won't work with GNU compilers 4.x or earlier.
  • As you could see, I had to switch the order of the array indices in many parts of the code. The changes are trivial, but merging this into the main version would indeed be problematic for other implementations of Riemann solvers.
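
For reference, the swap is a one-line change per loop. The loop below is purely illustrative (it computes the 1-wave speed u - sqrt(grav*h) of the shallow water equations, with made-up array names), not code from the actual diff:

c     before (Intel-specific directive):
!DIR$ SIMD
      do i = 2-mbc, mx+mbc
         s(i,1) = u(i) - dsqrt(grav*h(i))
      enddo

c     after (portable OpenMP 4.0 directive):
!$OMP SIMD
      do i = 2-mbc, mx+mbc
         s(i,1) = u(i) - dsqrt(grav*h(i))
      enddo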

@rjleveque (Member)

I haven't done extensive tests, but this seems to be slower than the original on my laptop.

I tried the chile2010 problem with the refinement ratios changed to [4,6] so that it runs a bit longer and set either GEOCLAW_VEC=0 or GEOCLAW_VEC=1. I tried each one three times and got these timings (wall time and CPU time in the columns):

$ grep 'stepgrid  ' _output_VEC*/fort.amr

_output_VEC0a/fort.amr:stepgrid               83.141                153.596    
_output_VEC0b/fort.amr:stepgrid               87.272                163.599    
_output_VEC0c/fort.amr:stepgrid               86.267                160.387
    
_output_VEC1a/fort.amr:stepgrid               93.495                170.048    
_output_VEC1b/fort.amr:stepgrid               95.048                174.852    
_output_VEC1c/fort.amr:stepgrid               93.657                171.370   

Presumably it works better with the Intel compiler?

But we may have bigger problems -- these results seemed strange to me since I was running with 4 threads but saw only about a factor of 2 improvement in wall time. So I ran the same problem with v5.4.1 and got these results:

$ grep 'stepgrid  ' _output_v541*/fort.amr

_output_v541a/fort.amr:stepgrid               48.835                189.820    
_output_v541b/fort.amr:stepgrid               45.118                176.417    
_output_v541c/fort.amr:stepgrid               45.393                177.072    

We need to sort this out first...

@chaulio (Author) commented Apr 12, 2018

I am not sure whether !$OMP SIMD works better with the Intel compiler, but the benefits of vectorization are certainly very small on processors with 256-bit vector instructions. They should be more noticeable with 512-bit instructions (KNC, KNL, Skylake-X, ...).
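(For reference: with 64-bit double precision values, a 256-bit vector register holds 256/64 = 4 operands per instruction, while a 512-bit register holds 8, so the theoretical gain from vectorizing a loop doubles on 512-bit hardware.)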

I performed some experiments for a paper we submitted last month using Haswells (E5-2697v3) and KNLs (7210F) and got overall speedups of 1.6x and 2.3x, respectively. I also used a larger run in these tests, with 100x100 cells on the first level and [6,8] as refinement ratios.

Also, before submitting this pull request I ran a small test with 100x100 and [2,6] on the same Haswell machine, and there was a modest 1.3x improvement in wall time.

I think that the fact that the performance depends on the compiler, machine, and simulation size is yet another motivation to keep vectorization as an optional feature, at least for now.

@chaulio (Author) commented Apr 12, 2018

I just ran the code with 100x100 and [2,6] and 8 threads on the Haswell machine, and this is what I got (with Intel Compiler 17.0):

GEOCLAW_VEC=0 --> stepgrid               89.976                717.657    
GEOCLAW_VEC=1 --> stepgrid               65.268                520.416    

Now with the GNU compiler 5.4.0:

GEOCLAW_VEC=0 --> stepgrid               98.217                777.229    
GEOCLAW_VEC=1 --> stepgrid              103.881                821.955    

So, apparently, it is not working with the GNU compiler. The !$OMP SIMD loops are probably not getting vectorized. But there is no problem with OpenMP scaling.
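(For reference, one way to confirm this would be to check the compilers' vectorization reports, e.g. gfortran's -fopt-info-vec output or ifort's -qopt-report.)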

@mandli (Member) commented Apr 30, 2018

@chaulio Would it be easy to combine the vectorized and non-vectorized source and use a compile-time flag to choose between them? This would better ensure that the code stays updated and that your effort remains sustainable. We were also a bit concerned that putting this in the Chile example directly might confuse some users about when they should or should not use the vectorized version.

@chaulio (Author) commented May 1, 2018

@mandli Do you mean using macros like #if defined(_VEC) to switch between the two versions of the code?

It would not be hard, but I am afraid it would require dozens of such macros in a few different files, so I don't think it would be good for the code organization. The problem is that, while the most important changes are in the Riemann solvers rpn2 and rpt2, there are still many small changes in other files (like step2 and flux2) that were necessary to switch the data layout from AoS (array of structures) to SoA (structure of arrays).

If I understand correctly, these are the two options we are considering right now, and each has its pros and cons (maybe you can think of other pros and cons, or even of other options):

- Option 1: Use separate files for the new code and use the Makefile to decide which files to include during compilation. (This is how the code in this pull request is organized right now.)
Pro: The original code remains unchanged; only a few changes to the Makefile are required.
Con: The new code would hardly be maintained. It would become "outdated" whenever the original code gets modified, unless the programmer remembers to update it as well (which is unlikely).

- Option 2: Introduce the new code in the same files as the original code, using macros to switch between implementations (this is your suggestion). Then vectorization could be turned on by simply setting FFLAGS="-D_VEC" or similar, or something could be added to the Makefile to handle it.
Pro: Theoretically this would be easier to maintain.
Con: The original code would be changed and would become a lot messier with dozens of macros.
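(Note that using preprocessor macros in these sources would also require enabling the preprocessor, e.g. by renaming the files with an uppercase .F extension or by adding a compiler flag such as gfortran's -cpp or ifort's -fpp.)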

In practice, I don't think Option 2 is easier to maintain. We would still have two different implementations, but now they would be mixed in the same files. It would still be necessary for the programmer to explicitly modify both implementations.

As a small example, below I show what inlinelimiter.f would look like if we use Option 2 (notice the '#' preprocessor macros). Keep in mind that the changes necessary in this file were minimal; you should expect a lot more macros in other files, especially step2 and flux2.

c
c
c     =====================================================
      subroutine limiter(maxm,meqn,mwaves,mbc,mx,wave,s,mthlim)
c     =====================================================
c
c     # Apply a limiter to the waves.  
c
c     # Version of December, 2002.
c     # Modified from the original CLAWPACK routine to eliminate calls 
c     # to philim.  Since philim was called for every wave at each cell
c     # interface, this was adding substantial overhead in some cases.
c
c     # The limiter is computed by comparing the 2-norm of each wave with
c     # the projection of the wave from the interface to the left or
c     # right onto the current wave.  For a linear system this would
c     # correspond to comparing the norms of the two waves.  For a 
c     # nonlinear problem the eigenvectors are not colinear and so the 
c     # projection is needed to provide more limiting in the case where the
c     # neighboring wave has large norm but points in a different direction
c     # in phase space.
c
c     # The specific limiter used in each family is determined by the
c     # value of the corresponding element of the array mthlim.
c     # Note that a different limiter may be used in each wave family.
c
c     # dotl and dotr denote the inner product of wave with the wave to
c     # the left or right.  The norm of the projections onto the wave are then
c     # given by dotl/wnorm2 and dotr/wnorm2, where wnorm2 is the 2-norm
c     # of wave.
c
      implicit double precision (a-h,o-z)
      dimension mthlim(mwaves)
#if !defined(_VEC)
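c     # original (AoS-style) layout: the cell index is the last dimension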
      dimension wave(meqn, mwaves, 1-mbc:maxm+mbc)
      dimension    s(mwaves, 1-mbc:maxm+mbc)
#else
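c     # vectorized (SoA-style) layout: the cell index comes first and is
c     # contiguous in memory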
      dimension wave(1-mbc:maxm+mbc, meqn, mwaves)
      dimension    s(1-mbc:maxm+mbc, mwaves)
#endif
c
c
      do 200 mw=1,mwaves
         if (mthlim(mw) .eq. 0) go to 200
         dotr = 0.d0
         do 190 i = 0, mx+1
            wnorm2 = 0.d0
            dotl = dotr
            dotr = 0.d0
            do 5 m=1,meqn
#if !defined(_VEC)
               wnorm2 = wnorm2 + wave(m,mw,i)**2
               dotr = dotr + wave(m,mw,i)*wave(m,mw,i+1)
#else
               wnorm2 = wnorm2 + wave(i,m,mw)**2
               dotr = dotr + wave(i,m,mw)*wave(i+1,m,mw)
#endif
    5          continue
            if (i.eq.0) go to 190
            if (wnorm2.eq.0.d0) go to 190
c
#if !defined(_VEC)
            if (s(mw,i) .gt. 0.d0) then
#else
            if (s(i,mw) .gt. 0.d0) then
#endif
                r = dotl / wnorm2
              else
                r = dotr / wnorm2
              endif
c
            go to (10,20,30,40,50) mthlim(mw)
c
   10       continue
c           --------
c           # minmod
c           --------
            wlimitr = dmax1(0.d0, dmin1(1.d0, r))
            go to 170
c
   20       continue
c           ----------
c           # superbee
c           ----------
            wlimitr = dmax1(0.d0, dmin1(1.d0, 2.d0*r), dmin1(2.d0, r))
            go to 170
c
   30       continue
c           ----------
c           # van Leer
c           ----------
            wlimitr = (r + dabs(r)) / (1.d0 + dabs(r))
            go to 170
c
   40       continue
c           ------------------------------
c           # monotonized centered
c           ------------------------------
            c = (1.d0 + r)/2.d0
            wlimitr = dmax1(0.d0, dmin1(c, 2.d0, 2.d0*r))
            go to 170
c
   50       continue
c           ------------------------------
c           # Beam-Warming
c           ------------------------------
            wlimitr = r
            go to 170
c
  170       continue
c
c           # apply limiter to waves:
c
            do 180 m=1,meqn
#if !defined(_VEC)
               wave(m,mw,i) = wlimitr * wave(m,mw,i)
#else
               wave(i,m,mw) = wlimitr * wave(i,m,mw)
#endif
  180          continue

  190       continue
  200    continue
c
      return
      end

In my opinion, this is only slightly easier to maintain than if we had two different files, with the disadvantage that the code gets messy in some places.

Please consider both options and let me know your opinions. Maybe you can even think of other options to do this.

If you really think Option 2 is the best option, I can do it.

@mandli (Member) commented May 1, 2018

Hmm, I think you answered my question as to whether Option 2 is a good idea. Perhaps the thing to do is to add a test that uses the vectorized code, so that any incompatible change gets caught. Another thing that came up, which I forgot to mention, is whether it makes sense to add the vectorization flag to Makefile.geoclaw and document the capability.

@chaulio (Author) commented May 2, 2018

I am not sure it makes sense to add it to Makefile.geoclaw, because changes to the Makefile in each example directory would still be necessary (we compile different files for rpn2 and rpt2, and those are not included in Makefile.geoclaw).

I think adding the tests is a good idea. In fact, it is a trivial task, assuming there already are tests for the original code: all one needs to do is run the same test again with the vectorization flag added to the compilation.

@mandli (Member) commented May 2, 2018

Exactly what I was thinking. You may be able to just modify the regression test itself and recompile through a sub-class of the test.

For the Makefile business, I get what you are saying. @rjleveque, do you have an opinion about whether to leave the Makefile in the chile example alone and stick this in the apps repository, or is it OK to leave that detail in the Makefile as is, with the warning that this only works on fancier chips and with the Intel compilers?

@mandli (Member) commented May 16, 2018

@chaulio, we are going to have a bunch of the GeoClaw developers in one spot next week, and it would be nice to merge this PR by then. Do you need help with adding the tests? Also, @rjleveque, do you have an idea about the above question?

@rjleveque (Member)

I added this as a discussion point for the workshop next week. Since it is fairly specialized and the implementation is perhaps not yet finalized, I would be inclined to put the example using this in the apps repository for now at least, rather than in the main examples -- or maybe add it to the existing example as a separate Makefile_vectorized with some additional comments included?

@mandli added this to the 5.7.0 milestone on Aug 16, 2019