SIMD + OpenMP

Benedikt_R_ · 04-25-2017

Hi

I'm using Intel-Fortran 2016.

Currently my program has a loop

!$OMP PARALLEL DO reduction(min:delt_) PRIVATE(LEDG,...) SHARED(...)
      DO 10 LWETEDG = 1,NWETEDGE

The Advisor profiler points out that this loop is not vectorized, so I changed it to

!$OMP DO SIMD reduction(min:delt_) PRIVATE(LEDG,...) SHARED(...)
      DO 10 LWETEDG = 1,NWETEDGE

After this change, the Advisor profiler reports that the loop is vectorized, but the loop seems to be eight times slower.

Since I set OMP_NUM_THREADS = 8, I assume the loop is not multithreaded anymore.

Question: What is the proper way to create loops which are multithreaded AND SIMD-parallelized?

Bye

Benedikt

TimP · 04-25-2017

You have changed from threaded reduction to single thread simd reduction. That is often a good choice, and likely to produce superior performance on a small number of cores.
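A minimal sketch of why the changed directive ends up on one thread, assuming a free-form source file compiled with OpenMP enabled (for example ifort -qopenmp or gfortran -fopenmp). The program name, the array a, and the two counter variables are made up for illustration. A DO SIMD construct with no enclosing PARALLEL region is "orphaned", so the team that shares its iterations is just the initial thread:

```fortran
program orphaned_do_simd
  use omp_lib
  implicit none
  integer, parameter :: n = 1000
  integer :: i, nthreads_orphaned, nthreads_parallel
  real :: a(n), delt

  call random_number(a)
  delt = huge(delt)

  ! No PARALLEL region is active here, so the team has exactly one thread
  ! and the worksharing part of DO SIMD has nothing to distribute.
  nthreads_orphaned = omp_get_num_threads()
  !$omp do simd reduction(min:delt)
  do i = 1, n
     delt = min(delt, a(i))
  end do
  !$omp end do simd

  ! For comparison: the team size inside an explicit PARALLEL region.
  nthreads_parallel = 1
  !$omp parallel
  !$omp single
  nthreads_parallel = omp_get_num_threads()
  !$omp end single
  !$omp end parallel

  print '(a,i0)', 'team size at the orphaned do simd:  ', nthreads_orphaned
  print '(a,i0)', 'team size inside a parallel region: ', nthreads_parallel
  if (delt /= minval(a)) error stop 'reduction result differs from MINVAL'
end program orphaned_do_simd
```

With OMP_NUM_THREADS set to 8, the second count would be expected to follow that setting, while the first stays at 1.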

If you want both omp parallel and omp simd reduction, you may need to write explicitly with nested loops and separate named inner and outer reduction variables. Bear in mind that this means batching with separate implicit partial reductions for each thread and (probably multiple riffled batches for) each simd lane. I have not yet seen a successful example of a single loop omp simd reduction, which would be written with

!$omp parallel do simd reduction(......
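Both forms can be sketched side by side. This is only an illustration under assumptions: the array a, the chunk length of 1024, and all variable names are invented here, and whether either variant actually runs faster depends on the loop body and the compiler. Variant 1 uses the combined construct; variant 2 is the nested form with separately named inner and outer reduction variables:

```fortran
program parallel_simd_min
  implicit none
  integer, parameter :: n = 100000
  integer, parameter :: chunk = 1024   ! hypothetical batch length
  integer :: i, j, jend
  real :: a(n), delt_combined, delt_outer, delt_inner

  call random_number(a)

  ! Variant 1: combined construct. Iterations are split among threads,
  ! and each thread's share is then vectorized.
  delt_combined = huge(delt_combined)
  !$omp parallel do simd reduction(min:delt_combined)
  do i = 1, n
     delt_combined = min(delt_combined, a(i))
  end do
  !$omp end parallel do simd

  ! Variant 2: explicit nesting. The outer loop hands out batches to
  ! threads; the inner loop is a simd reduction into a private variable
  ! that is then folded into the outer reduction variable.
  delt_outer = huge(delt_outer)
  !$omp parallel do private(i, jend, delt_inner) reduction(min:delt_outer)
  do j = 1, n, chunk
     jend = min(j + chunk - 1, n)
     delt_inner = huge(delt_inner)
     !$omp simd reduction(min:delt_inner)
     do i = j, jend
        delt_inner = min(delt_inner, a(i))
     end do
     delt_outer = min(delt_outer, delt_inner)
  end do
  !$omp end parallel do

  ! MIN is exact, so both variants should agree with the intrinsic.
  if (delt_combined /= minval(a)) error stop 'combined variant mismatch'
  if (delt_outer /= minval(a)) error stop 'nested variant mismatch'
  print *, 'min =', delt_outer
end program parallel_simd_min
```

The nested form gives control over the batch size per thread at the cost of extra bookkeeping; the combined form leaves that choice to the OpenMP runtime and the compiler.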

If your code is simple enough to use Fortran intrinsics such as MINVAL or MINLOC, those are preferable in place of an inner simd loop. e.g.

          max_= aa(1,1)
          xindex=1
          yindex=1
!$omp parallel do private(ml) if(n>103) reduction(max: max_)              &
!$omp& lastprivate(xindex,yindex) firstprivate(yindex)
          do j=1,n
              ml= maxloc(aa(:n,j),dim=1)
              if(aa(ml,j)>max_ .or. aa(ml,j)==max_ .and. j<yindex)then
                  xindex= ml
                  yindex= j
                  max_=aa(ml,j)
              endif
          enddo

The usual way of using auto-vectorizing compilers allows these intrinsics to be vectorized regardless of their inclusion in an OpenMP block. ifort option -Qno-vec changes that to allowing simd vectorization only under explicit !$omp simd directive (which doesn't work with these intrinsics).

If you don't care how ties are broken among threads, you could skip the comparisons aimed at ensuring that the first occurrence is used in case of ties. This implies that your ties may not break consistently.
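If consistent tie-breaking does matter, one alternative sketch is to keep a private candidate per thread and merge the candidates in a critical section. This is a different technique from the lastprivate version above, not a restatement of it, and the names my_max, my_x, my_y and the planted tie values are assumptions for the demonstration. It does not depend on which thread happens to run the last iteration:

```fortran
program maxloc_first_tie
  implicit none
  integer, parameter :: n = 200
  integer :: j, ml, xindex, yindex, my_x, my_y, ref(2)
  real :: aa(n, n), max_, my_max

  call random_number(aa)     ! values in [0, 1)
  aa(17, 40) = 2.0           ! plant two equal maxima to force a tie
  aa(5, 120) = 2.0

  max_ = aa(1, 1)
  xindex = 1
  yindex = 1

  !$omp parallel private(ml, my_max, my_x, my_y)
  ! Each thread starts from the same valid candidate, element (1,1).
  my_max = aa(1, 1)
  my_x = 1
  my_y = 1
  !$omp do
  do j = 1, n
     ml = maxloc(aa(:n, j), dim=1)
     if (aa(ml, j) > my_max .or. (aa(ml, j) == my_max .and. j < my_y)) then
        my_x = ml
        my_y = j
        my_max = aa(ml, j)
     end if
  end do
  !$omp end do nowait
  ! Merge one thread at a time; on equal values the lower column wins.
  !$omp critical
  if (my_max > max_ .or. (my_max == max_ .and. my_y < yindex)) then
     xindex = my_x
     yindex = my_y
     max_ = my_max
  end if
  !$omp end critical
  !$omp end parallel

  ! MAXLOC over the whole array returns the first maximum in array
  ! element order, which is the tie-breaking rule implemented above.
  ref = maxloc(aa)
  if (xindex /= ref(1) .or. yindex /= ref(2)) error stop 'tie broken differently'
  print *, 'max', max_, 'at', xindex, yindex
end program maxloc_first_tie
```

The critical section runs once per thread rather than once per iteration, so its cost should stay small next to the column scans.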

BLOCK might appear to avoid the introduction of simple private declarations such as ml in this example, but apparently OpenMP doesn't (yet?) allow BLOCK within an OpenMP block.

Don't forget that the legacy non-omp simd reduction is particularly sensitive to requiring that all reductions and first/lastprivates be explicit. If your compiler is too old to support omp 4.0 the legacy directives may not be a satisfactory substitute.

Benedikt_R_ · 04-25-2017

Thank you for taking care, Tim.

!$omp parallel do simd reduction(......

Great! I tried several combinations, but not this one. This one compiles. I'll investigate further...

If your code is simple enough ....

No, it is not simple enough. I doubt that vectorization will be advantageous anyway, but I want to give it a try.

If your compiler is too old...

It's "Intel Fortran Compiler, Version 16.0.3.207". As far as I know, this is fairly recent. (?)

Thanks

Benedikt