
Performance decrease with OMP simd

Grund__Aalexander

I have some Fortran code from a benchmark suite that uses OpenMP 4.0 features, among them the new "omp simd" directive to vectorize parallelized loop nests.
If I omit the "omp simd", the code actually runs faster: around 15% less runtime on a large dataset with ~25 minutes total runtime, where the modified code accounts for approx. 25-40% of that total. This was tested on an Intel MIC via "omp target".

I compared the -vec-report1 outputs for both builds. I get "OpenMP SIMD LOOP WAS VECTORIZED" / "LOOP WAS VECTORIZED" for every one of the loops, so there should not be such a big difference in runtime.

I suppose this is most likely a bug in the compiler. Can you explain the behavior?

Typical usage looks like the following:

    !$omp do
    DO k=y_min,y_max+1
      !$omp simd
      DO j=x_min-1,x_max+2
        someArray(j,k)=someOtherArray(j,k)-foo(j-1,k)+bar(j,k)
      ENDDO
    ENDDO

    !$omp do 
    DO k=y_min,y_max+1
      !$omp simd PRIVATE(xFoo) !!!! When removing the simd here, place the private clause in the "omp do"
      DO j=x_min-1,x_max+1
        IF(someCondition)THEN
          xFoo=1
        ELSE
          xFoo=j
        ENDIF
        ! Some more code
        someArray(j,k)=foo(xFoo,k)*bar(j,k)
      ENDDO
    ENDDO

Please note that I cannot show the real code here in public, but be assured the code is doing pretty much exactly that. I may send the actual code to Intel for investigation purposes, though.

Steven_L_Intel1
Employee

Without seeing a complete example we can build and run, it's difficult to speculate. Please do send a test case to Intel Premier Support and we'll be glad to investigate.

Grund__Aalexander

Unfortunately, I don't have direct access to Premier Support. It would have to go through our license person, and I wanted to avoid the hassle.

Is there any other way of passing you the example?

Steven_L_Intel1
Employee

You can use Send Author a Message and attach it there, or send it to me by email at steve.lionel at intel dot com. Best would be a program with a modest-sized dataset that doesn't require MIC, to reduce the number of variables that can affect performance.

jimdempseyatthecove
Honored Contributor III

What happens with:

!$omp do  PRIVATE(xFoo)
DO k=y_min,y_max+1
  !$omp simd
  DO j=x_min-1,x_max+1
    IF(someCondition)THEN
      xFoo=1
    ELSE
      xFoo=j
    ENDIF
    ! Some more code
    someArray(j,k)=foo(xFoo,k)*bar(j,k)
  ENDDO
ENDDO

In your original code, with PRIVATE(xFoo) on the simd, I am not sure whether that is proper practice. Either it is not proper practice (and the compiler did not object when it should have),

.OR. it is proper practice .AND. the compiler did not privatize xFoo (thus causing misuse and cache line evictions).

In either case, placing the PRIVATE(xFoo) on the !$omp do (or on the !$omp parallel, not shown) should not make things worse.
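
A sketch of the "on the !$omp parallel" variant just mentioned (the enclosing parallel region is not shown in the snippets above, so this is only an illustration):

    !$omp parallel private(xFoo)
    !$omp do
    DO k=y_min,y_max+1
      !$omp simd
      DO j=x_min-1,x_max+1
        IF(someCondition)THEN
          xFoo=1
        ELSE
          xFoo=j
        ENDIF
        ! Some more code
        someArray(j,k)=foo(xFoo,k)*bar(j,k)
      ENDDO
    ENDDO
    !$omp end do
    !$omp end parallel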

Jim Dempsey

Ron_Green
Moderator

The new opt-report and vec-report formats in the 15.0 Beta compiler are far improved. I would start there first:

https://softwareproductsurvey.intel.com/survey/150347/2afa/

and use the new -opt-report3 output. The reason I say to use OPT-REPORT instead of VEC-REPORT is that there is probably more at play here than just vectorization: the vectorizer and the optimizer interact.

It's not clear whether someCondition is loop variant or loop invariant. This could affect the code generated in the presence of OMP SIMD PRIVATE(xFoo). If it's loop invariant, the conditional may have been hoisted out of the inner loop to the outer loop, with a mask vector and a temporary for xFoo created and used for the vectorized remaining loop

someArray(j,k)=foo(xFoo,k)*bar(j,k)

The presence of PRIVATE may inhibit that hoisting. I don't know without looking at the assembly with the -S option OR using the new OPT-REPORT in the 15.0 beta. It's also possible that the PRIVATE inside the outer loop causes a new allocation every (y_max - y_min)+1 iterations, so I can see how that could cause a slowdown and why moving PRIVATE to the outer parallel region would be more efficient. Again, I'd look at the assembly with -S or use the new beta 15.0 opt-report3 output to see what is going on between the two cases.
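
As an illustration of the kind of specialization that becomes possible when someCondition does not depend on j, here is a hand-written sketch of the two-version code the optimizer might effectively generate (not a suggested source change):

    IF(someCondition)THEN            ! hoisted test: xFoo would be 1 for every j
      DO j=x_min-1,x_max+1
        someArray(j,k)=foo(1,k)*bar(j,k)
      ENDDO
    ELSE                             ! xFoo would be j for every j
      DO j=x_min-1,x_max+1
        someArray(j,k)=foo(j,k)*bar(j,k)
      ENDDO
    ENDIF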

Just speculation at this point, until the actual sources can be analyzed. But it's clear there are many reasons why these two variants would generate very different code. The 'loop vectorized' message tells you nothing other than that SOME KIND of vector code was generated. It says nothing about the efficiency of the vectorization. All vectorization is not created equal, and "vectorized" does NOT equal "efficient".

   
John_Campbell
New Contributor II

What is the probability that "IF(someCondition)THEN" is true or false?
If it is a very skewed test, it may be worth running the loop (optimised) for the highly probable outcome, then correcting for the rare condition. This would remove the variable xFoo.

The loop could first be replaced by: someArray(xmin:xmax,k) = foo(xmin:xmax,k)*bar(xmin:xmax,k)

or by: someArray(xmin:xmax,k) = const*bar(xmin:xmax,k)
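
A sketch of that "optimise for the common case, then patch" idea applied to the second loop nest from the original post (assuming, purely for illustration, that the ELSE branch xFoo=j is by far the more common outcome, and using the original bounds):

    ! Common case: xFoo == j everywhere
    someArray(x_min-1:x_max+1,k) = foo(x_min-1:x_max+1,k)*bar(x_min-1:x_max+1,k)
    ! Rare correction: fix up the few elements where the condition holds
    DO j=x_min-1,x_max+1
      IF(someCondition) someArray(j,k)=foo(1,k)*bar(j,k)
    ENDDO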

John

Grund__Aalexander

Ifort 15.0 Beta does not run properly with these sources, so I cannot test it there. (Bug reported via Premier.)

Putting the private clause on the omp do DOES fix this behavior, although this is wrong IMO.

The OMP spec states that "private" makes a variable private to a task (in the omp do/parallel case) or to a SIMD lane (in the simd case).

The variables are only used within the SIMD lanes, so I would expect that making them private to the task would introduce more overhead (for creating new private variables). Also: putting them on the "omp do" ONLY is wrong according to the spec, although it works with current Intel compilers; they should then be shared across the SIMD lanes, but are not.

If there is anything wrong with my understanding of this, please correct me.

The condition is loop variant. It is a condition of the type "array.LT.someConst", and the probabilities cannot be determined easily, as the array changes (outside the loop).

PS: I did send the code to Steve Lionel for the purpose of working on this bug.

jimdempseyatthecove
Honored Contributor III

>>The OMP spec states that "private" makes a variable private to a task (in the omp do/parallel case) or to a SIMD lane (in the simd case).

>>The variables are only used within the SIMD lanes, so I would expect that making them private to the task would introduce more overhead (for creating new private variables). Also: putting them on the "omp do" ONLY is wrong according to the spec, although it works with current Intel compilers; they should then be shared across the SIMD lanes, but are not.

>>If there is anything wrong with my understanding of this, please correct me.

Let's try to dissect your presumptions:

PRIVATE on the simd directive effectively declares that there are no cross-lane dependencies on the private variable from iteration to iteration, which in turn means no temporal dependencies between iterations. This declaration therefore means the variable can be (implicitly) defined as a vector (as opposed to the scalar represented in the source code).

Now then, the question remains: at which scoping point in the program is the vector expansion of the scalar made?

a) at the point of declaration (and shared amongst the threads of the thread team(s) instantiated within the scope of the definition of the scalar-now-vector), or

b) a new vector is instantiated at the point of the !$omp simd PRIVATE(xFoo).

If a), the vector is shared amongst the threads of the enclosing parallel region, but each SIMD lane is private with respect to the other lanes... while still being shared, at that lane position, amongst the threads of the enclosing parallel region (catching my breath).

If b), then both the variable and the lanes are private.

The creation of a private variable (when used without COPYIN or FIRSTPRIVATE) has essentially zero overhead on entering a parallel region (the value of the constant subtracted from the stack pointer differs; the number of operations does not).

IFF (if and only if) there were overhead in creating the SIMD vector from a scalar, then moving the PRIVATE off the simd and onto the !$omp DO would reduce the number of times that overhead is encountered (from the (y_max+1+1-y_min) outer iterations down to the number of threads).
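
To make the "vector expansion" idea concrete, here is a hand-written sketch of what privatizing xFoo per SIMD lane amounts to conceptually (VLEN and xFooVec are purely illustrative names, not anything the compiler exposes):

    INTEGER, PARAMETER :: VLEN = 8          ! assumed vector length
    INTEGER :: xFooVec(VLEN), lane

    DO j=x_min-1,x_max+1,VLEN
      DO lane=1,MIN(VLEN,x_max+1-j+1)       ! each lane gets its own copy of xFoo
        IF(someCondition)THEN
          xFooVec(lane)=1
        ELSE
          xFooVec(lane)=j+lane-1
        ENDIF
      ENDDO
      ! ... the vectorized body then uses xFooVec(:) for this strip of j values ...
    ENDDO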

Jim Dempsey

John_Campbell
New Contributor II

I would estimate that the IF condition in the inner loop makes the simd implementation less effective. Strategies to remove the IF would help.

Using the more likely array syntax from my Quote #7, followed by a DO patch (without simd), may reduce the elapsed time. All of this stays inside the OpenMP outer DO loop.

Alternatively, any way of grouping the IF may improve the simd performance.

Could you replace xFoo with an index vector for the two types of calculation ( foo(xFoo,k) -> foo(ifoo(j),k) )? Although this might also make the simd accesses non-contiguous.
Or you could replace foo(xFoo,k) with a vector fook(xmin:xmax) which is calculated from the result of the IF alternatives.

John

jimdempseyatthecove
Honored Contributor III

>>I would estimate that the IF condition in the inner loop makes the simd implementation less effective.

Not any more. The SIMD instruction sets have a masked move. Pseudo code:

xFoo = j
mask = (condition)
xFoo = maskMove(xFoo, 1, mask)

There is no branching involved. You should be able to see this by looking at the disassembly.

** this does require an instruction set with masked move

The instruction sets without masked move can accomplish the same thing in a few more instructions, also without a branch.

Jim Dempsey

Steven_L_Intel1
Employee

A forum member who wishes to remain anonymous sent me the following:

Just a comment on this forum message regarding OMP speed.  A similar issue came up for me just last night.

A colleague was using MKL on a 12-core (2 socket) workstation and asked me to look into why ZAXPY() was only using 2 threads.

I knew why, so I wrote my own OMP loop to explain it to him: y(:)=y(:)+a*x(:) vectorized quite nicely, and it takes just one thread to saturate a socket's memory bandwidth. That is, the computation isn't the bottleneck; the memory is.

So, MKL knows this and only bothers to use 2 threads (1 per socket), so that other cores are available for other work. You can write your own OMP code to "use" all 12 cores, but the speed will be worse than with 2 cores, due to cache/memory-channel thrashing.

I didn't know whether anyone had commented to the user about this, but his loops seem very trivial, so I would expect them to be memory bound, not CPU bound.

...

Just in case you’re interested, here is info on the STREAM memory benchmark code:

http://www.cs.virginia.edu/stream/

You probably know some of this already, but this benchmark is often used to measure main memory bandwidth. This very simple OMP code is used by many of us to test and validate your new chips, e.g. how fast is that new 4-channel Xeon memory architecture?

If you look at the "stream.f" source code, it is just simple OMP loops over large arrays (like the loops in this forum topic), with expressions like a(:) = b(:)+c*d(:).

When vectorized and threaded, that expression would probably require over 150 GB/s of memory bandwidth for the RAM to keep up with the CPU. Since the memory channel becomes the bottleneck (e.g. 40 GB/s), the execution time can be used to directly calculate the (streaming) memory bandwidth of the CPU/system. Clearly, using more cores (threads) simply cannot make the expression compute faster, and often it becomes slower due to memory contention. (Just a single core looping on an "addps" or "mulps" expression can saturate the memory channel.)
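
For reference, a minimal self-contained OpenMP triad along the lines described above (array names, sizes and initial values are placeholders; this is only a sketch of the kind of bandwidth-bound loop being discussed, not the actual STREAM source):

    program triad_sketch
      implicit none
      integer, parameter :: n = 50000000
      real(8), allocatable :: a(:), b(:), d(:)
      real(8) :: c
      integer :: i
      allocate(a(n), b(n), d(n))
      b = 1.0d0; d = 2.0d0; c = 3.0d0
      !$omp parallel do
      do i = 1, n
        a(i) = b(i) + c*d(i)      ! three memory streams per iteration: memory bound, not CPU bound
      end do
      !$omp end parallel do
      print *, a(1), a(n)         ! keep the result live so the loop is not optimized away
    end program triad_sketch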

TimP
Honored Contributor III

Up to this point no one was discussing the multi-core aspect of performance for this case.  

Perhaps people may be confused by the move to include vectorization directives (!$omp simd) in the OpenMP standard which previously dealt with threading.  The example rightly requests a parallel threaded outer loop and a vectorized inner loop.

In cases of random alignment, MIC frequently requires loop counts on the order of 2000 (both inner and outer loops in a case like this) to approach full performance.

Note that placing omp do on each outer loop is likely to prevent the compiler from fusing the outer loops.

It looks like it's not even necessary to declare xFoo:

    someArray(j,k)=foo(merge(1,j,someCondition),k)*bar(j,k)
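
Put into the context of the second loop nest from the original post, that would look something like this (a sketch; whether the indexed access to foo vectorizes efficiently is a separate question):

    !$omp do
    DO k=y_min,y_max+1
      !$omp simd
      DO j=x_min-1,x_max+1
        someArray(j,k)=foo(merge(1,j,someCondition),k)*bar(j,k)
      ENDDO
    ENDDO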

As hinted before, if the compiler sees someCondition as loop invariant, you would hope it would automatically generate separate loop versions.

It's annoying when !$omp simd changes code generation in a way which degrades performance. That's not peculiar to Intel compilers; I see even more of it with gcc and now gfortran. In ifort, the OpenMP 4 implementation hasn't caught up with the older non-standard !dir$ simd directive in some respects.

On the other hand, I appreciate the effort the Intel Fortran compiler team has made to avoid dependence on the simd clause for performance.

An obvious question is how to deal with prefetching. It seems that foo(1,k) should be prefetched before the loop starts (try an explicit prefetch), while foo(j,k) has to be prefetched (preferably automatically), both with initial prefetches of the first cache lines and with prefetches inside the loop for the later cache lines.

jimdempseyatthecove
Honored Contributor III

Tim,

Good point on the fusing of the loops. I missed that.

Depending on (someCondition), it might be better to hand-fuse the two inner loops and enclose both in a single outer loop.

While the compiler might be able to do this, it doesn't hurt to nudge it in the right direction. You would want to look at the vectorization reports, as you would not want to trade off vectorization against the apparent reduction in writes (actual writes will increase).
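
A hypothetical hand-fused form (resultA and resultB are placeholder names, assuming the two statements in the original post actually target different arrays; the first loop's extra iteration at j=x_max+2 is peeled off after the inner loop, and Tim's merge form replaces xFoo):

    !$omp do
    DO k=y_min,y_max+1
      !$omp simd
      DO j=x_min-1,x_max+1
        resultA(j,k)=someOtherArray(j,k)-foo(j-1,k)+bar(j,k)
        resultB(j,k)=foo(merge(1,j,someCondition),k)*bar(j,k)
      ENDDO
      resultA(x_max+2,k)=someOtherArray(x_max+2,k)-foo(x_max+1,k)+bar(x_max+2,k)   ! peeled remainder of the first loop
    ENDDO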

Jim Dempsey
