<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic in Intel® Fortran Compiler</title>
    <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Can-this-be-made-better/m-p/968866#M96534</link>
    <description>&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;JD,&lt;/P&gt;

&lt;P&gt;The i subscript was intentionally put last because the problem&lt;/P&gt;

&lt;P&gt;is decoupled with respect to this index for quite a lot of the code.&lt;/P&gt;

&lt;P&gt;So there are a number of places elsewhere where the omp loop is over&lt;/P&gt;

&lt;P&gt;the last index i. This makes the whole calculation quite efficient elsewhere&lt;/P&gt;

&lt;P&gt;in the code and thus this bit now stands out. I may try some copying into&lt;/P&gt;

&lt;P&gt;temporary work arrays to see if things improve.&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;I tried the -x option of ifort with SSE4.1 and AVX with no perceptible&lt;/P&gt;

&lt;P&gt;change in the CPU timings.&lt;/P&gt;

&lt;P&gt;Thanks.&lt;/P&gt;

&lt;P&gt;--&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Wed, 02 Apr 2014 21:05:09 GMT</pubDate>
    <dc:creator>a_b_1</dc:creator>
    <dc:date>2014-04-02T21:05:09Z</dc:date>
    <item>
      <title>Can this be made better?</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Can-this-be-made-better/m-p/968862#M96530</link>
      <description>&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;Hello,&lt;/P&gt;

&lt;P&gt;I wonder if anyone has the time and inclination to have a look at the code below for&lt;/P&gt;

&lt;P&gt;any possible improvements.&lt;/P&gt;

&lt;P&gt;The extract included here is the heaviest user of CPU time in a large-ish simulation code.&lt;/P&gt;

&lt;P&gt;A typical run would take 6-9 months of running 24/7 with six threads on six cores.&lt;/P&gt;

&lt;P&gt;The omp part is working very well and there cannot be much improvement in the multithreading part.&lt;/P&gt;

&lt;P&gt;The compiler call used for the whole code is:&lt;/P&gt;

&lt;P&gt;ifort -O3 -r8&amp;nbsp; -openmp -fpp -parallel -mcmodel=medium -i-dynamic -shared-intel&lt;/P&gt;

&lt;P&gt;Would there be a benefit if part or all of it were written in assembler?&lt;/P&gt;

&lt;P&gt;Lots and lots of thanks for any suggestions.&lt;/P&gt;

&lt;P&gt;--&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;PRE class="brush:fortran;"&gt;! typical values

! N1 = 768
! N2 = N3 = 12
! M3 = M2 = 42

  Do KEL = 1, N3
  Do JEL = 1, N2

...
[address calculations]

!$OMP  PARALLEL DEFAULT(SHARED) PRIVATE( I, J, K, JA, KA, JJ, KK )
!$OMP DO
   Do K = 1, M3
      Do J = 1, M2
         JJ = (J-1)*NX32

! - copy into work arrays for later fft.

         Do I = 1, N1
            WK_1( JJ+I, K ) = U( J_Jump+J, K_Jump+K, I )
            WK_2( JJ+I, K ) = V( J_Jump+J, K_Jump+K, I )
            WK_3( JJ+I, K ) = W( J_Jump+J, K_Jump+K, I )
         End Do

         Do I = 1, N1, 2
! - du/dx
            WKX_1( JJ+I,   K ) = -Wv(i)*U( J_Jump+J, K_Jump+K, I+1 )
            WKX_1( JJ+I+1, K ) =  Wv(i)*U( J_Jump+J, K_Jump+K, I   )
! - dv/dx
            WKX_2( JJ+I,   K ) = -Wv(i)*V( J_Jump+J, K_Jump+K, I+1 )
            WKX_2( JJ+I+1, K ) =  Wv(i)*V( J_Jump+J, K_Jump+K, I   )
! - dw/dx
            WKX_3( JJ+I,   K ) = -Wv(i)*W( J_Jump+J, K_Jump+K, I+1 )
            WKX_3( JJ+I+1, K ) =  Wv(i)*W( J_Jump+J, K_Jump+K, I   )

         End Do

! - Y derivatives.

         Do JA =  1, M2
            Do I = 1, N1
               WK_4( JJ+I, K ) = WK_4( JJ+I, K ) + RDY*DYGL(J,JA)*U( J_jump+JA, K_jump+K, I )  ! du/dy
               WK_5( JJ+I, K ) = WK_5( JJ+I, K ) + RDY*DYGL(J,JA)*V( J_jump+JA, K_jump+K, I )  ! dv/dy
               WK_6( JJ+I, K ) = WK_6( JJ+I, K ) + RDY*DYGL(J,JA)*W( J_jump+JA, K_jump+K, I )  ! dw/dy
            End Do
         End Do

! - Z derivatives.

         Do KA = 1, M3
            Do I = 1, N1
               WK_7( JJ+I, K ) = WK_7( JJ+I, K ) + RDZ*DZGL(K,KA)*U( J_jump+J, K_jump+KA, I )   ! du/dz
               WK_8( JJ+I, K ) = WK_8( JJ+I, K ) + RDZ*DZGL(K,KA)*V( J_jump+J, K_jump+KA, I )   ! dv/dz
               WK_9( JJ+I, K ) = WK_9( JJ+I, K ) + RDZ*DZGL(K,KA)*W( J_jump+J, K_jump+KA, I )   ! dw/dz
            End Do
         End Do

      End Do
   End Do   ! eo single element loop.
!$OMP END DO
!$OMP END PARALLEL

...
[other stuff]

end do
end do
&lt;/PRE&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 02 Apr 2014 07:43:08 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Can-this-be-made-better/m-p/968862#M96530</guid>
      <dc:creator>a_b_1</dc:creator>
      <dc:date>2014-04-02T07:43:08Z</dc:date>
    </item>
    <item>
      <title>I suppose you would require</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Can-this-be-made-better/m-p/968863#M96531</link>
      <description>&lt;P&gt;I suppose you would require selection of sse4.1 or avx to see vectorization in the opt-report. If the compiler doesn't accept unroll-and-jam directives at -O3 you may need to write the unrolling explicitly and use an event profiler to see if you are alleviating poor cache behavior.&lt;/P&gt;

&lt;P&gt;My best results with unroll and jam came with specified number:&lt;/P&gt;

&lt;P&gt;!dir$ unroll_and_jam = 2&lt;/P&gt;

&lt;P&gt;You would put the directive ahead of the JA and KA loops. &amp;nbsp;This should replace the unrolling which the compiler normally does on the inner loop only (which could be eliminated by putting !dir$ unroll(0) there). &amp;nbsp;The compiler would tell you in the opt-report what unrolling was actually chosen.&lt;/P&gt;
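
&lt;P&gt;For example, placed ahead of the JA loop from the posted code (a sketch reusing that post's own names and bounds):&lt;/P&gt;

&lt;PRE class="brush:fortran;"&gt;!dir$ unroll_and_jam = 2
         Do JA = 1, M2
            Do I = 1, N1
               WK_4( JJ+I, K ) = WK_4( JJ+I, K ) + RDY*DYGL(J,JA)*U( J_jump+JA, K_jump+K, I )
            End Do
         End Do
&lt;/PRE&gt;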

&lt;P&gt;If 2 works, you would need to test whether a larger number (including whatever the compiler might choose) is better, using a cut-down version of your application. &amp;nbsp;For VTune profiling (with one of the L1 and L2 cache analysis choices), you probably need a case which runs for just a few minutes without enabling multiple runs.&lt;/P&gt;</description>
      <pubDate>Wed, 02 Apr 2014 08:06:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Can-this-be-made-better/m-p/968863#M96531</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2014-04-02T08:06:00Z</dc:date>
    </item>
    <item>
      <title>Thank you.</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Can-this-be-made-better/m-p/968864#M96532</link>
      <description>&lt;P&gt;Thank you.&lt;/P&gt;

&lt;P&gt;With or without the unroll_and_jam directive the -opt-report gives nothing for&lt;/P&gt;

&lt;P&gt;the lines concerned.&lt;/P&gt;

&lt;P&gt;Anyway, I found that&lt;/P&gt;

&lt;P&gt;!dir$ unroll_and_jam = 4&lt;/P&gt;

&lt;P&gt;produces about a 4-5% reduction in CPU time for the whole code. Not a bad start.&lt;/P&gt;

&lt;P&gt;I will follow the rest of your suggestions and see what comes up.&lt;/P&gt;

&lt;P&gt;I am also trying to switch to FFTW, with weird things happening, but that&lt;/P&gt;

&lt;P&gt;is a matter for another forum - I know.&lt;/P&gt;

&lt;P&gt;--&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 02 Apr 2014 13:45:37 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Can-this-be-made-better/m-p/968864#M96532</guid>
      <dc:creator>a_b_1</dc:creator>
      <dc:date>2014-04-02T13:45:37Z</dc:date>
    </item>
    <item>
      <title>Can you reorganize your U,V,W</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Can-this-be-made-better/m-p/968865#M96533</link>
      <description>&lt;P&gt;Can you reorganize your U,V,W arrays such that the last index is rearranged to be first?&lt;/P&gt;

&lt;P&gt;Alternately, if the U,V,W arrays are relatively fixed through long compute sections, then consider shadowing U,V,W with Uotherway, Votherway, Wotherway that have the I index first (you can pick a different naming for "otherway"). Only update the shadows when necessary.&lt;/P&gt;
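
&lt;P&gt;A minimal sketch of the shadowing (Uotherway and the NJ, NK bounds here are only illustrative, assuming U is dimensioned U(NJ,NK,N1)):&lt;/P&gt;

&lt;PRE class="brush:fortran;"&gt;! refresh the I-first shadow only when U has changed
Do K = 1, NK
   Do J = 1, NJ
      Do I = 1, N1
         Uotherway( I, J, K ) = U( J, K, I )
      End Do
   End Do
End Do
! hot loops then read Uotherway with unit stride in I
&lt;/PRE&gt;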

&lt;P&gt;Making the above change will improve vectorization opportunities (at the expense of making the copy).&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;
      <pubDate>Wed, 02 Apr 2014 17:42:54 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Can-this-be-made-better/m-p/968865#M96533</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2014-04-02T17:42:54Z</dc:date>
    </item>
    <item>
      <title> </title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Can-this-be-made-better/m-p/968866#M96534</link>
      <description>&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;JD,&lt;/P&gt;

&lt;P&gt;The i subscript was intentionally put last because the problem&lt;/P&gt;

&lt;P&gt;is decoupled with respect to this index for quite a lot of the code.&lt;/P&gt;

&lt;P&gt;So there are a number of places elsewhere where the omp loop is over&lt;/P&gt;

&lt;P&gt;the last index i. This makes the whole calculation quite efficient elsewhere&lt;/P&gt;

&lt;P&gt;in the code and thus this bit now stands out. I may try some copying into&lt;/P&gt;

&lt;P&gt;temporary work arrays to see if things improve.&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;I tried the -x option of ifort with SSE4.1 and AVX with no perceptible&lt;/P&gt;

&lt;P&gt;change in the CPU timings.&lt;/P&gt;

&lt;P&gt;Thanks.&lt;/P&gt;

&lt;P&gt;--&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 02 Apr 2014 21:05:09 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Can-this-be-made-better/m-p/968866#M96534</guid>
      <dc:creator>a_b_1</dc:creator>
      <dc:date>2014-04-02T21:05:09Z</dc:date>
    </item>
    <item>
      <title>If -xsse4.1 doesn't trigger</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Can-this-be-made-better/m-p/968867#M96535</link>
      <description>If -xsse4.1 doesn't trigger vectorization you might try !dir$ simd on those loops.  It means the compiler didn't rate vectorization as useful, likely due to the large-stride operand, but you may as well try it.</description>
      <pubDate>Wed, 02 Apr 2014 22:25:25 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Can-this-be-made-better/m-p/968867#M96535</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2014-04-02T22:25:25Z</dc:date>
    </item>
    <item>
      <title> </title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Can-this-be-made-better/m-p/968868#M96536</link>
      <description>&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;I have tested the JD suggestion of carrying two copies of the big arrays&lt;/P&gt;

&lt;P&gt;with reverse indexing, i.e. I now have both uri( i, j, k ) = u( j, k, i ) etc.&lt;/P&gt;

&lt;P&gt;On an old Q9650 (4 cores, 4 threads) the overall reduction in CPU usage is&lt;STRONG&gt; 8%&lt;/STRONG&gt;.&lt;/P&gt;

&lt;P&gt;On the newer i7-3090k (6 cores &amp;amp; 6 threads) overall code CPU usage dropped&lt;/P&gt;

&lt;P&gt;by &lt;STRONG&gt;33%&lt;/STRONG&gt; (no kidding).&lt;/P&gt;

&lt;P&gt;Fiddling with the !dir$ unroll_and_jam directive gives a consistent &lt;STRONG&gt;4%&lt;/STRONG&gt; reduction in CPU time.&lt;/P&gt;

&lt;P&gt;These are quite good results, with an increase of 33% in the usage of&lt;/P&gt;

&lt;P&gt;RAM (+1.4GB for my current case).&lt;/P&gt;

&lt;P&gt;There are a few more things to try.&lt;/P&gt;

&lt;P&gt;Now, what stands out as the most CPU-intensive part is the FFTs.&lt;/P&gt;

&lt;P&gt;I am trying to switch to FFTW but so far this has completely&lt;/P&gt;

&lt;P&gt;messed up the whole code.&lt;/P&gt;

&lt;P&gt;Thank you.&lt;/P&gt;

&lt;P&gt;--&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 03 Apr 2014 10:53:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Can-this-be-made-better/m-p/968868#M96536</guid>
      <dc:creator>a_b_1</dc:creator>
      <dc:date>2014-04-03T10:53:00Z</dc:date>
    </item>
    <item>
      <title>33% improvement is a good</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Can-this-be-made-better/m-p/968869#M96537</link>
      <description>&lt;P&gt;33% improvement is&amp;nbsp;a good return on a little bit of time invested in making this change.&lt;/P&gt;

&lt;P&gt;In some cases you will want to organize your data, paying&amp;nbsp;particular attention to the dimension index with respect to vectorization. IOW, which order will improve performance by improving vectorization.&lt;/P&gt;

&lt;P&gt;In other cases, part of the code may benefit from one dimension order while a different part of the code benefits from a different order. When the order does not flip frequently (one order is used multiple times in a row before the other), I incorporate a flag indicating which array has the most recent representation of the data (0 = a same as b, 1 = a more recent than b, -1 = b more recent than a). When the most recent copy is not the one I wish to use, a copy is made and the flag is set to 0 if the procedure is read-only, or to 1/-1 for modification, depending on which array is being modified. This cuts down on unnecessary copies/transformations.&lt;/P&gt;
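
&lt;P&gt;A sketch of that bookkeeping (the names and the copy routine here are illustrative only):&lt;/P&gt;

&lt;PRE class="brush:fortran;"&gt;integer :: fresh   ! 0 = a same as b, 1 = a more recent, -1 = b more recent

! in a section that wants to work on b:
if (fresh == 1) then
   call copy_transposed(a, b)   ! hypothetical copy/transform a -&gt; b
   fresh = 0
end if
! ... use b; if this section modifies b, set fresh = -1 afterwards ...
&lt;/PRE&gt;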

&lt;P&gt;FWIW I also used this technique when an array could reside in CPU memory or GPU memory (or both).&lt;/P&gt;

&lt;P&gt;Considering that SIMD vectors are getting wider and wider, more care is required in designing data layouts. Discount scatter/gather, as that will only reduce instruction count and not store/fetch cycles. Scatter and gather are beneficial for infrequent accesses of those data vectors, with respect to being used in combination with other non-gathered vectors.&lt;/P&gt;

&lt;P&gt;You will tend to find&amp;nbsp;that most of the compiler's gripes about poor vectorization are solvable by reorganization of the data.&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Thu, 03 Apr 2014 15:01:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Can-this-be-made-better/m-p/968869#M96537</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2014-04-03T15:01:00Z</dc:date>
    </item>
  </channel>
</rss>

