@Tim P. Thanks for your

Bobby_G_ · ‎07-15-2016

Hello everybody,

Hopefully you could me explaining this thing. I'm working with Fortran and actually writing my code on CFD topic, and below (just for sake of simplicity and just for example) are short explanations of my case:

I should use one 2D array A(i,j) and one 1D array B(i)
I have to do 2 times looping, which is the first looping should be 50,000 times and the second one is 5 times (can't be changed).
The point number 2 above should be looped 10,000 times.

I write the codes with 2 versions (I called them PROG_A and PROG_B).

The first one PROG_A is as shown below:

PROGRAM PROG_A
 REAL*8, DIMENSION(50000,5):: A
 REAL*8, DIMENSION(50000)::B  
 REAL*8:: TIME1,TIME2
       !Just for initial value
           DO I=1, 50000
              A(I,1)=1.0
              A(I,2)=2.0
              A(I,3)=3.0
              A(I,4)=4.0
              A(I,5)=5.0

              B(I)=I 
           END DO
       !Computing computer's running time (start)
           CALL CPU_TIME(TIME1)
                 DO K=1, 100000
                    DO I=1, 50000             !Array should be computed first for 50,000 elements (can't be changed)
                        DO J=1, 5
                           A(I,J)=A(I,J)+SQRT(B(I))   
                        END DO
                    END DO
                 END DO
       !Computing computer's running time (finish)
           CALL CPU_TIME(TIME2)
             PRINT *, 'Elapsed real time = ', TIME2-TIME1, 'second(s)'
END PROGRAM PROG_A

The second one PROG_B is:

PROGRAM PROG_B
 REAL*8, DIMENSION(5,50000):: A
 REAL*8, DIMENSION(50000)::B  
 REAL*8:: TIME1,TIME2
       !Just for initial value
           DO J=1, 50000
              A(1,J)=1.0
              A(2,J)=2.0
              A(3,J)=3.0
              A(4,J)=4.0
              A(5,J)=5.0

              B(J)=J 
           END DO
       !Computing computer's running time (start)
           CALL CPU_TIME(TIME1)
                 DO K=1, 100000
                    DO J=1, 50000             !Array should be computed first for 50,000 elements (can't be changed)
                        DO I=1, 5
                           A(I,J)=A(I,J)+SQRT(B(J))   
                        END DO
                    END DO
                 END DO
       !Computing computer's running time (finish)
           CALL CPU_TIME(TIME2)
             PRINT *, 'Elapsed real time = ', TIME2-TIME1, 'second(s)'
END PROGRAM PROG_B

------------------------------------------------------------------------------------------------------------------------------------------------------------------

As you can see the different is for the first one I used 2D array A(50000,5) and for the second one I used 2D array A(5,50000). The file names are a.f95 and b.f95 respectively. So, the problems are:

I was trying to compile these programs with Intel Fortran 16.0.3 on Linux by -O3 optimization, but I got the message that ''ld:a.F95: file format not recognized; treating as linker script''. I had tried also by changing the names to a.f90 and b.90 and then I got the message ''
for_main.o: In function `main':
for_main.c:(.text+0x19): undefined reference to `__intel_new_feature_proc_init'
/tmp/ifortgcFmXt.o: In function `MAIN__':
a.F90:(.text+0x17): undefined reference to `__intel_new_feature_proc_init'.
Therefore, i change the compiler with Intel 13.0 and it worked and gave the results: a = 36.322 s and b = 13.800 s.
For the comparison purpose, I was also compiling on Gfortran 5.1.0 ( also with -O3 optimization), I've found that the second one is even much slower than the first one. a = 18.121 s and b = 47.803 s.

Could anyone explain me why? To my knowledge, since Fortran is based on "column major", so the second case would be faster than the first one, since I performed (in the second one) the looping for the most inner side of array (in this case i=1, ..., 5). And could anyone also explain me why these programs were not working when compiled by Intel Fortran 16.0.3?

PS: The results of both cases are same for sure.

Many thanks in advance.

Best wishes,

Bob

TimP · ‎07-15-2016

Gfortran is less likely than ifort to take obvious shortcuts which defeat the apparent purpose of the benchmark. You may need to remove files created by one compiler when switching to another.

Are you looking for the compiler to unroll the inner loop of B into a masked AVX512 operation (requiring alignment instructions such as -align array64byte) ? If the compiler doesn't do this, a feature request with more of a "real life" case may be needed.

In the A case, as Jim explained, loop nest interchange is needed for optimization; ifort may do it automatically, while gfortran may not.

jimdempseyatthecove · ‎07-15-2016

PROG_A, has better vectorization opportunities (for modern compilers)

! possible compiler optimization AVX512 (stride 4 for AVX256)
DO I=1, 50000,8
  DO J=1, 5
    A(I:I+7,J)=A(I:I+7,J)+SQRT(B(I:I+7))   
  END DO
END DO
! *** add remainder code here

.OR.

When compiler not smart enough to vectorize, the scalar version of inner loop, on iterations 2:5 will be fetching data predominantly from the same cache lines of A and B.

Jim Dempsey

jimdempseyatthecove · ‎07-15-2016

I should have mentioned, PROG_A should align arrays A and B at cache line (64 bytes).

PROG_B, if using aligned arrays, would likely vectorize as well, however the vector width would be 5-wide, using unaligned and masked loads and stores. This would be somewhat less efficient: fewer lanes of vector used and lanes will span cache lines.

*** CAUTION ***

This simple test ONLY indicates which arrangement is better for this particular code loop. Your actual use may differ significantly from that of this test program. The determined layout for this test program is not necessarily the best layout for the actual application. (IOW run test on actual code).

Jim Dempsey

Bobby_G_ · ‎07-16-2016

@Tim P. Thanks for your answer. Actually my main intention is pointed more on how to write the code well to produce the CPU's running time as lowest as possible. I'm not looking for the more specific one e.g. the compiler which can do the looping into a masked AVX512 operation. Honestly I don't know too much about it. Actually my first thought was that the true results would be given like Intel Fortran compiler did. Since I was just thinking simply that Fortran always does "column major". So, I always assume that PROG_B will always give the faster CPU's running time (so my assumption was satisfied by Intel Fortran compiler). But, I was a bit shocked, when I was using the Gfortran compiler that even PROG_B was even much slower. So, basically I'm not interested for comparing the Intel Fortran with Gfortran compiler. Based on your answer ("You may need to remove files created by one compiler when switching to another"), should I stick only on a compiler while I'm modifying my code or in other words, the code that I write would be somehow compiler dependent? Thanks in advance.

@jimdempseyatthecove: Thanks for your answers. I understand it well that these are only the simple cases. But, let's assume that my true case is something like that. If so, how could I determine that even using A(50000,5) is even worst or better than using A(5,50000) with the some criteria that looping 50,000 time should be at the most outer part (like I posted previously)? Actually I was also posting my problem on another group, and I was told that on Gfortran the possibility is:

    DO K=1, 100000
       DO I=1, 50000
          tmp = sqrt(b(i))
          A(I,1) = A(I,1) + tmp
          A(I,2) = A(I,2) + tmp
          A(I,3) = A(I,3) + tmp
          A(I,4) = A(I,4) + tmp
          A(I,5) = A(I,5) + tmp
       END DO
    END DO

If so, now this case is totally compiler dependent, isn't it? So, what is your best advice for me since now I'm just confusing if I should continue using this 2D array dimension, or I should be "playing safely" and rethinking to modify it totally into 1D array? Thanks in advance.

All suggestions would be highly appreciated.

Best wishes,

Bob

jimdempseyatthecove · ‎07-16-2016

The code I posted in #3 is pseudo code of what good compiler optimization should do with your prog_a loops.

Your code in #5 should optimize to the same machine code as my #3, but is expressed in actual code as opposed to in a representative code as I presented. The purpose of my format for presentation is such to provide for expressivity that illustrates the otherwise hidden vectorization.

Using your code as a baseline, the representative code format would be:

DO K=1, 100000
   DO I=1, 50000,8
      tmp(1:8) = sqrt(b(I:I+7))
      A(I:I+7,1) = A(I:I+7,1) + tmp
      A(I:I+7,2) = A(I:I+7,2) + tmp
      A(I:I+7,3) = A(I:I+7,3) + tmp
      A(I:I+7,4) = A(I:I+7,4) + tmp
      A(I:I+7,5) = A(I:I+7,5) + tmp
   END DO
END DO
! *** remainder, if any, follows

Note, you do not write the code as above, the compiler should do this for you. The format above illustrates how the compiler should vectorize the code.

Jim Dempsey

How to choose the best configuration of 2D array A(i,j)