Hopefully you could me explaining this thing. I'm working with Fortran and actually writing my code on CFD topic, and below (just for sake of simplicity and just for example) are short explanations of my case:
I write the codes with 2 versions (I called them PROG_A and PROG_B).
The first one PROG_A is as shown below:
PROGRAM PROG_A REAL*8, DIMENSION(50000,5):: A REAL*8, DIMENSION(50000)::B REAL*8:: TIME1,TIME2 !Just for initial value DO I=1, 50000 A(I,1)=1.0 A(I,2)=2.0 A(I,3)=3.0 A(I,4)=4.0 A(I,5)=5.0 B(I)=I END DO !Computing computer's running time (start) CALL CPU_TIME(TIME1) DO K=1, 100000 DO I=1, 50000 !Array should be computed first for 50,000 elements (can't be changed) DO J=1, 5 A(I,J)=A(I,J)+SQRT(B(I)) END DO END DO END DO !Computing computer's running time (finish) CALL CPU_TIME(TIME2) PRINT *, 'Elapsed real time = ', TIME2-TIME1, 'second(s)' END PROGRAM PROG_A
The second one PROG_B is:
PROGRAM PROG_B REAL*8, DIMENSION(5,50000):: A REAL*8, DIMENSION(50000)::B REAL*8:: TIME1,TIME2 !Just for initial value DO J=1, 50000 A(1,J)=1.0 A(2,J)=2.0 A(3,J)=3.0 A(4,J)=4.0 A(5,J)=5.0 B(J)=J END DO !Computing computer's running time (start) CALL CPU_TIME(TIME1) DO K=1, 100000 DO J=1, 50000 !Array should be computed first for 50,000 elements (can't be changed) DO I=1, 5 A(I,J)=A(I,J)+SQRT(B(J)) END DO END DO END DO !Computing computer's running time (finish) CALL CPU_TIME(TIME2) PRINT *, 'Elapsed real time = ', TIME2-TIME1, 'second(s)' END PROGRAM PROG_B
As you can see the different is for the first one I used 2D array A(50000,5) and for the second one I used 2D array A(5,50000). The file names are a.f95 and b.f95 respectively. So, the problems are:
for_main.o: In function `main':
for_main.c:(.text+0x19): undefined reference to `__intel_new_feature_proc_init'
/tmp/ifortgcFmXt.o: In function `MAIN__':
a.F90:(.text+0x17): undefined reference to `__intel_new_feature_proc_init'.
Therefore, i change the compiler with Intel 13.0 and it worked and gave the results: a = 36.322 s and b = 13.800 s.
For the comparison purpose, I was also compiling on Gfortran 5.1.0 ( also with -O3 optimization), I've found that the second one is even much slower than the first one. a = 18.121 s and b = 47.803 s.
Could anyone explain me why? To my knowledge, since Fortran is based on "column major", so the second case would be faster than the first one, since I performed (in the second one) the looping for the most inner side of array (in this case i=1, ..., 5). And could anyone also explain me why these programs were not working when compiled by Intel Fortran 16.0.3?
PS: The results of both cases are same for sure.
Many thanks in advance.
Gfortran is less likely than ifort to take obvious shortcuts which defeat the apparent purpose of the benchmark. You may need to remove files created by one compiler when switching to another.
Are you looking for the compiler to unroll the inner loop of B into a masked AVX512 operation (requiring alignment instructions such as -align array64byte) ? If the compiler doesn't do this, a feature request with more of a "real life" case may be needed.
In the A case, as Jim explained, loop nest interchange is needed for optimization; ifort may do it automatically, while gfortran may not.
PROG_A, has better vectorization opportunities (for modern compilers)
! possible compiler optimization AVX512 (stride 4 for AVX256) DO I=1, 50000,8 DO J=1, 5 A(I:I+7,J)=A(I:I+7,J)+SQRT(B(I:I+7)) END DO END DO ! *** add remainder code here
When compiler not smart enough to vectorize, the scalar version of inner loop, on iterations 2:5 will be fetching data predominantly from the same cache lines of A and B.
I should have mentioned, PROG_A should align arrays A and B at cache line (64 bytes).
PROG_B, if using aligned arrays, would likely vectorize as well, however the vector width would be 5-wide, using unaligned and masked loads and stores. This would be somewhat less efficient: fewer lanes of vector used and lanes will span cache lines.
*** CAUTION ***
This simple test ONLY indicates which arrangement is better for this particular code loop. Your actual use may differ significantly from that of this test program. The determined layout for this test program is not necessarily the best layout for the actual application. (IOW run test on actual code).
@Tim P. Thanks for your answer. Actually my main intention is pointed more on how to write the code well to produce the CPU's running time as lowest as possible. I'm not looking for the more specific one e.g. the compiler which can do the looping into a masked AVX512 operation. Honestly I don't know too much about it. Actually my first thought was that the true results would be given like Intel Fortran compiler did. Since I was just thinking simply that Fortran always does "column major". So, I always assume that PROG_B will always give the faster CPU's running time (so my assumption was satisfied by Intel Fortran compiler). But, I was a bit shocked, when I was using the Gfortran compiler that even PROG_B was even much slower. So, basically I'm not interested for comparing the Intel Fortran with Gfortran compiler. Based on your answer ("You may need to remove files created by one compiler when switching to another"), should I stick only on a compiler while I'm modifying my code or in other words, the code that I write would be somehow compiler dependent? Thanks in advance.
@jimdempseyatthecove: Thanks for your answers. I understand it well that these are only the simple cases. But, let's assume that my true case is something like that. If so, how could I determine that even using A(50000,5) is even worst or better than using A(5,50000) with the some criteria that looping 50,000 time should be at the most outer part (like I posted previously)? Actually I was also posting my problem on another group, and I was told that on Gfortran the possibility is:
DO K=1, 100000 DO I=1, 50000 tmp = sqrt(b(i)) A(I,1) = A(I,1) + tmp A(I,2) = A(I,2) + tmp A(I,3) = A(I,3) + tmp A(I,4) = A(I,4) + tmp A(I,5) = A(I,5) + tmp END DO END DO
If so, now this case is totally compiler dependent, isn't it? So, what is your best advice for me since now I'm just confusing if I should continue using this 2D array dimension, or I should be "playing safely" and rethinking to modify it totally into 1D array? Thanks in advance.
All suggestions would be highly appreciated.
The code I posted in #3 is pseudo code of what good compiler optimization should do with your prog_a loops.
Your code in #5 should optimize to the same machine code as my #3, but is expressed in actual code as opposed to in a representative code as I presented. The purpose of my format for presentation is such to provide for expressivity that illustrates the otherwise hidden vectorization.
Using your code as a baseline, the representative code format would be:
DO K=1, 100000 DO I=1, 50000,8 tmp(1:8) = sqrt(b(I:I+7)) A(I:I+7,1) = A(I:I+7,1) + tmp A(I:I+7,2) = A(I:I+7,2) + tmp A(I:I+7,3) = A(I:I+7,3) + tmp A(I:I+7,4) = A(I:I+7,4) + tmp A(I:I+7,5) = A(I:I+7,5) + tmp END DO END DO ! *** remainder, if any, follows
Note, you do not write the code as above, the compiler should do this for you. The format above illustrates how the compiler should vectorize the code.