Hello,
I am running a simple program to test the vectorization optimization of Intel compilers.
I am comparing the C++ and Fortran languages for this.
C++ Code (test.cpp):
#include <iostream>
#include <ctime>

int main() {
    const int Nx = 1500, Ny = 800, N = 5;
    int i, j, t;
    float Q[Nx][Ny], Q0[Nx][Ny], Q1[Nx][Ny], Q2[Nx][Ny], Q3[Nx][Ny];
    float A[Nx][Ny];
    float B[Nx][Ny];
    float iniA, iniB;
    clock_t t1, t2;
    std::cin >> iniA;
    std::cin >> iniB;
    for (i = 0; i < Nx; i++) {
        for (j = 0; j < Ny; j++) {
            Q[i][j]  = 0.0f;
            Q0[i][j] = 0.0f;
            Q1[i][j] = 0.0f;
            Q2[i][j] = 0.0f;
            Q3[i][j] = 0.0f;
            A[i][j]  = iniA;
            B[i][j]  = iniB;
        }
    }
    t1 = clock();
    for (t = 0; t < 2000; t++) {
        for (i = 0; i < Nx; i++) {
            for (j = 0; j < Ny; j++) {
                Q[i][j]  =  2.0f*A[i][j] + 3.0f*B[i][j];
                Q0[i][j] =  2.0f*A[i][j] - 3.0f*B[i][j];
                Q1[i][j] =  4.0f*A[i][j] - 3.0f*B[i][j];
                Q2[i][j] =  8.0f*A[i][j] + 3.0f*B[i][j];
                Q3[i][j] = 26.0f*A[i][j] - 3.0f*B[i][j];
            }
        }
    }
    t2 = clock();
    std::cout << "T: " << 1.0f*(t2-t1)/CLOCKS_PER_SEC << std::endl;
    std::cout << "Res: " << Q[0][0] << " " << Q0[0][0] << " " << Q1[0][0]
              << " " << Q2[0][0] << " " << Q3[0][0] << std::endl;
    return 0;
}
Fortran Code (test.f90):
PROGRAM test
  INTEGER :: Nx=1500, Ny=800, N=5, i, j, k, t
  REAL, DIMENSION(:,:), ALLOCATABLE :: Q, Q0, Q1, Q2, Q3, A, B
  REAL :: T1, T2, iniA, iniB
  READ(*,*) iniA
  READ(*,*) iniB
  ALLOCATE(Q(Nx,Ny), Q0(Nx,Ny), Q1(Nx,Ny), Q2(Nx,Ny), Q3(Nx,Ny))
  ALLOCATE(A(Nx,Ny), B(Nx,Ny))
  DO j = 1, Ny
    DO i = 1, Nx
      Q(i,j)  = 0.0
      Q0(i,j) = 0.0
      Q1(i,j) = 0.0
      Q2(i,j) = 0.0
      Q3(i,j) = 0.0
      A(i,j)  = iniA
      B(i,j)  = iniB
    ENDDO
  ENDDO
  CALL CPU_TIME(T1)
  DO t = 1, 2000
    DO j = 1, Ny
      DO i = 1, Nx
        Q(i,j)  =  2.0*A(i,j) + 3.0*B(i,j)
        Q0(i,j) =  2.0*A(i,j) - 3.0*B(i,j)
        Q1(i,j) =  4.0*A(i,j) - 3.0*B(i,j)
        Q2(i,j) =  8.0*A(i,j) + 3.0*B(i,j)
        Q3(i,j) = 26.0*A(i,j) - 3.0*B(i,j)
      ENDDO
    ENDDO
  ENDDO
  CALL CPU_TIME(T2)
  WRITE(*,*) "T: ", 1.0*(T2-T1)
  WRITE(*,*) "Res: ", Q(1,1), Q0(1,1), Q1(1,1), Q2(1,1), Q3(1,1)
END PROGRAM test
These two programs are compiled with and without vectorization, using -O3 optimization:

icpc -O3 -vec-report2 test.cpp ; icpc -O3 -no-vec test.cpp ; ifort -O3 -vec-report2 test.f90 ; ifort -O3 -no-vec test.f90

These four binaries were run on an Intel Xeon X7560, with the following results:
Fortran No Vectorization : 9.5s
Fortran Vectorized : 5.8s
C++ No Vectorization : 7.1s
C++ Vectorized : 40.8s
Vectorization in C++ increases the computation time by more than 400%. If I look at the vectorization report, I see that the inner loop (l. 34) was not vectorized but the outer loop (l. 33) was:
test.cpp(18): (col. 5) remark: LOOP WAS VECTORIZED
test.cpp(33): (col. 2) remark: LOOP WAS VECTORIZED
test.cpp(31): (col. 5) remark: loop was not vectorized: not inner loop
I don't understand this automatic choice: the i-loop is not contiguous in memory! I tried to force vectorization of the inner loop with #pragma simd, but the outer loop is always the one vectorized... The problem did not happen with Fortran (the innermost loop was vectorized).
I did not find any examples combining multidimensional arrays and vectorization on the internet or on the Intel website. I don't even know whether it is possible in C++ (at least it works in Fortran...). Do you have any solution to this problem?
Thank you
P.S.: I am working on cloud computing (fluid mechanical engineering) and I am trying to optimize my code.
---
If you are interested in better vector optimization for C++, you shouldn't depend on the compiler interchanging the loops of a nest to get stride-1 inner loops.
icpc 15.0 makes -ansi-alias a default, while earlier versions required you to specify it explicitly.
Liberal use of the __restrict qualifier is also needed to make C++ comparable to Fortran, although it may not be necessary in this example, where the arrays are defined in the same function as the for() loops. If your definition of C++ excludes that common extension, you're correct that C++ faces a handicap.
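For illustration, here is a minimal sketch of what a __restrict-qualified, stride-1 version of the hot loop could look like (the function name and the flat 1-D layout are my assumptions, not part of the original post):

// Hypothetical sketch: __restrict tells the compiler the output arrays
// cannot alias the inputs, so no runtime overlap checks are needed.
void kernel(int Nx, int Ny,
            float* __restrict Q,  float* __restrict Q0,
            float* __restrict Q1, float* __restrict Q2,
            float* __restrict Q3,
            const float* __restrict A, const float* __restrict B) {
    for (int i = 0; i < Nx; i++) {
        for (int j = 0; j < Ny; j++) {      // contiguous (stride-1) in j
            const int k = i*Ny + j;         // row-major flat index
            Q[k]  =  2.0f*A[k] + 3.0f*B[k];
            Q0[k] =  2.0f*A[k] - 3.0f*B[k];
            Q1[k] =  4.0f*A[k] - 3.0f*B[k];
            Q2[k] =  8.0f*A[k] + 3.0f*B[k];
            Q3[k] = 26.0f*A[k] - 3.0f*B[k];
        }
    }
}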
---
Hi Arthur,
I went through the vec-reports and the assembly generated for the above code segment using the latest 15.0 compiler, and here are some of my observations.
1. The loops at lines 33 and 34 are merged into one loop. How did I come to this conclusion? Here is the explanation:
Initially I generated the vec-report; it did not say much about the merging of these loops, but it did give a good amount of information about alignment:
icl issue2.cpp /Qopt-report /Qopt-report-phase:vec /Qvec-report:6
As you rightly mentioned, in the vec-report I see a vectorization remark only for the outer loop:
LOOP BEGIN at C:\Users\shv\Desktop\issue2.cpp(32,5)
remark #15542: loop was not vectorized: inner loop was already vectorized
LOOP BEGIN at C:\Users\shv\Desktop\issue2.cpp(34,2)
remark #15300: LOOP WAS VECTORIZED
LOOP END
LOOP END
Later I went into the assembly, and this is the code generation:
L4::            ; optimization report
                ; 1 LOOPS WERE COLLAPSED TO FORM THIS LOOP
                ; LOOP WAS VECTORIZED
                ; VECTORIZATION HAS UNALIGNED MEMORY REFERENCES
                ; VECTORIZATION SPEEDUP COEFFECIENT 3.783203
        movaps  xmm8, xmm1                                  ;36.32
        movaps  xmm5, XMMWORD PTR [24000096+rsp+rdx*4]      ;36.32
        movaps  xmm6, XMMWORD PTR [28800096+rsp+rdx*4]      ;36.47
        mulps   xmm8, xmm5                                  ;36.32
        mulps   xmm6, xmm3                                  ;36.47
        movaps  xmm7, xmm8                                  ;36.47
        subps   xmm8, xmm6                                  ;37.48
        addps   xmm7, xmm6                                  ;36.47
        movntps XMMWORD PTR [96+rsp+rdx*4], xmm7            ;36.17
        movaps  xmm7, xmm2                                  ;38.19
        movntps XMMWORD PTR [4800096+rsp+rdx*4], xmm8       ;37.17
        movaps  xmm8, xmm4                                  ;39.19
        mulps   xmm7, xmm5                                  ;38.19
        mulps   xmm8, xmm5                                  ;39.19
        mulps   xmm5, xmm0                                  ;40.20
        subps   xmm7, xmm6                                  ;38.34
        addps   xmm8, xmm6                                  ;39.34
        subps   xmm5, xmm6                                  ;40.35
        movntps XMMWORD PTR [9600096+rsp+rdx*4], xmm7       ;38.3
        movntps XMMWORD PTR [14400096+rsp+rdx*4], xmm8      ;39.3
        movntps XMMWORD PTR [19200096+rsp+rdx*4], xmm5      ;40.3
        add     rdx, 4                                      ;34.2
        cmp     rdx, 1200000                                ;34.2
        jb      .B1.9                                       ; Prob 99%
I see this message in the assembly: "1 LOOPS WERE COLLAPSED TO FORM THIS LOOP", which gave me a hint that the two loops had been merged. Further down is the trip count:
"cmp rdx, 1200000"
which is nothing but "Nx * Ny" (1500 * 800).
So this gives a clear picture that the loops have been merged. Since the loops are merged, the report shows the vectorization for the merged loop.
The Intel(R) Compiler has the option "/Qopt-report-embed", which, when used along with "/FA", generates the assembly with the vectorization messages embedded in the assembly file itself.
Note: I did this on the Windows* platform; the option spellings differ on Linux*.
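For example (the Linux spelling below is an assumption based on the usual /Q-to-dash option mapping in the 15.0 compilers):

icl /O3 /FA /Qopt-report-embed issue2.cpp        (Windows: writes issue2.asm with remarks embedded)
icpc -O3 -S -qopt-report-embed issue2.cpp        (Linux: assumed equivalent, writes issue2.s)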
Hope this helps. You could run a similar experiment on Fortran to see whether its loops have been merged or not.
Regards,
Sukruth H V
---
Is the point of your test to see to what extent the compiler optimizes away the loop on t?
---
Hi Tim,
My intention with this test was to show Arthur why the vec-report shows only the outer loop as vectorized.
Regards,
Sukruth H V
---
Thank you, Sukruth, for this insight. I understand that the loops are merged during compilation and that the compiler vectorizes this big loop, but why is the computation time in C++ higher in the vectorized case? In Fortran it works as it should.
I tested a simpler program using a single loop computing c = a + b, and vectorization reduces its computation time. So why does my program above not optimize correctly?
Another question for Sukruth: during compilation the loops are merged into a single one, but is this merged loop contiguous in memory? If not, that should be a problem for vectorization, right?
---
Hi Arthur,
>> I tested a simpler program using a single loop computing c = a + b, and vectorization reduces its computation time. So why does my program above not optimize correctly?
Did you compare the computation times of Fortran and C++ with this simpler loop, and are they the same? I will also investigate this slowdown further and get back to you.
>> Another question for Sukruth: during compilation the loops are merged into a single one, but is this merged loop contiguous in memory? If not, that should be a problem for vectorization, right?
Yes, the compiler makes sure the memory access is contiguous; that is why we see the (merged) loop getting vectorized. If the memory access were not contiguous, the vectorizer would not vectorize the loop unless you forced it to with the simd pragmas.
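To make that concrete: for fixed-size 2-D arrays, the collapsed loop is equivalent to something like the sketch below (the function and pointer names are mine, not from the compiler output). A single index k = 0 .. Nx*Ny-1 walks each array contiguously in row-major order, which matches the trip count of 1,200,000 seen in the assembly:

// Hypothetical collapsed form of the i/j nest: one flat stride-1 loop.
void collapsed(int Nx, int Ny,
               float* __restrict q, const float* __restrict a,
               const float* __restrict b) {
    for (int k = 0; k < Nx*Ny; k++)
        q[k] = 2.0f*a[k] + 3.0f*b[k];   // likewise for Q0..Q3
}

It would be called as collapsed(Nx, Ny, &Q[0][0], &A[0][0], &B[0][0]), since &Q[0][0] + k visits every element of the fixed-size array in order.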
Regards,
Sukruth H V
---
With the second code, I also compared Fortran and C++, and they give nearly the same results:
Fortran No Vectorization : 5.07s
Fortran Vectorized : 3.37s
C++ No Vectorization : 5.05s
C++ Vectorized : 3.33s
For both of them, the computation time is reduced when vectorization is enabled, which is normal.
For this new program, I used one loop, a simple calculation (a + b), and integer instead of float. Integer uses 16 bits instead of the 32 bits of a float. With the X7560 processor, the register size is 128 bits, which allows us to process 4 operands per instruction (automatic parallelization).
I still don't understand the problem with my previous program...
Fortran code:
PROGRAM test
  INTEGER :: i, n, input_0, input_1
  REAL :: T1, T2
  INTEGER, DIMENSION(:), ALLOCATABLE :: a, b, c
  INTEGER :: max = 50000
  ALLOCATE(a(max), b(max), c(max))
  READ(*,*) input_0
  READ(*,*) input_1
  DO i = 1, max
    a(i) = input_0
    b(i) = input_1
  ENDDO
  CALL CPU_TIME(T1)
  DO n = 1, 100000
    DO i = 1, max
      c(i) = a(i) + b(i)
    ENDDO
  ENDDO
  CALL CPU_TIME(T2)
  WRITE(*,*) (T2-T1)
  WRITE(*,*) c(1)
END PROGRAM test
C++ code:
#include <iostream>
#include <ctime>
//#include "omp.h"
#define MAX 50000
using namespace std;

int main() {
    int i, n, input_0, input_1;
    int a[MAX], b[MAX], c[MAX];
    clock_t T1, T2;
    cin >> input_0;
    cin >> input_1;
    //input_0 = 1;
    //input_1 = 5;
    for (i = 0; i < MAX; i++) {
        a[i] = input_0;
        b[i] = input_1;
    }
    T1 = clock();
    //T1 = omp_get_wtime();
    for (n = 0; n < 100000; n++) {
        for (i = 0; i < MAX; i++)
            c[i] = a[i] + b[i];
    }
    T2 = clock();
    //T2 = omp_get_wtime();
    cout << (T2-T1)*1.0/CLOCKS_PER_SEC << endl;
    cout << c[0] << endl;
    return 0;
}
---
>>Integer uses 16 bit instead of the 32 bit of the float.
INTEGER (Fortran) and int (C++) are both 32-bit.
Use INTEGER(2) (Fortran) and short (C++) for 16-bit integers.
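A quick way to verify the sizes on a given platform is a trivial check like this sketch:

#include <iostream>

int main() {
    // On typical x86-64 targets: int = 4 bytes, short = 2, float = 4.
    std::cout << "int:   " << sizeof(int)   << " bytes\n";
    std::cout << "short: " << sizeof(short) << " bytes\n";
    std::cout << "float: " << sizeof(float) << " bytes\n";
    return 0;
}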
Jim Dempsey
---
I think the inner loop at line #34 was the best candidate for vectorization, mainly because its size is divisible by four, allowing a load of four float values per iteration when not unrolled. On the other hand, I am not sure whether the 1-D rows were scattered all over the memory space. I did not expect the compiler to vectorize the outermost loop. Reading @sukruth-v's analysis, it seems the compiler decided to linearize the 2-D arrays by collapsing the loops.
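For reference, the lane arithmetic with 128-bit SSE registers and 4-byte floats works out as:

128-bit register / 32-bit float = 4 lanes per packed instruction
Ny = 800 = 4 x 200  ->  the C inner loop splits into exactly 200 full vectors, with no scalar remainder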
---
Hi,
We are discussing the reason for the performance degradation with our development team. I will update you soon on this.
Regards,
Sukruth H V
---
In post #1, the C code used [x][y], making y the stride-1 index, while the Fortran code used (x,y), making x the stride-1 index.
The two loop orders were reversed between C and Fortran, so each inner loop uses stride one, but the extents of the stride-1 index of the arrays differ.
This makes the inner loop of the C program a multiple of 8 (Ny=800), while the inner loop of the Fortran program is not a multiple of 8 (Nx=1500). This could require the Fortran program to use unaligned loads.
Not to mention that the arrays are transposed in memory relative to each other.
One of the two programs should have swapped the indices of the allocation.
Jim Dempsey
---
Hello,
I reversed the order of the loops in the C code, and I also defined Nx = Ny = 1000 (a multiple of 8). The results are:
Fortran No Vectorization : 3.8s
Fortran Vectorized : 2.1s
C++ No Vectorization : 5.33s
C++ Vectorized : 33.6s
We can see from these results that Fortran is now better than C++ in both the vectorized and non-vectorized builds. The problem with C++ is still there, as discussed in the comments above: the nested loops in C are merged into one loop, and this big loop (Nx*Ny) is then not well vectorized by the Intel compiler...
I wanted to use C++ for my CFD program, but from these results on optimization, the mathematical libraries available, and the way arrays are written in C++, I have to conclude that C++ is not designed for scientific programs the way Fortran is. It's too bad, because I prefer the way objects are handled in C++.
Best regards
---
Try using "#pragma nosimd" on the outer loop and "#pragma simd" on the inner loop.
I know you should not have to do this, and reporting the quirk here is valuable to the Intel development team.
The above #pragmas may get you by until a fix is made.
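Applied to the original nest, the suggestion would look something like the sketch below (pragma spellings as given above; as noted, they should not normally be required):

for (t = 0; t < 2000; t++) {
#pragma nosimd                  // discourage vectorizing the outer loop
    for (i = 0; i < Nx; i++) {
#pragma simd                    // request vectorization of the stride-1 inner loop
        for (j = 0; j < Ny; j++) {
            Q[i][j] = 2.0f*A[i][j] + 3.0f*B[i][j];
            // ... remaining Q0..Q3 assignments as in the original ...
        }
    }
}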
Jim Dempsey
---
I happened to think of one other issue. The compiler may be having problems determining loop-invariant code when the 2-D arrays have fixed dimensions, i.e., are not arrays of pointers. Try the following:
for (t = 0; t < 2000; t++) {
    for (i = 0; i < Nx; i++) {
        float* __restrict Qi  = &(Q[i][0]);
        float* __restrict Q0i = &(Q0[i][0]);
        float* __restrict Q1i = &(Q1[i][0]);
        float* __restrict Q2i = &(Q2[i][0]);
        float* __restrict Q3i = &(Q3[i][0]);
        float* __restrict Ai  = &(A[i][0]);
        float* __restrict Bi  = &(B[i][0]);
        for (j = 0; j < Ny; j++) {
            Qi[j]  =  2.0f*Ai[j] + 3.0f*Bi[j];
            Q0i[j] =  2.0f*Ai[j] - 3.0f*Bi[j];
            Q1i[j] =  4.0f*Ai[j] - 3.0f*Bi[j];
            Q2i[j] =  8.0f*Ai[j] + 3.0f*Bi[j];
            Q3i[j] = 26.0f*Ai[j] - 3.0f*Bi[j];
        }
    }
}
Jim Dempsey