Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

Vectorization with nested loop

Arthur_P_
Beginner
1,258 Views

Hello,

I am running a simple program to test the vectorization optimization of Intel compilers.

I am comparing the C++ and Fortran languages for this.

C++ Code (test.cpp):

#include <iostream>
#include <ctime>

int main()
{

    const int Nx=1500, Ny=800, N=5;
    int i,j,t;
    float Q[Nx][Ny], Q0[Nx][Ny], Q1[Nx][Ny], Q2[Nx][Ny], Q3[Nx][Ny];
    float A[Nx][Ny];
    float B[Nx][Ny];
    float iniA, iniB;
    clock_t t1, t2;    

    std::cin >> iniA;
    std::cin >> iniB;

    for (i=0; i<Nx; i++) {
         for (j=0; j<Ny; j++) {
                Q[i][j] = 0.0f;
                Q0[i][j] = 0.0f;
                Q1[i][j] = 0.0f;
                Q2[i][j] = 0.0f;
                Q3[i][j] = 0.0f;
                A[i][j] = iniA;
                B[i][j] = iniB;
         }
    }

    t1 = clock();
    for (t=0; t<2000; t++) {

	for (i=0; i<Nx; i++) {
            for (j=0; j<Ny; j++) {
                Q[i][j] = 2.0f*A[i][j] + 3.0f*B[i][j];
                Q0[i][j] = 2.0f*A[i][j] - 3.0f*B[i][j];
                Q1[i][j] = 4.0f*A[i][j] - 3.0f*B[i][j];
                Q2[i][j] = 8.0f*A[i][j] + 3.0f*B[i][j];
                Q3[i][j] = 26.0f*A[i][j] - 3.0f*B[i][j];
            }
        }

    }
    t2 = clock();

    std::cout << "T: " << 1.0f*(t2-t1)/CLOCKS_PER_SEC << std::endl;
    std::cout << "Res: " << Q[0][0] << " " << Q0[0][0] << " " << Q1[0][0] << " " << Q2[0][0] << " " << Q3[0][0] << std::endl;

    return 0;
}

Fortran Code (test.f90):

PROGRAM test

    	integer :: Nx=1500, Ny=800, N=5, i ,j ,k, t
    	REAL, dimension (:,:), allocatable :: Q, Q0, Q1, Q2, Q3, A, B
    	REAL T1, T2, iniA, iniB    

    	READ(*,*) iniA
    	READ(*,*) iniB

    	ALLOCATE(Q(Nx,Ny),Q0(Nx,Ny),Q1(Nx,Ny),Q2(Nx,Ny),Q3(Nx,Ny))
	ALLOCATE(A(Nx,Ny),B(Nx,Ny)) 

        DO j = 1, Ny
            DO i = 1, Nx
                Q(i,j) = 0.0
                Q0(i,j) = 0.0
                Q1(i,j) = 0.0
                Q2(i,j) = 0.0
                Q3(i,j) = 0.0
                A(i,j) = iniA
                B(i,j) = iniB
            ENDDO
        ENDDO

    	CALL CPU_TIME(T1)
    	DO t = 1, 2000

        DO j = 1, Ny
            DO i = 1, Nx
                Q(i,j) = 2.0*A(i,j) + 3.0*B(i,j)
                Q0(i,j) = 2.0*A(i,j) - 3.0*B(i,j)
                Q1(i,j) = 4.0*A(i,j) - 3.0*B(i,j)
                Q2(i,j) = 8.0*A(i,j) + 3.0*B(i,j)
                Q3(i,j) = 26.0*A(i,j) - 3.0*B(i,j)
            ENDDO
        ENDDO

    	ENDDO
    	CALL CPU_TIME(T2)

    	WRITE(*,*) "T: ", 1.0*(T2-T1)
    	WRITE(*,*) "Res: ", Q(1,1),Q0(1,1),Q1(1,1),Q2(1,1),Q3(1,1)

END PROGRAM test

 

These two programs are compiled with and without vectorization, using -O3 optimization.

icpc -O3 -vec-report2 test.cpp ; icpc -O3 -no-vec test.cpp ; ifort -O3 -vec-report2 test.f90 ; ifort -O3 -no-vec test.f90

These 4 programs ran on an Intel X7560 and the results are:

Fortran No Vectorization : 9.5s

Fortran Vectorized : 5.8s

C++ No Vectorization : 7.1s

C++ Vectorized : 40.8s 

 

Vectorization in C++ increases the computation time by about 400%. If I look at the vectorization report, I see that the inner loop (l.34) was not vectorized but the outer loop (l.33) was.

test.cpp(18): (col. 5) remark: LOOP WAS VECTORIZED

test.cpp(33): (col. 2) remark: LOOP WAS VECTORIZED

test.cpp(31): (col. 5) remark: loop was not vectorized: not inner loop

I don't understand this behavior; the i-loop is not contiguous in memory! I tried to force the vectorization using the simd pragma (#pragma simd) but the outer loop is always the one vectorized... The problem did not happen with Fortran (the innermost loop was vectorized).

I did not find examples of vectorization with multidimensional arrays on the internet or on the Intel website. I don't even know if it is possible in C++ (at least it works in Fortran...). Do you have any solution to this problem?

Thank you

P.S.: I am working on cloud computing (fluid mechanical engineering) and I am trying to optimize my code.

14 Replies
TimP
Honored Contributor III

If you were interested in better vector optimization for C++, you wouldn't depend on the compiler interchanging loop nests to get stride-1 inner loops.

icpc 15.0 makes -ansi-alias a default, while you had to specify that in earlier versions. 

Liberal use of the __restrict qualifier is also needed to make C++ comparable to Fortran, although it may not be necessary in this example, where the arrays are defined in the same function as the for().  If your definition of C++ excludes that common extension, you're correct that C++ faces a handicap.

Sukruth_H_Intel
Employee

Hi Arthur,

                I have gone through the vec-reports and assembly generated for the above code segment using the latest 15.0 compiler, and these are some of my observations:

1. Loops at lines 33 and 34 are merged into one loop. How did I come to this conclusion?

here is the explanation :-

Initially I generated the vec-report; it did not give much information about the merging of these loops, but it did give a good amount of info on alignment:

icl issue2.cpp /Qopt-report /Qopt-report-phase:vec /Qvec-report:6

As you rightly mentioned, in the vec-report I see vectorization reported only for the outer loop.

LOOP BEGIN at C:\Users\shv\Desktop\issue2.cpp(32,5)
   remark #15542: loop was not vectorized: inner loop was already vectorized

   LOOP BEGIN at C:\Users\shv\Desktop\issue2.cpp(34,2)
      remark #15300: LOOP WAS VECTORIZED
   LOOP END
LOOP END

Later I went into the assembly and this is the code generation :-

L4::            ; optimization report
                ; 1 LOOPS WERE COLLAPSED TO FORM THIS LOOP
                ; LOOP WAS VECTORIZED
                ; VECTORIZATION HAS UNALIGNED MEMORY REFERENCES
                ; VECTORIZATION SPEEDUP COEFFECIENT 3.783203
        movaps    xmm8, xmm1                                    ;36.32
        movaps    xmm5, XMMWORD PTR [24000096+rsp+rdx*4]        ;36.32
        movaps    xmm6, XMMWORD PTR [28800096+rsp+rdx*4]        ;36.47
        mulps     xmm8, xmm5                                    ;36.32
        mulps     xmm6, xmm3                                    ;36.47
        movaps    xmm7, xmm8                                    ;36.47
        subps     xmm8, xmm6                                    ;37.48
        addps     xmm7, xmm6                                    ;36.47
        movntps   XMMWORD PTR [96+rsp+rdx*4], xmm7              ;36.17
        movaps    xmm7, xmm2                                    ;38.19
        movntps   XMMWORD PTR [4800096+rsp+rdx*4], xmm8         ;37.17
        movaps    xmm8, xmm4                                    ;39.19
        mulps     xmm7, xmm5                                    ;38.19
        mulps     xmm8, xmm5                                    ;39.19
        mulps     xmm5, xmm0                                    ;40.20
        subps     xmm7, xmm6                                    ;38.34
        addps     xmm8, xmm6                                    ;39.34
        subps     xmm5, xmm6                                    ;40.35
        movntps   XMMWORD PTR [9600096+rsp+rdx*4], xmm7         ;38.3
        movntps   XMMWORD PTR [14400096+rsp+rdx*4], xmm8        ;39.3
        movntps   XMMWORD PTR [19200096+rsp+rdx*4], xmm5        ;40.3
        add       rdx, 4                                        ;34.2
        cmp       rdx, 1200000                                  ;34.2
        jb        .B1.9         ; Prob 99%       

I see this message in the assembly, "1 LOOPS WERE COLLAPSED TO FORM THIS LOOP", which gave me a hint that the 2 loops had been merged. Further down is the trip count:

"cmp       rdx, 1200000"

which is nothing but "Nx * Ny" (1500 * 800).

So this gives a clear picture that the loops have been merged. Since the loops are merged, the report shows vectorization for the merged loop.

The Intel(R) Compiler has the option "/Qopt-report-embed", which when used along with "/FA" generates the assembly with the vectorization messages embedded in the assembly file itself.

Note: I have done this on a Windows* platform; the option names differ on Linux*.

Hope this helps. You may do a similar experiment with Fortran to see whether its loops have been merged or not.

Regards,

Sukruth H V 

TimP
Honored Contributor III

Is the point of your test to see to what extent the compiler optimizes away the loop on t?
 

Sukruth_H_Intel
Employee

Hi Tim,

            My intention for this test was to let Arthur know why the vectorization report shows only the outer loop being vectorized.

Regards,

Sukruth H V

Arthur_P_
Beginner

Thank you Sukruth for this insight. I understand that the loops are merged during compilation and that the compiler applies vectorization to this big loop, but why is the computation time in C++ higher in the vectorized case? In Fortran it works as it should.

I tested a simpler program using a single loop calculating c=a+b, and it reduces the computation time when vectorization is activated. So why doesn't my program above optimize correctly?

Another question for Sukruth: during compilation the loops are merged into a single one; is this merged loop contiguous in memory? If not, that would be a problem for vectorization, right?

Sukruth_H_Intel
Employee

Hi Arthur,

                I tested a simpler program using a single loop calculating c=a+b, and it reduces the computation time when vectorization is activated. So why doesn't my program above optimize correctly?

Did you compare the computation time between Fortran and C++ with this simpler loop, and are they the same? I will also investigate this run-time increase further and get back to you on the same.

Another question for Sukruth: during compilation the loops are merged into a single one; is this merged loop contiguous in memory? If not, that would be a problem for vectorization, right?

Yes, the compiler makes sure the data/memory access is contiguous; we can see that the (merged) loop is getting vectorized. If the memory access were not contiguous, the vectorizer would not vectorize the loop unless you forced it with simd pragmas.

Regards,

Sukruth H V

Arthur_P_
Beginner

With the second code, I also compared Fortran and C++ and they give roughly the same results:

Fortran No Vectorization : 5.07s

Fortran Vectorized : 3.37s

C++ No Vectorization : 5.05s

C++ Vectorized : 3.33s 

For both of them, the computation time is reduced when vectorization is activated, which is expected.

For this new program, I used one loop, a simple calculation (a+b), and integer instead of float. Integer uses 16 bit instead of the 32 bit of the float. With the X7560 processor, the register size is 128 bit, which allows us to perform 4 elements per instruction (automatic parallelization).

I still don't understand the problem of my previous program...

Code fortran:

PROGRAM test

 INTEGER :: i,n,input_0, input_1
 REAL :: T1, T2
 INTEGER, DIMENSION(:), ALLOCATABLE :: a, b, c    
 INTEGER :: max = 50000
 
 ALLOCATE(a(max),b(max),c(max))

 READ(*,*) input_0
 READ(*,*) input_1

 DO i=1,max
    a(i) = input_0
    b(i) = input_1
 ENDDO

 CALL CPU_TIME(T1)
 DO n=1,100000

    DO i=1,max 
	c(i) = a(i) + b(i)
    ENDDO

 ENDDO

 CALL cpu_time(T2)

 WRITE(*,*) (T2-T1)

 WRITE(*,*) c(1)

END PROGRAM test

Code C++:

#include <iostream>
#include <ctime>
//#include "omp.h"
#define MAX 50000

using namespace std;

int main() {
    
 int i, n, input_0, input_1;
 int a[MAX], b[MAX], c[MAX];
 clock_t T1,T2;

 cin >> input_0;
 cin >> input_1;
 //input_0 = 1;
 //input_1 = 5;


 for (i=0; i<MAX; i++) {
    a[i] = input_0;
    b[i] = input_1;
 }

 T1 = clock();
 //T1 = omp_get_wtime();
 for(n=0; n<100000; n++) {

    for (i=0; i<MAX; i++) 
        c[i] = a[i] + b[i];

 }

 T2 = clock();
 //T2 = omp_get_wtime(); 

 cout << (T2-T1)*1.0/CLOCKS_PER_SEC << endl;

 cout << c[0] << endl;

 return 0;
}

 

jimdempseyatthecove
Honored Contributor III

>>Integer uses 16 bit instead of the 32 bit of the float.

INTEGER (Fortran) and int (C++) are both 32-bit

Use INTEGER(2) (Fortran) and short (C++) for 16-bit integers.

Jim Dempsey

Bernard
Valued Contributor I

I think the inner loop at line #34 was the best candidate for vectorization, mainly because its size is divisible by four, allowing a load of four float values per iteration when not unrolled; on the other hand, I am not sure whether the 1D arrays were scattered all over the memory space. I did not expect the compiler to vectorize the outermost loop. Reading @sukruth-v's analysis, it seems the compiler decided to linearize the 2D arrays by collapsing the loops.

Sukruth_H_Intel
Employee

Hi,

     We are discussing the reason for the performance degradation with our dev team. I will update you soon on this.

Regards,
Sukruth H V

jimdempseyatthecove
Honored Contributor III

In the post #1

The C code used [x][y], making the y index the stride-1 index.

The Fortran code used (x,y), making x the stride-1 index.

While the two loop orders were reversed between C and Fortran, so that each inner loop uses stride 1, the sizes of the stride-1 dimension of the arrays differ.

This makes the inner loop of the C program a multiple of 8 (Ny=800); for the Fortran program the inner loop is not a multiple of 8 (Nx=1500). This (could) require the Fortran program to use unaligned loads.

Not to mention that the arrays are transposed in memory.

One of the two should have swapped the indices of the allocation.

Jim Dempsey

Arthur_P_
Beginner

Hello,

I reversed the order of the loops in the C code and I also set Nx=Ny=1000 (a multiple of 8). The results are:

Fortran No Vectorization : 3.8s

Fortran Vectorized : 2.1s

C++ No Vectorization : 5.33s

C++ Vectorized : 33.6s 

We can see from these results that Fortran is now somewhat faster than C++ in both the vectorized and non-vectorized builds. The C++ problem discussed in the comments above is still there: the nested loops in C are merged into one big loop (Nx*Ny), and that loop is not vectorized well by the Intel compiler...

I wanted to use C++ for my CFD program, but from the previous/current results on optimization, the mathematical libraries available, and the way arrays are handled, I conclude that C++ is not as well suited to scientific programs as FORTRAN. It's too bad, because I prefer the way objects are handled in C++.

Best regards

jimdempseyatthecove
Honored Contributor III

Try using "#pragma novector" on the outer loop and "#pragma simd" on the inner loop.

I know you should not have to do this, and reporting the quirk here is valuable to the Intel development team.

The above #pragmas may get you by until a fix is made.

Jim Dempsey

jimdempseyatthecove
Honored Contributor III

I happened to think of one other issue. The compiler may be having problems determining loop-invariant code when the 2D array has fixed dimensions, in other words when it is not an array of pointers. Try the following:

for (t=0; t<2000; t++) {
  for (i=0; i<Nx; i++) {
    float* __restrict Qi = &(Q[i][0]);
    float* __restrict Q0i = &(Q0[i][0]);
    float* __restrict Q1i = &(Q1[i][0]);
    float* __restrict Q2i = &(Q2[i][0]);
    float* __restrict Q3i = &(Q3[i][0]);
    float* __restrict Ai = &(A[i][0]);
    float* __restrict Bi = &(B[i][0]);
    for (j=0; j<Ny; j++) {
      Qi[j] = 2.0f*Ai[j] + 3.0f*Bi[j];
      Q0i[j] = 2.0f*Ai[j] - 3.0f*Bi[j];
      Q1i[j] = 4.0f*Ai[j] - 3.0f*Bi[j];
      Q2i[j] = 8.0f*Ai[j] + 3.0f*Bi[j];
      Q3i[j] = 26.0f*Ai[j] - 3.0f*Bi[j];
    }
  }
}

Jim Dempsey
