Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.
7944 Discussions

AVX is slower than serial execution ?

zhang_y_1
Beginner
846 Views

I wrote a simple program and built it with icpc to examine the performance of AVX on my machine. The code snippet is as follows:

  #define T 2000000
  #define X 16
  #define Y 16
  #define Z 16

  for(int t=0;t<T;t++)
  for(int k=0;k<Z;k++)
  for(int j=0;j<Y;j++)
  for(int i=0;i<X;i++)
    A[k][j][i]=B[k][j][i]+C[k][j][i];

The configuration is as follows:

            icpc version 13.1.0 (gcc version 4.6.1 compatibility)

            FFLAGS="-O3 -xhost "

            Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz

            Red Hat Enterprise Linux Server release 6.3 (Santiago)

The experiment results are as follows:

size              12*12*12    16*16*16    32*32*32
iterations        2000000     2000000     200000
serial time (s)   1.09918     2.58384     2.99971
avx time (s)      1.71405     4.01935     5.18318

As the table shows, the AVX version always takes more time than the serial version.

Does anybody know why?

Thanks in advance!!!

13 Replies
Bernard
Valued Contributor I
846 Views

Did you check if the AVX code was vectorized?

zhang_y_1
Beginner
846 Views

Yes, I have read the assembly.

        vmovupd   B(,%rdx,8), %ymm0                             #35.16
        vmovupd   32+B(,%rdx,8), %ymm2                          #35.16
        vmovupd   64+B(,%rdx,8), %ymm4                          #35.16
        vmovupd   96+B(,%rdx,8), %ymm6                          #35.16
        vaddpd    C(,%rdx,8), %ymm0, %ymm1                      #35.27
        vaddpd    32+C(,%rdx,8), %ymm2, %ymm3                   #35.27
        vaddpd    64+C(,%rdx,8), %ymm4, %ymm5                   #35.27
        vaddpd    96+C(,%rdx,8), %ymm6, %ymm7                   #35.27
        vmovupd   %ymm1, A(,%rdx,8)                             #35.5
        vmovupd   %ymm3, 32+A(,%rdx,8)                          #35.5
        vmovupd   %ymm5, 64+A(,%rdx,8)                          #35.5
        vmovupd   %ymm7, 96+A(,%rdx,8)                          #35.5
        addq      $16, %rdx                                     #32.3
        cmpq      $4096, %rdx                                   #32.3
        jb        ..B1.9        # Prob 99%                      #32.3

 

Bernard
Valued Contributor I
846 Views

Can you post the disassembly of the serial version of the code?

zhang_y_1
Beginner
846 Views

OK, this is the assembly,

..B1.7:
        vmovsd    B(,%rdx,8), %xmm0                             #35.16
        vaddsd    C(,%rdx,8), %xmm0, %xmm1                      #35.27
        vmovsd    %xmm1, A(,%rdx,8)                             #35.5
        incq      %rdx                                          #32.3
        cmpq      $4096, %rdx                                   #32.3
        jb        ..B1.7        # Prob 99%                      #32.3

 

TimP
Honored Contributor III
846 Views

Are you hiding all those levels of loops in disassembly and the details of how you tested for a reason? Maybe more shortcuts were taken in the comparison.   Then again, maybe it doesn't matter for a compiler that old.

Feilong_H_Intel
Employee
846 Views

Hi Zhang,

A small runnable test case would help other users figure out what went wrong for you.

Thanks.
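
For reference, a small test case along these lines might look like the sketch below. It is not the original poster's program: the global `double` arrays are an assumption based on the 8-byte scaling in the posted disassembly, and `init_inputs`, `run_sweeps` and `checksum` are hypothetical helper names. The GCC-style `asm volatile` barrier is also an assumption, added so that the compiler cannot collapse the repeated sweeps into one. A C++11 toolchain is assumed for `<chrono>`.

```cpp
#include <chrono>

// Sizes mirror the #defines in the original post.
constexpr int T = 2000000;
constexpr int X = 16;
constexpr int Y = 16;
constexpr int Z = 16;

// Assumption: statically allocated arrays of double.
double A[Z][Y][X], B[Z][Y][X], C[Z][Y][X];

// Fill the inputs with known values so the result can be checked.
void init_inputs() {
    for (int k = 0; k < Z; k++)
        for (int j = 0; j < Y; j++)
            for (int i = 0; i < X; i++) {
                B[k][j][i] = 1.0;
                C[k][j][i] = 2.0;
            }
}

// Run `reps` sweeps of A = B + C and return the elapsed wall-clock seconds.
double run_sweeps(int reps) {
    auto start = std::chrono::steady_clock::now();
    for (int t = 0; t < reps; t++) {
        for (int k = 0; k < Z; k++)
            for (int j = 0; j < Y; j++)
                for (int i = 0; i < X; i++)
                    A[k][j][i] = B[k][j][i] + C[k][j][i];
        // Assumption: GCC-compatible compiler barrier (an extension, not
        // standard C++), so identical sweeps are not optimized away.
        asm volatile("" ::: "memory");
    }
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}

// Sum of A, used to confirm that the loop actually ran.
double checksum() {
    double sum = 0.0;
    for (int k = 0; k < Z; k++)
        for (int j = 0; j < Y; j++)
            for (int i = 0; i < X; i++)
                sum += A[k][j][i];
    return sum;
}
```

Calling `init_inputs()` and then `run_sweeps(T)` from `main`, and printing the returned time together with `checksum()`, would give one measurement; building the same file with and without vectorization would then allow a like-for-like comparison.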

Bernard
Valued Contributor I
846 Views

 

It looks like the 4x unrolled version is executed 500k times, and for each iteration of the outer T loop there are 4096 cycles in which the inner loops are executed. Each such inner-loop cycle consists of 12 AVX instructions. This could be the reason for the slower execution of the unrolled version.

zhang_y_1
Beginner
846 Views

Hi iliyapolak.

I can't understand what you said. Could you give a more detailed explanation?

Bernard
Valued Contributor I
846 Views

@zhang

Sorry, I probably formulated my answer badly.

I meant that the unrolled version executed more AVX instructions in total than the serial version.

 

Bernard
Valued Contributor I
846 Views

Small correction to post #3: the outermost loop is not unrolled. Looking at the posted disassembly, it seems that the k and j loops were collapsed or fused into one loop, which is unrolled 4x. This unrolling probably contributed to the worse performance of the "vector" version by inserting more machine code instructions.
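
For anyone who wants to check the instruction counts against the two listings posted earlier, here is a rough tally as a C++ sketch. It assumes that %rdx starts at 0 and that each listing is the complete loop body; the helper `instructions_per_sweep` and the constant names are hypothetical. Under those assumptions the vector loop executes fewer instructions per sweep than the scalar loop, so instruction count alone may not account for the timings. The tally says nothing about alignment, memory traffic, or how the time was measured.

```cpp
#include <cstdint>

// Instructions executed in one pass over `elements` doubles, given how many
// elements each loop iteration advances (`step`) and how many instructions
// the loop body contains (`body_size`).
constexpr std::uint64_t instructions_per_sweep(std::uint64_t elements,
                                               std::uint64_t step,
                                               std::uint64_t body_size) {
    return (elements / step) * body_size;
}

// Loop bound used by cmpq in both listings.
constexpr std::uint64_t kElements = 4096;

// Vector listing: addq $16 per iteration; the body holds 12 AVX instructions
// plus addq, cmpq and jb.
constexpr std::uint64_t kVectorAll = instructions_per_sweep(kElements, 16, 15);  // 3840
constexpr std::uint64_t kVectorAvx = instructions_per_sweep(kElements, 16, 12);  // 3072

// Scalar listing: incq per iteration; the body holds 3 VEX-encoded scalar
// instructions (vmovsd, vaddsd, vmovsd) plus incq, cmpq and jb.
constexpr std::uint64_t kScalarAll = instructions_per_sweep(kElements, 1, 6);   // 24576
constexpr std::uint64_t kScalarVex = instructions_per_sweep(kElements, 1, 3);   // 12288
```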

jimdempseyatthecove
Honored Contributor III
846 Views

Ensure that arrays A, B, and C are cache-line aligned.

Jim Dempsey
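
One way to request and verify such alignment is sketched below. It assumes a 64-byte cache line (the line size on this generation of Xeon), global `double` arrays as in the earlier sketches, and a compiler that supports the C++11 `alignas` specifier; with an older compiler, the GCC-compatible `__attribute__((aligned(64)))` spelling is an alternative. The helper `is_aligned` is a hypothetical name.

```cpp
#include <cstddef>
#include <cstdint>

// Sizes mirror the #defines in the original post.
constexpr int X = 16;
constexpr int Y = 16;
constexpr int Z = 16;

// Assumption: 64-byte cache lines.
constexpr std::size_t kCacheLine = 64;

// Request cache-line alignment for each array.
alignas(64) double A[Z][Y][X];
alignas(64) double B[Z][Y][X];
alignas(64) double C[Z][Y][X];

// Returns true when the address `p` is a multiple of `alignment` bytes.
bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

Checking `is_aligned(A, kCacheLine)` (and likewise for B and C) at startup confirms that the request took effect before any timing is done.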
