zhang_y_1
Beginner
142 Views

AVX is slower than serial execution ?

I wrote a simple program and built it with icpc to examine the performance of AVX on my machine. The code snippet is as follows:

  #define T 2000000
  #define X 16
  #define Y 16
  #define Z 16

  for(int t=0;t<T;t++)
  for(int k=0;k<Z;k++)
  for(int j=0;j<Y;j++)
  for(int i=0;i<X;i++)
    A=B+C;

The configuration is as follows:

            icpc version 13.1.0 (gcc version 4.6.1 compatibility)

            FFLAGS="-O3 -xhost "

            Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz

            Red Hat Enterprise Linux Server release 6.3 (Santiago)

The experimental results are as follows:

iterations (T)     2000000     2000000     200000
size               12*12*12    16*16*16    32*32*32
serial time (s)    1.09918     2.58384     2.99971
AVX time (s)       1.71405     4.01935     5.18318

As the table shows, the AVX version always takes more time than the serial version.

Does anybody know why?

Thanks in advance!!!

13 Replies
Bernard
Black Belt
142 Views

Did you check if the AVX code was vectorized?

zhang_y_1
Beginner
142 Views

Yes, I have read the assembly.

        vmovupd   B(,%rdx,8), %ymm0                             #35.16
        vmovupd   32+B(,%rdx,8), %ymm2                          #35.16
        vmovupd   64+B(,%rdx,8), %ymm4                          #35.16
        vmovupd   96+B(,%rdx,8), %ymm6                          #35.16
        vaddpd    C(,%rdx,8), %ymm0, %ymm1                      #35.27
        vaddpd    32+C(,%rdx,8), %ymm2, %ymm3                   #35.27
        vaddpd    64+C(,%rdx,8), %ymm4, %ymm5                   #35.27
        vaddpd    96+C(,%rdx,8), %ymm6, %ymm7                   #35.27
        vmovupd   %ymm1, A(,%rdx,8)                             #35.5
        vmovupd   %ymm3, 32+A(,%rdx,8)                          #35.5
        vmovupd   %ymm5, 64+A(,%rdx,8)                          #35.5
        vmovupd   %ymm7, 96+A(,%rdx,8)                          #35.5
        addq      $16, %rdx                                     #32.3
        cmpq      $4096, %rdx                                   #32.3
        jb        ..B1.9        # Prob 99%                      #32.3

 

Bernard
Black Belt
142 Views

Can you post the disassembly of the serial version of the code?

zhang_y_1
Beginner
142 Views

OK, this is the assembly,

..B1.7:
        vmovsd    B(,%rdx,8), %xmm0                             #35.16
        vaddsd    C(,%rdx,8), %xmm0, %xmm1                      #35.27
        vmovsd    %xmm1, A(,%rdx,8)                             #35.5
        incq      %rdx                                          #32.3
        cmpq      $4096, %rdx                                   #32.3
        jb        ..B1.7        # Prob 99%                      #32.3

 

TimP
Black Belt
142 Views

Are you hiding all those levels of loops in the disassembly, and the details of how you tested, for a reason? Maybe more shortcuts were taken in the comparison. Then again, maybe it doesn't matter for a compiler that old.

Feilong_H_Intel
Employee
142 Views

Hi Zhang,

A small runnable test case would help other users figure out what went wrong for you.

Thanks.
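
A test case of the kind requested here might look like the sketch below. It is not the original poster's program: the flat static arrays, the function names (init_arrays, time_add, results_ok) and the use of an empty inline-asm compiler barrier (a GCC-compatible extension) are all assumptions made for illustration, and it needs a C++11 compiler. Only the 16*16*16 problem size and the A = B + C operation come from the thread.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Problem size from the original post (16*16*16 doubles per array).
constexpr std::size_t X = 16, Y = 16, Z = 16;
constexpr std::size_t N = X * Y * Z;

// Assumption: flat, statically allocated double arrays. The thread does not
// show the real declarations.
static double A[N], B[N], C[N];

// Fill the inputs with known values so the result can be checked afterwards.
void init_arrays() {
    for (std::size_t i = 0; i < N; ++i) {
        A[i] = 0.0;
        B[i] = static_cast<double>(i);
        C[i] = 2.0 * static_cast<double>(i);
    }
}

// Run the element-wise add `passes` times and return the elapsed seconds.
double time_add(int passes) {
    assert(passes >= 0);
    const auto start = std::chrono::steady_clock::now();
    for (int t = 0; t < passes; ++t) {
        for (std::size_t i = 0; i < N; ++i)
            A[i] = B[i] + C[i];
        // Compiler barrier (GCC-compatible extension, an assumption here):
        // keeps the optimizer from collapsing the repeated identical passes.
        asm volatile("" ::: "memory");
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Every element of A must equal B[i] + C[i] = 3*i after at least one pass.
bool results_ok() {
    for (std::size_t i = 0; i < N; ++i)
        if (A[i] != 3.0 * static_cast<double>(i)) return false;
    return true;
}
```

Calling init_arrays(), then time_add(2000000), then results_ok() from main and printing the time would reproduce the 16*16*16 column of the table. Building the same source twice, once with vectorization enabled and once with it disabled, keeps everything else in the comparison identical.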

Bernard
Black Belt
142 Views

 

It looks like the 4x-unrolled version is executed 500k times, and for each iteration of the outer T loop there are 4096 cycles in which the inner loops are executed. Each such inner-loop cycle consists of 12 AVX instructions. This could be the reason for the slower execution of the unrolled version.

zhang_y_1
Beginner
142 Views

Hi iliyapolak.

I can't understand what you said. Could you give a more detailed explanation?

Bernard
Black Belt
142 Views

@zhang

Sorry, I probably formulated my answer badly.

I meant that the unrolled version executes more AVX instructions in total than the serial version.

 

Bernard
Black Belt
142 Views

Small correction to post #3: the outermost loop is not unrolled. Looking at the posted disassembly, it seems that the k and j loops were collapsed or fused into one loop, which is unrolled 4x. This unrolling probably contributed to the worse performance of the "vector" version by inserting more machine instructions.
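
As a rough cross-check of the instruction counts discussed here, the two posted listings can be tallied directly. The sketch below assumes that %rdx counts array elements (doubles), that each listing is the complete hot loop for one pass over the arrays, and that the outer T loop simply repeats it. The struct and function names are illustrative only.

```cpp
#include <cstdint>

// Shape of one loop body as read off the posted disassembly.
struct LoopShape {
    std::int64_t elements_per_iter;  // how far %rdx advances per iteration
    std::int64_t instrs_per_iter;    // every instruction in the loop body
    std::int64_t fp_instrs_per_iter; // loads, adds and stores only
};

// Loop bound taken from `cmpq $4096, %rdx` in both listings.
constexpr std::int64_t kTripLimit = 4096;

// Scalar loop: vmovsd, vaddsd, vmovsd, incq, cmpq, jb.
constexpr LoopShape kScalar{1, 6, 3};

// Vector loop: 4 vmovupd loads, 4 vaddpd, 4 vmovupd stores, addq $16, cmpq, jb.
constexpr LoopShape kVector{16, 15, 12};

// Iterations needed for one pass over the 4096 elements.
constexpr std::int64_t iterations(const LoopShape& s) {
    return kTripLimit / s.elements_per_iter;
}

// Total instructions executed in one pass.
constexpr std::int64_t total_instrs(const LoopShape& s) {
    return iterations(s) * s.instrs_per_iter;
}

// Floating-point (load/add/store) instructions executed in one pass.
constexpr std::int64_t total_fp_instrs(const LoopShape& s) {
    return iterations(s) * s.fp_instrs_per_iter;
}

static_assert(iterations(kVector) == 256, "4096 / 16 elements per iteration");
static_assert(total_instrs(kVector) == 3840, "256 * 15");
static_assert(total_instrs(kScalar) == 24576, "4096 * 6");
static_assert(total_fp_instrs(kVector) == 3072, "256 * 12");
static_assert(total_fp_instrs(kScalar) == 12288, "4096 * 3");
```

Under those assumptions the vector loop runs 256 times per pass and issues fewer instructions than the scalar loop, so instruction count alone may not account for the measured slowdown.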

jimdempseyatthecove
Black Belt
142 Views

Ensure that arrays A, B, and C are cache-line aligned.

Jim Dempsey
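
One possible way to act on this suggestion is sketched below. It assumes statically allocated arrays, a 64-byte cache line (the usual size on this class of Xeon), and a compiler that supports C++11 alignas. The thread does not show how A, B, and C were actually declared, so the declarations and helper names here are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines.
constexpr std::size_t kCacheLine = 64;
constexpr std::size_t N = 16 * 16 * 16;

// Request cache-line alignment for each array (hypothetical declarations).
alignas(kCacheLine) static double A[N];
alignas(kCacheLine) static double B[N];
alignas(kCacheLine) static double C[N];

// True if the address is a multiple of the cache-line size.
bool is_cache_line_aligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % kCacheLine == 0;
}

// Debug-build check that all three arrays start on a cache-line boundary.
void require_aligned_arrays() {
    assert(is_cache_line_aligned(A));
    assert(is_cache_line_aligned(B));
    assert(is_cache_line_aligned(C));
}
```

On compilers that predate alignas, __attribute__((aligned(64))) is the GCC-compatible spelling of the same request. For heap-allocated arrays the alignment has to be requested from the allocator instead.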
