I wrote a simple program and built it with icpc to examine AVX performance on my machine. The code snippet is as follows:
#define T 2000000
#define X 16
#define Y 16
#define Z 16

for(int t=0;t<T;t++)
  for(int k=0;k<Z;k++)
    for(int j=0;j<Y;j++)
      for(int i=0;i<X;i++)
        A=B +C ;
The configuration is as follows:
icpc version 13.1.0 (gcc version 4.6.1 compatibility)
FFLAGS="-O3 -xhost "
Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
Red Hat Enterprise Linux Server release 6.3 (Santiago)
The experimental results are as follows:
iterations | 2000000 | 2000000 | 200000 |
size | 12*12*12 | 16*16*16 | 32*32*32 |
serial time (s) | 1.09918 | 2.58384 | 2.99971 |
avx time (s) | 1.71405 | 4.01935 | 5.18318 |
As the table shows, the AVX version always takes more time than the serial version.
Does anybody know why?
Thanks in advance!!!
Did you check if the AVX code was vectorized?
Yes, I have read the assembly.
vmovupd B(,%rdx,8), %ymm0 #35.16
vmovupd 32+B(,%rdx,8), %ymm2 #35.16
vmovupd 64+B(,%rdx,8), %ymm4 #35.16
vmovupd 96+B(,%rdx,8), %ymm6 #35.16
vaddpd C(,%rdx,8), %ymm0, %ymm1 #35.27
vaddpd 32+C(,%rdx,8), %ymm2, %ymm3 #35.27
vaddpd 64+C(,%rdx,8), %ymm4, %ymm5 #35.27
vaddpd 96+C(,%rdx,8), %ymm6, %ymm7 #35.27
vmovupd %ymm1, A(,%rdx,8) #35.5
vmovupd %ymm3, 32+A(,%rdx,8) #35.5
vmovupd %ymm5, 64+A(,%rdx,8) #35.5
vmovupd %ymm7, 96+A(,%rdx,8) #35.5
addq $16, %rdx #32.3
cmpq $4096, %rdx #32.3
jb ..B1.9 # Prob 99% #32.3
Can you post the disassembly of the serial version of the code?
OK, this is the assembly,
..B1.7:
vmovsd B(,%rdx,8), %xmm0 #35.16
vaddsd C(,%rdx,8), %xmm0, %xmm1 #35.27
vmovsd %xmm1, A(,%rdx,8) #35.5
incq %rdx #32.3
cmpq $4096, %rdx #32.3
jb ..B1.7 # Prob 99% #32.3
Are you hiding all those levels of loops in disassembly and the details of how you tested for a reason? Maybe more shortcuts were taken in the comparison. Then again, maybe it doesn't matter for a compiler that old.
Hi Zhang,
A small runnable test case would help other users figure out what went wrong for you.
Thanks.
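A minimal sketch of what such a test case could look like, for the 16*16*16 size. The `[k][j][i]` subscripts, the static arrays, the input values, and the helper names `init_inputs`, `add_arrays`, `time_kernel`, and `checksum` are all assumptions, since the posted snippet shows neither the array declarations nor the indexing. It uses C++11 `<chrono>`, which a compiler as old as the one in this thread may not fully support.

```cpp
#include <chrono>

// Problem size taken from the original post.
constexpr int X = 16;
constexpr int Y = 16;
constexpr int Z = 16;

// Assumption: static arrays of double, indexed [k][j][i].
static double A[Z][Y][X];
static double B[Z][Y][X];
static double C[Z][Y][X];

// Fill the inputs with known values so the result can be verified.
void init_inputs() {
    for (int k = 0; k < Z; ++k)
        for (int j = 0; j < Y; ++j)
            for (int i = 0; i < X; ++i) {
                A[k][j][i] = 0.0;
                B[k][j][i] = 1.0;
                C[k][j][i] = 2.0;
            }
}

// The kernel under test: one element-wise pass over the arrays.
void add_arrays() {
    for (int k = 0; k < Z; ++k)
        for (int j = 0; j < Y; ++j)
            for (int i = 0; i < X; ++i)
                A[k][j][i] = B[k][j][i] + C[k][j][i];
}

// Run the kernel `repeats` times and return the elapsed wall time in seconds.
double time_kernel(int repeats) {
    const auto start = std::chrono::steady_clock::now();
    for (int t = 0; t < repeats; ++t)
        add_arrays();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Sum of all results; using it keeps the compiler from discarding the work.
double checksum() {
    double sum = 0.0;
    for (int k = 0; k < Z; ++k)
        for (int j = 0; j < Y; ++j)
            for (int i = 0; i < X; ++i)
                sum += A[k][j][i];
    return sum;
}
```

Calling `init_inputs()`, then `time_kernel(2000000)`, then printing `checksum()` from `main` gives one timing per build. Because every pass writes the same values, an optimizer may skip repeated passes, so the two builds should be checked to confirm they do the same amount of work.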
It looks like the 4x-unrolled version is executed 500k times, and for each iteration of the outer for loop over T there are 4096 cycles in which the inner loops are executed. Each such inner-loop cycle consists of 12 AVX instructions. This could be the reason for the slower execution of the unrolled version.
Hi iliyapolak.
I can't understand what you said. Could you give a more detailed explanation?
@zhang
Sorry, I probably formulated my answer badly.
I meant that the unrolled version executed more AVX instructions in total than the serial version.
Small correction to post #3: the outermost loop is not unrolled. Judging by the posted disassembly, it seems that the k and j loops were collapsed or fused into one loop, which is unrolled 4x. This unrolling probably contributed to the worse performance of the "vector" version by inserting more machine code instructions.
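The instruction counts under discussion can be tallied directly from the two listings posted earlier in the thread. The sketch below assumes both loops start with the index register at 0 (the listings do not show the initialization) and run until it reaches 4096; the `LoopShape` struct and the function names are hypothetical helpers for illustration. "SIMD-encoded" here covers both the packed `ymm` instructions and the scalar `vmovsd`/`vaddsd` instructions of the serial loop.

```cpp
#include <cstdint>

// Shape of one compiled loop, read off the posted disassembly.
struct LoopShape {
    std::int64_t bound;          // loop exit bound on the index register (4096)
    std::int64_t step;           // index increment per iteration
    std::int64_t simd_per_iter;  // vmov*/vadd* instructions per iteration
    std::int64_t other_per_iter; // add/inc, cmp, jb per iteration
};

// Vector loop: 4 loads + 4 adds + 4 stores on ymm registers, index += 16.
constexpr LoopShape kVector{4096, 16, 12, 3};
// Serial loop: 1 load + 1 add + 1 store on xmm registers, index += 1.
constexpr LoopShape kSerial{4096, 1, 3, 3};

// Number of loop iterations per pass over the arrays.
constexpr std::int64_t iterations(const LoopShape& s) {
    return s.bound / s.step;
}

// SIMD-encoded instructions executed per pass.
constexpr std::int64_t simd_instructions(const LoopShape& s) {
    return iterations(s) * s.simd_per_iter;
}

// All instructions executed per pass, including loop control.
constexpr std::int64_t total_instructions(const LoopShape& s) {
    return iterations(s) * (s.simd_per_iter + s.other_per_iter);
}
```

Under these assumptions the vector loop runs 256 iterations and 3840 instructions per pass, against 4096 iterations and 24576 instructions for the serial loop. By this count the unrolled loop executes fewer instructions per pass, so instruction count alone does not seem to account for the measured slowdown.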
Ensure that arrays A, B, and C are cache-line aligned.
Jim Dempsey
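One way to request and check that alignment is sketched below, assuming statically allocated arrays of double and a 64-byte cache line. The flat array layout, the constant names, and the helper `is_cache_line_aligned` are assumptions for illustration. `alignas` is standard C++11; a compiler that predates it would need its own alignment attribute instead.

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines.
constexpr std::size_t kCacheLine = 64;
// 16*16*16 elements, matching the 4096 loop bound in the posted disassembly.
constexpr std::size_t N = 16 * 16 * 16;

// Request cache-line alignment for each array.
alignas(kCacheLine) static double A[N];
alignas(kCacheLine) static double B[N];
alignas(kCacheLine) static double C[N];

// Returns true if the address is a multiple of the cache-line size.
bool is_cache_line_aligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % kCacheLine == 0;
}
```

Checking `is_cache_line_aligned(A)` and the same for `B` and `C` at startup confirms the request was honored. A 64-byte boundary is also a 32-byte boundary, so no 256-bit load or store that starts at a multiple of four doubles will straddle a cache line.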