Solved: Vectorization Selection

srimks · ‎01-27-2009

Hi All.

Below is a piece of code taken from the main() of a multi-file C++ program -
--

register int j;

for (j = 0; j < 128; j++ ) {
orange = 0;
apple = 0;
}
--

The objdump output for the above section is -

--
for (j = 0; j < 128; j++ ) {
4021af: 33 c0 xor %eax,%eax
orange = 0;
4021b1: 66 0f ef c0 pxor %xmm0,%xmm0
4021b5: 66 0f 7f 84 84 50 a0 movdqa %xmm0,0x54a050(%rsp,%rax,4)
4021bc: 54 00
apple = 0;
4021be: 66 0f 7f 84 84 50 c0 movdqa %xmm0,0x54c050(%rsp,%rax,4)
4021c5: 54 00
4021c7: 66 0f 7f 84 84 60 a0 movdqa %xmm0,0x54a060(%rsp,%rax,4)
4021ce: 54 00
4021d0: 66 0f 7f 84 84 60 c0 movdqa %xmm0,0x54c060(%rsp,%rax,4)
4021d7: 54 00
4021d9: 66 0f 7f 84 84 70 a0 movdqa %xmm0,0x54a070(%rsp,%rax,4)
4021e0: 54 00
4021e2: 66 0f 7f 84 84 70 c0 movdqa %xmm0,0x54c070(%rsp,%rax,4)
4021e9: 54 00
4021eb: 66 0f 7f 84 84 80 a0 movdqa %xmm0,0x54a080(%rsp,%rax,4)
4021f2: 54 00
4021f4: 66 0f 7f 84 84 80 c0 movdqa %xmm0,0x54c080(%rsp,%rax,4)
4021fb: 54 00
4021fd: 48 83 c0 10 add $0x10,%rax
402201: 48 3d 00 08 00 00 cmp $0x800,%rax
402207: 7c a8 jl 4021b1

--

But when I include "#pragma distribute_point" as -
--
register int j;

for (j = 0; j < 128; j++ ) {
orange = 0;
#pragma distribute_point
apple = 0;
}
--
the objdump is -
--
for (j = 0; j < 128; j++ ) {
orange = 0;
4021af: 48 8d bc 24 50 44 54 lea 0x544450(%rsp),%rdi
4021b6: 00
4021b7: 33 f6 xor %esi,%esi
4021b9: ba 00 20 00 00 mov $0x2000,%edx
4021be: e8 9d 3d 05 00 callq 455f60 <_intel_fast_memset>
#pragma distribute_point
apple = 0;
4021c3: 48 8d bc 24 50 64 54 lea 0x546450(%rsp),%rdi
4021ca: 00
4021cb: 33 f6 xor %esi,%esi
4021cd: ba 00 20 00 00 mov $0x2000,%edx
4021d2: e8 89 3d 05 00 callq 455f60 <_intel_fast_memset>
}
--

Which one should I select and why?

Note: The above section of code is part of a large set of C++ files compiled with "-O3 -fomit-frame-pointer -ffunction-sections" using ICC v11.0. The goal is better run-time performance.

~BR

alpmestan · ‎01-28-2009

I think the second one is better.
It calls intel's fast memset and will in my opinion be faster.
But just try it, executing both hmm... a million times ?
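
A minimal sketch of such a timing test, in C. The array length N = 2048 is an assumption, inferred from the 0x2000-byte memset size in the objdump with 4-byte elements. The names `zero_fused`, `zero_memset`, `time_it` and `sink` are hypothetical and not from the original code.

```c
#include <string.h>
#include <time.h>

/* Assumed element count: 0x2000 bytes / 4-byte int (see the memset size above). */
#define N 2048

static int orange[N];
static int apple[N];

/* Read through a volatile so the zeroing is less likely to be optimized away. */
static volatile int sink;

/* Variant 1: one loop that stores to both arrays (two streams). */
static void zero_fused(void) {
    for (int j = 0; j < N; j++) {
        orange[j] = 0;
        apple[j] = 0;
    }
}

/* Variant 2: one memset call per array (one stream at a time). */
static void zero_memset(void) {
    memset(orange, 0, sizeof(orange));
    memset(apple, 0, sizeof(apple));
}

/* Run fn `reps` times and return the elapsed CPU time in seconds. */
static double time_it(void (*fn)(void), long reps) {
    clock_t start = clock();
    for (long r = 0; r < reps; r++) {
        fn();
        sink = orange[r % N] + apple[r % N];
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Comparing `time_it(zero_fused, 1000000)` with `time_it(zero_memset, 1000000)` gives a rough answer for one machine and one compiler only. It is worth checking the generated code as well, since at -O3 a compiler may turn either variant into the other.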

TimP · ‎01-28-2009

Quoting - alpmestan

I think the second one is better.
It calls intel's fast memset and will in my opinion be faster.
But just try it, executing both hmm... a million times ?

Where memset is used frequently, that substitution may improve performance by reducing code size and instruction cache misses. As the loop count is usually not known at compile time, the choice among the various branches inside memset may earn its keep. In your case, since the compiler knows the loop count and can save overhead and reduce code size by setting both arrays in one loop, its default heuristics are probably good.

jimdempseyatthecove · ‎01-28-2009

I agree with tim18: try both and choose the better one. Note that the better route may vary depending on the iteration count and CPU architecture. For your short iteration count, the inline code is likely faster.

Also, you might want to pass the generated code for the first section back to Premier Support. The loop is not optimal. Notice that the loop branches back to the pxor of the xmm register used for zeroing (the loop could branch back to the instruction following the pxor). Also, the add and cmp can be moved back to interleave with the last two movdqa's (adjusting the base constant to accommodate the difference). That could probably pick up a few clock ticks, but maybe not enough to warrant the effort.

Jim Dempsey
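
For illustration, the intent described above (zero register set up once, two store streams per iteration) can be sketched with SSE2 intrinsics. This is only a sketch under stated assumptions: the function name `zero_two_streams` is hypothetical, both pointers are assumed to be 16-byte aligned, and n is assumed to be a multiple of 4. The compiler still decides the final instruction scheduling.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Zero two int arrays in a single loop, 4 ints (16 bytes) per store.
   Assumes a and b are 16-byte aligned and n is a multiple of 4. */
static void zero_two_streams(int *a, int *b, size_t n) {
    /* Zero value created once, outside the loop body. */
    const __m128i zero = _mm_setzero_si128();
    for (size_t i = 0; i < n; i += 4) {
        _mm_store_si128((__m128i *)(a + i), zero);  /* aligned 16-byte store */
        _mm_store_si128((__m128i *)(b + i), zero);  /* second stream */
    }
}
```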

dpeterc · ‎01-28-2009

Quoting - srimks

register int j;

for (j = 0; j < 128; j++ ) {
orange = 0;
apple = 0;
}

I am more into C than assembler.
Why wouldn't you write it as

memset(orange, 0, sizeof(orange));
memset(apple, 0, sizeof(apple));

I assume that hitting one address in sequence and then another is faster than doing two in parallel, due to cache locality. I also assume that intel's implementation of memset is faster than generic compiled code.

In the end, I also think that probably your bottleneck is somewhere in calculations, not in initialisation by zero.

TimP · ‎01-28-2009

Quoting - dpeterc

I am more into C than assembler.
Why wouldn't you write it as

memset(orange, 0, sizeof(orange));
memset(apple, 0, sizeof(apple));

I assume that hitting one address in sequence and then another is faster than doing two in parallel, due to cache locality. I also assume that intel's implementation of memset is faster than generic compiled code.

In the end, I also think that probably your bottleneck is somewhere in calculations, not in initialisation by zero.

At least up to 4 streams, storing parallel streams in a single loop is potentially faster. Earlier in the thread, it was demonstrated that the compiler chooses memset only when you tell it to use 1 stream per loop.
One of the bigger advantages of memset(), when the length of the stream isn't known at compile time, is that it will switch to non-temporal stores (bypassing cache) when the stream is long enough that it would otherwise evict most of the cached data. In that case, only the odd remainders at the beginning and end of the stream are stored through cache.
The compiler writers decided there is sufficient advantage to having the compiler make decisions between in-line code and memset. icc, with default options, will use intel memset regardless of whether it decided to make the substitution or you wrote memset in the source yourself.
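
A rough sketch of that kind of length-based dispatch, in C with SSE2 intrinsics. This is not Intel's memset implementation: the function name `zero_bytes` and the 1 MiB cutoff `NT_THRESHOLD_BYTES` are hypothetical, and a real library would tune the cutoff to the cache size.

```c
#include <emmintrin.h>  /* SSE2: _mm_stream_si128, _mm_sfence */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical cutoff, chosen only for illustration. */
#define NT_THRESHOLD_BYTES (1u << 20)

static void zero_bytes(void *dst, size_t len) {
    unsigned char *p = dst;

    /* Short stream: ordinary stores through cache. */
    if (len < NT_THRESHOLD_BYTES) {
        memset(p, 0, len);
        return;
    }

    /* Head: bytes before the first 16-byte boundary, stored through cache. */
    size_t head = (size_t)(-(uintptr_t)p & 15);
    memset(p, 0, head);
    p += head;
    len -= head;

    /* Body: non-temporal 16-byte stores that bypass the cache. */
    const __m128i zero = _mm_setzero_si128();
    size_t body = len & ~(size_t)15;
    for (size_t i = 0; i < body; i += 16)
        _mm_stream_si128((__m128i *)(p + i), zero);
    _mm_sfence();  /* make the streaming stores visible before later stores */

    /* Tail: leftover bytes after the last full 16-byte block. */
    memset(p + body, 0, len - body);
}
```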