Software Archive
Read-only legacy content

Array notation surprisingly slow

leifblaese
Beginner

Hi,

I made an unusual discovery today. I wanted to compare the speed of a program using array notation with the speed of an equivalent program without it. The program adds two arrays of 1,000,000 integers one thousand times and then prints the result (see below).

[cpp]
#include <iostream>
using namespace std;

int main(int argc, char **argv) {
    int array1[1000000];
    int array2[1000000];
    int a = 1;
    // initialise both arrays
    for (int i = 0; i < 1000000; i++) {
        array1[i] = a;
        array2[i] = a;
    }
    // add array2 to array1 one thousand times
    for (int j = 0; j < 1000; j++) {
        for (int i = 0; i < 1000000; i++) {
            array1[i] += array2[i];
        }
    }
    cout << array1[0] << endl;
    return 0;
}
[/cpp]

The second program is its array-notation equivalent

[cpp]
#include <iostream>
using namespace std;

int main(int argc, char **argv) {
    int array1[1000000];
    int array2[1000000];
    int a = 1;
    // initialise both arrays with array notation
    array1[0:1000000] = a;
    array2[0:1000000] = a;
    for (int j = 0; j < 1000; j++) {
        array1[0:1000000] += array2[0:1000000];
    }
    cout << array1[0] << endl;
    return 0;
}
[/cpp]

Compiled without any optimization, the first program (without array notation) took about 2.9 seconds and the second program took about 1.9 seconds. With optimization level -O3, the first program takes 0.6 seconds while the second still takes 1.2 seconds, twice as long as the first.

Obviously the two for-loops of the first program get vectorized when using the -O3 option. Still, I would have thought the array-notation version would be at least as fast. Did I miss something? Maybe there is a bug in optimizing array-notation code, so that it generates less efficient output. If not, what do I have to do to make the array notation as fast as the "normal" code?

I used a recent build of the cilkplus branch of GCC (a few days old) on a Core i7.
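If anyone wants to reproduce the numbers, a minimal timing harness along these lines should do for the array-notation version (the plain-loop version is analogous; std::chrono needs -std=c++11 and is just one way to take the time):

[cpp]
#include <iostream>
#include <chrono>
using namespace std;

// arrays at file scope so the 8 MB of data does not blow the stack
int array1[1000000];
int array2[1000000];

int main() {
    array1[0:1000000] = 1;
    array2[0:1000000] = 1;

    auto start = chrono::steady_clock::now();
    for (int j = 0; j < 1000; j++) {
        array1[0:1000000] += array2[0:1000000];
    }
    auto stop = chrono::steady_clock::now();

    cout << array1[0] << endl;  // print the result so the loop cannot be optimized away
    cout << chrono::duration<double>(stop - start).count() << " s" << endl;
    return 0;
}
[/cpp]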

leifblaese
Beginner
When you exchange the line [cpp] array1[0:1000000] += array2[0:1000000]; [/cpp] with a for-loop that does exactly the same thing, you get a vectorized version. It then takes the same time as the program that does not use array notation, and the assembler code uses the %xmm registers. So the problem really is with generating SIMD code from array notation inside a for-loop.
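In other words, keeping the outer j-loop and replacing only the inner statement with its scalar equivalent (sketch):

[cpp]
for (int j = 0; j < 1000; j++) {
    // scalar replacement for array1[0:1000000] += array2[0:1000000];
    for (int i = 0; i < 1000000; i++) {
        array1[i] += array2[i];
    }
}
[/cpp]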
leifblaese
Beginner
I looked at the assembler code (compiled with -O3 and -S) and it seems that the second version of my program (the array-notation version) is not being vectorized: when adding up the arrays, it does not use the %xmm registers but %edx. Oddly enough, the array-notation assignments (array1[0:1000000] = a, and likewise array2) do use %xmm0. So there seems to be a problem only with vectorizing array notation nested in a for-loop. This is the first time I have looked at assembler code, so maybe someone else could try to replicate the problem and take a look too.
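If someone wants to try it, a reduced test case along these lines shows both effects (a sketch; the arrays are at file scope here just to rule out stack issues):

[cpp]
#include <iostream>
using namespace std;

int a[1000000], b[1000000];

int main() {
    // plain array-notation assignments: the part that used %xmm0 above
    a[0:1000000] = 1;
    b[0:1000000] = 1;

    // array notation nested in a for-loop: the part that stayed scalar (%edx) above
    for (int j = 0; j < 1000; j++) {
        a[0:1000000] += b[0:1000000];
    }

    cout << a[0] << endl;  // keep the result observable
    return 0;
}
[/cpp]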
TimP
Honored Contributor III
It's hard to guess the resulting code in a case like this, where the compiler has an opportunity to short-cut your code and suppress the loops entirely.
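For example, only array1[0] is printed and every input is a compile-time constant, so an aggressive optimizer is in principle allowed to collapse the whole benchmark to something like this sketch:

[cpp]
#include <iostream>
using namespace std;

int main() {
    // array1[0] starts at 1 and has 1 added to it 1000 times
    cout << (1 + 1000 * 1) << endl;  // prints 1001
    return 0;
}
[/cpp]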
leifblaese
Beginner
I have the same problem when using array notation with inline-if statements (the ?: operator). Consider the edit distance of two strings A and B. You can write the comparison step as

[cpp]
char A[100] = { ... }; // I didn't want to write out all 100 symbols
char B[100] = { ... };
int s[100];
s[0:100] = (A[0:100] == B[0:100]) ? 1 : 0;
cout << s[0:100];
[/cpp]

Because of the output at the end, the compiler is forced to do the whole work, without shortcuts or anything. Still, when compiling with -O3 and looking at the assembler code, I get something like the following. EDIT: this is the part where the two strings are compared, basically a loop that compares them byte by byte.

[cpp]
.L2:
	movzbl	112(%rsp,%rax), %ecx
	cmpb	%cl, (%rsp,%rax)
	sete	%cl
	addl	$1, %edx
	movzbl	%cl, %ecx
	movl	%ecx, 224(%rsp,%rax,4)
	addq	$1, %rax
	cmpl	$99, %edx
	jle	.L2
	movl	224(%rsp), %esi
	movl	$_ZSt4cout, %edi
	call	_ZNSolsEi
	movl	228(%rsp), %esi
	movl	$_ZSt4cout, %edi
	call	_ZNSolsEi
	movl	232(%rsp), %esi
	movl	$_ZSt4cout, %edi
	call	_ZNSolsEi
	movl	236(%rsp), %esi
	movl	$_ZSt4cout, %edi
	call	_ZNSolsEi
	movl	240(%rsp), %esi
	movl	$_ZSt4cout, %edi
	call	_ZNSolsEi
	xorl	%eax, %eax
	addq	$632, %rsp
	.cfi_def_cfa_offset 8
	ret
	.cfi_endproc
[/cpp]

The code itself is not that interesting, but note that there is no sign of %xmm registers or of any instructions you would expect from SIMD code (like movdqa). On the other hand, when you use a "normal" for-loop, you can generate SIMD code without a problem. Adding '#pragma simd' to that loop leads to valid SIMD code as well, but I think that is only because the compiler automatically detects that there is no dependence (it works without the pragma, too). I don't know whether this is only a problem with gcc or also with the Intel compiler; maybe someone could check that.
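For comparison, the "normal" for-loop version that does get vectorized looks roughly like this sketch (the Cilk Plus '#pragma simd' is optional here, since gcc sees there is no dependence anyway):

[cpp]
// element-wise comparison written as a plain loop; this one vectorizes
#pragma simd
for (int i = 0; i < 100; i++) {
    s[i] = (A[i] == B[i]) ? 1 : 0;
}
[/cpp]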