Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

nonstandard loop is not a vectorization candidate.

srimks
New Contributor II
Hi All.

(a)
I applied the following pragma to the code below:

#pragma distribute point
for (j = 0, i = 0; j < MAX; j++) {
apple = 0;
orange = j;
fruits = 0;
}

The for loop above starts at line 149 of X.cc, and I get these messages:

X.cc(149): (col. 5) remark: LOOP WAS VECTORIZED.
X.cc(149): (col. 5) remark: loop was not vectorized: vectorization possible but seems inefficient.

Query: Why does it first say "LOOP WAS VECTORIZED" but then, for the same line number, say "loop was not vectorized: vectorization possible but seems inefficient"?

(b)
Also, for the part of the code below:
--
for (ia=0; (ia AB_out = is_out_grid_info(bc[ia][0], bc[ia][1], bc[ia][2]);
}

I get this message:

X.cc(266): (col. 13) remark: loop was not vectorized: nonstandard loop is not a vectorization candidate.

Query: I am not able to interpret this message, specifically "nonstandard loop is not a vectorization candidate".
--

I am using ICC v11.0 on 64-bit Linux (x86_64) for C++ code.

~BR
1 Solution
jimdempseyatthecove
Honored Contributor III

BR,

Thanks for your confidence in my insights. Remember that these are insights, not specific declarations of fact for a particular processor architecture.

(i) The movntdq is advantageous under some circumstances and not under others. In the case where one thread wipes a buffer and another thread (not sharing the same cache) uses the buffer for read or read/modify/write, you avoid depleting the wiping thread's cache (and that of the threads sharing that cache), and this would be a good use of movntdq. This also has benefits when the wiping thread (or its cache-sharing partner) uses the buffer some time later. However, if the wiping thread (or its cache-sharing partner) were to use this buffer immediately, it would be disadvantageous to use movntdq.

(ii) Without knowing how/when/by whom the buffer is used after the wipe/initialization, it would be premature to declare the use of movntdq good or bad. See (i) for additional comments on this.

(iii) What you are (or should be) most interested in is the best memory bus utilization. High bus utilization is a good thing when the bus is used effectively; effective meaning that memory does not stall waiting for the next write, and that the fewest possible writes are performed to complete the task. (a) will be (or should be) capable of write combining and thus provide very effective bus utilization; (b), on the other hand, is going to have to defer the movdqa until after the paddd completes. Depending on the pipeline in the CPU hardware architecture, this may or may not interfere with the write combining and/or stall the memory bus. If this were a problem, it could be fixed in the compiler by using 3 more xmm registers and performing all 4 paddd's after all the movdqa's.
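To make the trade-off concrete, here is a minimal sketch (the function name and array shape are mine, not from the thread) of a non-temporal wipe using the intrinsic behind movntdq; per Jim's point, prefer it only when the wiping thread's cache should not be polluted:

```cpp
#include <emmintrin.h>  // SSE2: _mm_stream_si128 (movntdq), _mm_sfence
#include <cstddef>

// Non-temporal wipe: the stores bypass the cache, like the movntdq the
// compiler emitted. Assumes p is 16-byte aligned and n is a multiple of 4.
void wipe_streaming(int* p, std::size_t n) {
    __m128i zero = _mm_setzero_si128();
    for (std::size_t i = 0; i < n; i += 4)
        _mm_stream_si128(reinterpret_cast<__m128i*>(p + i), zero);
    _mm_sfence();  // order the streaming stores before any later reads
}
```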

Jim Dempsey


20 Replies
TimP
Honored Contributor III
Quoting - srimks
#pragma distribute point
for (j = 0, i = 0; j < MAX; j++) {
apple = 0;
orange = j;
fruits = 0;
}

X.cc(149): (col. 5) remark: LOOP WAS VECTORIZED.
X.cc(149): (col. 5) remark: loop was not vectorized: vectorization possible but seems inefficient.

(b)
Also, for below part of code -
--
for (ia=0; (ia AB_out = is_out_grid_info(bc[ia][0], bc[ia][1], bc[ia][2]);
}

X.cc(266): (col. 13) remark: loop was not vectorized: nonstandard loop is not a vectorization candidate.

You leave out enough information that uninformed guesses may be off target.
I suppose your first loop is "distributed" (split) into vectorizable (stride 1) and non-vectorizable (I guess variable stride, but you don't show it) loops. #pragma distribute point would have no effect where you put it outside the loop. If you put it inside at the top of the loop so as to stop distribution, presumably you would prevent partial vectorization.
In the second loop, evidently, little can be said without seeing the macro or function. If it is not in-lined, there is no chance for vectorization.
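As a sketch of Tim's point about pragma placement, putting #pragma distribute point inside the loop body asks ICC to split the loop at that spot (array names and sizes here are illustrative, not from the original code; other compilers simply ignore the unknown pragma):

```cpp
#define MAX 1024
int a[MAX];
int b[MAX][MAX];

// Placed inside the loop body, #pragma distribute point is an ICC hint to
// distribute (split) the loop at that point, separating the inner-loop work.
void fill_distributed() {
    for (int j = 0; j < MAX; j++) {
        a[j] = j;  // unit-stride store: a good vectorization candidate
#pragma distribute point
        for (int i = 0; i < MAX; i++)
            b[j][i] = 0;
    }
}
```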
Vladimir_T_Intel
Moderator
a) Take a look at generated asm. Most probably two loops were generated. One of them is vectorized.
b) Looks like non-library/non-inlined function call within loop body.

jimdempseyatthecove
Honored Contributor III

Your fruits = 0; is getting in the way. You may have copied this from a Fortran program, where the leftmost index varies fastest; in C++ the rightmost index varies fastest.
Two or three loops might work better.
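Jim's "two or three loops" suggestion might look like the sketch below; the one-dimensional shapes are an assumption, since the original subscripts were lost when the code was posted:

```cpp
const int MAX = 1024;
int apple[MAX], orange[MAX], fruits[MAX];

// One loop per array: each loop is a simple unit-stride pattern the
// vectorizer handles well (the zeroing loops may even become memset calls).
void init_all() {
    for (int j = 0; j < MAX; j++) apple[j]  = 0;
    for (int j = 0; j < MAX; j++) orange[j] = j;
    for (int j = 0; j < MAX; j++) fruits[j] = 0;
}
```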

Jim
srimks
New Contributor II
Hi Vladimir/Tim,

The original code is:
---
#include <cstdio>

#define MAX 1024

int main()
{
int i, j;
int num[MAX], isort[MAX], cluster[MAX][MAX];

for (j = 0; j < MAX; j++) {
    num[j] = 0;
    isort[j] = j;
    for (i = 0; i < MAX; i++) {
        cluster[j][i] = 0;
    }
}
printf("%d %d %d\n", num[64], isort[78], cluster[384][74]);
return 0;
}
---
which doesn't vectorize with the given command "icpc test.cpp -S".

When this code has "#pragma distribute point" on the outer for loop, its asm is:

--
main:
..B1.1: # Preds ..B1.0
..___tag_value_main.1: #7.1
pushq %rbp #7.1
..___tag_value_main.2: #
movq %rsp, %rbp #7.1
..___tag_value_main.3: #
andq $-128, %rsp #7.1
subq $4202496, %rsp #7.1
movl $3, %edi #7.1
..___tag_value_main.5: #7.1
call __intel_new_proc_init #7.1
..___tag_value_main.6: #
# LOE rbx r12 r13 r14 r15
..B1.9: # Preds ..B1.1
stmxcsr (%rsp) #7.1
orl $32832, (%rsp) #7.1
ldmxcsr (%rsp) #7.1
lea 4194304(%rsp), %rdi #13.11
xorl %esi, %esi #13.11
movl $4096, %edx #13.11
call _intel_fast_memset #13.11
# LOE rbx r12 r13 r14 r15
..B1.2: # Preds ..B1.9
movdqa _2il0floatpacket.1(%rip), %xmm1 #12.2
movdqa _2il0floatpacket.2(%rip), %xmm0 #12.2
xorl %eax, %eax #12.2
# LOE rax rbx r12 r13 r14 r15 xmm0 xmm1
..B1.3: # Preds ..B1.3 ..B1.2
movdqa %xmm0, 4198400(%rsp,%rax,4) #14.11
paddd %xmm1, %xmm0 #14.11
movdqa %xmm0, 4198416(%rsp,%rax,4) #14.11
paddd %xmm1, %xmm0 #14.11
movdqa %xmm0, 4198432(%rsp,%rax,4) #14.11
paddd %xmm1, %xmm0 #14.11
movdqa %xmm0, 4198448(%rsp,%rax,4) #14.11
paddd %xmm1, %xmm0 #14.11
addq $16, %rax #12.2
cmpq $1024, %rax #12.2
jl ..B1.3 # Prob 99% #12.2
# LOE rax rbx r12 r13 r14 r15 xmm0 xmm1
..B1.4: # Preds ..B1.3
lea (%rsp), %rdi #17.11
movl $4194304, %edx #17.11
xorl %esi, %esi #17.11
call _intel_fast_memset #17.11
# LOE rbx r12 r13 r14 r15
..B1.5: # Preds ..B1.4
movl 4194560(%rsp), %esi #20.2
movl 4198712(%rsp), %edx #20.2
movl 1573160(%rsp), %ecx #20.2
movl $_2__STRING.0.0, %edi #20.2
xorl %eax, %eax #20.2
..___tag_value_main.7: #20.2
call printf #20.2
..___tag_value_main.8: #
# LOE rbx r12 r13 r14 r15
..B1.6: # Preds ..B1.5
xorl %eax, %eax #21.9
movq %rbp, %rsp #21.9
popq %rbp #21.9
..___tag_value_main.9: #
ret #21.9
.align 16,0x90
..___tag_value_main.11: #
# LOE
# mark_end;
.type main,@function
.size main,.-main
.data
# -- End main
----

Query
(c) By observing the asm above, how do I know whether the OUTER or the INNER for loop has been PARTIALLY vectorized?

(d) I understand "movdqa followed by paddd", meaning the data is moved and then the packed data are added; but apart from this use of vector registers, we see only 4 vectorized move and 4 vectorized add operations in sequence in the asm above. These operations are performed on a Quad-Core 5300 series machine (i.e., the Quad-Core 5300 series has 2 dies with 4 cores on each die, which means a total of 8 cores on a single node). The query is: does this vectorized movdqa-followed-by-paddd operation happen on one core, on 4 cores, or on all 8 cores of a single node?

(e) But when I use "#pragma distribute point" only on the INNER for loop, the asm is the same as above, and so is its size; any insights?

(f) If the complete for loop above is modified with "#pragma distribute point" as below:

#pragma distribute point
for (j = 0, i = 0; j < MAX; j++) {
    num[j] = 0;
    isort[j] = j;
    cluster[j][i] = 0;
}

it is PARTIALLY VECTORIZED, with asm as below:

--
main:
..B1.1: # Preds ..B1.0
..___tag_value_main.1: #7.1
pushq %rbp #7.1
..___tag_value_main.2: #
movq %rsp, %rbp #7.1
..___tag_value_main.3: #
andq $-128, %rsp #7.1
subq $4202496, %rsp #7.1
movl $3, %edi #7.1
..___tag_value_main.5: #7.1
call __intel_new_proc_init #7.1
..___tag_value_main.6: #
# LOE rbx r12 r13 r14 r15
..B1.9: # Preds ..B1.1
stmxcsr (%rsp) #7.1
orl $32832, (%rsp) #7.1
ldmxcsr (%rsp) #7.1
lea 4194304(%rsp), %rdi #13.11
xorl %esi, %esi #13.11
movl $4096, %edx #13.11
call _intel_fast_memset #13.11
# LOE rbx r12 r13 r14 r15
..B1.2: # Preds ..B1.9
movdqa _2il0floatpacket.1(%rip), %xmm1 #12.2
movdqa _2il0floatpacket.2(%rip), %xmm0 #12.2
xorl %eax, %eax #12.2
# LOE rax rbx r12 r13 r14 r15 xmm0 xmm1
..B1.3: # Preds ..B1.3 ..B1.2
movdqa %xmm0, 4198400(%rsp,%rax,4) #14.11
paddd %xmm1, %xmm0 #14.11
movdqa %xmm0, 4198416(%rsp,%rax,4) #14.11
paddd %xmm1, %xmm0 #14.11
movdqa %xmm0, 4198432(%rsp,%rax,4) #14.11
paddd %xmm1, %xmm0 #14.11
movdqa %xmm0, 4198448(%rsp,%rax,4) #14.11
paddd %xmm1, %xmm0 #14.11
addq $16, %rax #12.2
cmpq $1024, %rax #12.2
jl ..B1.3 # Prob 99% #12.2
# LOE rax rbx r12 r13 r14 r15 xmm0 xmm1
..B1.4: # Preds ..B1.3
lea (%rsp), %rdi #17.11
movl $4194304, %edx #17.11
xorl %esi, %esi #17.11
call _intel_fast_memset #17.11
# LOE rbx r12 r13 r14 r15
..B1.5: # Preds ..B1.4
movl 4194560(%rsp), %esi #20.2
movl 4198712(%rsp), %edx #20.2
movl 1573160(%rsp), %ecx #20.2
movl $_2__STRING.0.0, %edi #20.2
xorl %eax, %eax #20.2
..___tag_value_main.7: #20.2
call printf #20.2
..___tag_value_main.8: #
# LOE rbx r12 r13 r14 r15
..B1.6: # Preds ..B1.5
xorl %eax, %eax #21.9
movq %rbp, %rsp #21.9
popq %rbp #21.9
..___tag_value_main.9: #
ret #21.9
.align 16,0x90
..___tag_value_main.11: #
# LOE
# mark_end;
.type main,@function
.size main,.-main
.data
# -- End main
----

How do I judge which form of the for loop would be better auto-vectorized?

(g) With which "pragma" could the original or the modified for loop above be completely VECTORIZED?

Thanks for kind support.

~BR

Vladimir_T_Intel
Moderator

Hi,

You provide either incorrect or insufficient information. E.g., the asm listings in your previous posts seem the same, although you claim they were generated from completely different loops. For me personally it's hard to answer.

As for question (d): vectorization implies data-level parallelism on a single core, i.e., multiple data elements executed simultaneously in one execution unit. It has nothing to do with multicore.

(f) Options for better vectorization depend on what "better" means for you. For me, it's when the loop body executes in the minimal number of CPU clocks, which results in better program performance. I'd recommend taking a look at the Compiler User and Reference Guide regarding loop vectorization; you can find there great examples of how simple loops can be vectorized and what in particular prevents the compiler from vectorizing loops.

srimks
New Contributor II


Hi Vladimir.

The code above is quite small, for both the original and the modified for loop. I have used ICC v11.0 on Linux.

Out of curiosity, may I suggest you try it at your end and say which loop (OUTER, INNER, or both) can really be vectorized; could you share the asm thereafter?

I would appreciate it.

~BR
Vladimir_T_Intel
Moderator
Which code exactly did you want me to vectorize and share?

jimdempseyatthecove
Honored Contributor III

In examining the ASM code of the first example and comparing it to your C++ code (assuming the supplied ASM was generated from the supplied C++ code), we find:

a) The num[j] = 0 was exported from the loop to precede it and was performed by _intel_fast_memset, which uses SSE3 instructions (vectorization); good when the iteration count is high, which it is.

b) The inner loop zeroing cluster was also exported from the loop, its trip count multiplied by the outer loop's iteration count, and performed by a single call to _intel_fast_memset, which uses SSE3 instructions (vectorization); again good when the iteration count is high, which it is.

c) This leaves the isort[j] = j, and examination of the code shows it was not only vectorized but unrolled 4 times.

Optimization of first example was very good.

Vectorization is limited to a single hardware thread performing small vector operations such as 4 floats, 2 doubles, or 128 bits' worth of byte, short, int32, or int64 integers.

To spread processing across multiple cores you need to add parallel programming to your application. Auto-parallelization is one method; OpenMP is better, and there are other methods such as TBB.
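As a sketch of the OpenMP route Jim mentions (loop body borrowed from the example; compile with the compiler's OpenMP flag, otherwise the pragma is simply ignored and the loop runs serially):

```cpp
const int MAX = 1024;
int num[MAX], isort[MAX];

// Each iteration writes disjoint elements, so the loop is safe to split
// across cores; vectorization can still apply within each thread's chunk.
void fill_parallel() {
    #pragma omp parallel for
    for (int j = 0; j < MAX; j++) {
        num[j] = 0;
        isort[j] = j;
    }
}
```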

Your example program will not run faster on multiple cores, because the loop has virtually no computation and is already saturating the memory bus capacity.

Jim Dempsey

srimks
New Contributor II


I am using the command "icpc test.cc -S", which by default with ICC v11.0 targets the SSE2 generation but not SSE3; please clarify?

You quote: "(c) This leaves the isort[j] = j and examination of the code shows it was not only vectorized but unrolled 4 times." Is this a limit of the unrolling, or could it be unrolled further to 8, 16, etc.?

Both examples above were taken from a section of code in one file of an application with multiple files. That section achieves at least PARTIAL VECTORIZATION when tested separately as a test case, but when I do the same thing on that section within the multi-file build it fails; there could be some other dependency that makes it fail across multiple files. Any suggestions that could speed up performance for multiple files? I added "-O3 -fp-model fast=2 -ffunction-sections -fomit-frame-pointer" to get better speed. These C++ files have many double and float datatypes in the code, so I thought to use -fp-model fast=2 or -fp-model strict for better FP speed.

It's becoming tiresome to perform auto-vectorization on such a large set of C++ files. If at least the above had worked properly, it would have paved the way for the other C++ files to work successfully.

Any clue?
srimks
New Contributor II
Which code exactly did you want me to vectorize and share?

Hello.

I repeat: the code which I wish to see VECTORIZED is the for loop below:

---
#include <cstdio>

#define MAX 1024

int main()
{
int i, j;
int num[MAX], isort[MAX], cluster[MAX][MAX];

for (j = 0; j < MAX; j++) {
    num[j] = 0;
    isort[j] = j;
    for (i = 0; i < MAX; i++) {
        cluster[j][i] = 0;
    }
}
printf("%d %d %d\n", num[64], isort[78], cluster[384][74]);
return 0;
}
---

I'm hoping you can take the above as an example and show how to get the best AV.

Thanks for your time.

~BR
jimdempseyatthecove
Honored Contributor III

As explained in my earlier post, the num[j] = 0 was exported from the loop and a fully vectorized memset to 0 of num[] was performed. The cluster zeroing was exported from the loop and a fully vectorized memset to 0 was performed. This left the isort[j] = j, which was also vectorized.

I suggest you place a printf or some innocuous function call (time()) in front of the loop, set a breakpoint on the innocuous call, step over it, then open a disassembly window and step into the fast memset routine; observe what it is doing. True, this could be inlined. It is relatively easy for you to write your own zero-out routine using small vector type variables (the compiler supports these).
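A minimal version of the hand-written zero-out routine Jim describes, using SSE2 intrinsics (the function name is mine; it assumes a 16-byte-aligned pointer and a count that is a multiple of 4 ints):

```cpp
#include <emmintrin.h>  // SSE2: _mm_store_si128 (movdqa), _mm_setzero_si128
#include <cstddef>

// Cached (temporal) 128-bit stores: four ints per instruction, like the
// movdqa loops in the compiler's asm listings above.
void zero_ints(int* p, std::size_t n) {
    __m128i zero = _mm_setzero_si128();
    for (std::size_t i = 0; i < n; i += 4)
        _mm_store_si128(reinterpret_cast<__m128i*>(p + i), zero);
}
```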

Jim
TimP
Honored Contributor III
I take it you have your own ideas about what constitutes "best AV."
If you wanted the outer-loop assignments vectorized, you could arrange that, e.g. with #pragma distribute point ahead of the inner loop, but that would adversely impact performance and code size.
If those arrays aren't aligned automatically, you could improve it with alignment directives. Then, ideally, it would make no difference if you used #pragma vector aligned.
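TimP's alignment suggestion could look like the following on Linux (the GNU attribute spelling is an assumption; ICC on Windows also accepts __declspec(align(16)), and #pragma vector aligned is ICC-specific, ignored by other compilers):

```cpp
#define MAX 1024
// 16-byte alignment lets the compiler use aligned (movdqa-style) stores.
int num[MAX]   __attribute__((aligned(16)));
int isort[MAX] __attribute__((aligned(16)));

void fill_aligned() {
#pragma vector aligned  // ICC: promise that accesses in this loop are aligned
    for (int j = 0; j < MAX; j++) {
        num[j] = 0;
        isort[j] = j;
    }
}
```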
Vladimir_T_Intel
Moderator

Jim was right; the first example's asm code is nicely vectorized (Jim explained the code in detail, so there is no need to repeat it). But if you make the compiler a bit more aggressive with loops (the O3 option), it will combine the isort[j] = j and num[j] = 0 into one loop again in order to use the register set and the OOO engine more effectively.

In my "best AV" I had to use only one pragma in the code, while setting the /O3 and /QxSSE3 options on my Core2 laptop.
[cpp]#pragma distribute point
	for (j = 0, i = 0; j < MAX; j++) {
		num[j] = 0;
		isort[j] = j;
		for (i = 0; i < MAX; i++) {
			cluster[j][i] = 0;
		}
	}
[/cpp]





As a result, the compiler generated the following code with two loops:


[cpp].B1.1:                          ; Preds .B1.0
$LN1:
        push      ebp                                           ;13.1
        mov       ebp, esp                                      ;13.1
        and       esp, -128                                     ;13.1
        mov       eax, 4202496                                  ;13.1
        call      __chkstk                                      ;13.1
        push      3                                             ;13.1
        call      ___intel_new_proc_init_P                      ;13.1
                                ; LOE ebx esi edi
.B1.9:                          ; Preds .B1.1
$LN3:
        movdqa    xmm2, XMMWORD PTR [_2il0floatpacket.1]        ;26.2
        movdqa    xmm1, XMMWORD PTR [_2il0floatpacket.2]        ;26.2
$LN5:
        add       esp, 4                                        ;13.1
        xor       eax, eax                                      ;
$LN7:
        pxor      xmm0, xmm0                                    ;27.12
$LN9:
        stmxcsr   DWORD PTR [esp]                               ;13.1
        or        DWORD PTR [esp], 32768                        ;13.1
        ldmxcsr   DWORD PTR [esp]                               ;13.1
                                ; LOE eax ebx esi edi xmm0 xmm1 xmm2
.B1.2:                          ; Preds .B1.2 .B1.9
$LN11:
        movdqa    XMMWORD PTR [esp+eax*4], xmm0                 ;27.3
$LN13:
        movdqa    XMMWORD PTR [4096+esp+eax*4], xmm1            ;28.3
$LN15:
        movdqa    XMMWORD PTR [16+esp+eax*4], xmm0              ;27.3
        movdqa    XMMWORD PTR [32+esp+eax*4], xmm0              ;27.3
        movdqa    XMMWORD PTR [48+esp+eax*4], xmm0              ;27.3
$LN17:
        paddd     xmm1, xmm2                                    ;28.3
        movdqa    XMMWORD PTR [4112+esp+eax*4], xmm1            ;28.3
        paddd     xmm1, xmm2                                    ;28.3
        movdqa    XMMWORD PTR [4128+esp+eax*4], xmm1            ;28.3
        paddd     xmm1, xmm2                                    ;28.3
        movdqa    XMMWORD PTR [4144+esp+eax*4], xmm1            ;28.3
        paddd     xmm1, xmm2                                    ;28.3
$LN19:
        add       eax, 16                                       ;26.2
        cmp       eax, 1024                                     ;26.2
        jb        .B1.2         ; Prob 99%                      ;26.2
                                ; LOE eax ebx esi edi xmm0 xmm1 xmm2
.B1.3:                          ; Preds .B1.2
        xor       eax, eax                                      ;
                                ; LOE eax ebx esi edi xmm0
.B1.4:                          ; Preds .B1.4 .B1.3
$LN21:
        movntdq   XMMWORD PTR [8192+esp+eax*4], xmm0            ;31.4
$LN23:
        add       eax, 4                                        ;26.2
        cmp       eax, 1048576                                  ;26.2
        jb        .B1.4         ; Prob 99%                      ;26.2
                                ; LOE eax ebx esi edi xmm0
.B1.5:                          ; Preds .B1.4
$LN25:
        push      DWORD PTR [1581352+esp]                       ;39.2
        push      DWORD PTR [4412+esp]                          ;39.2
        push      DWORD PTR [264+esp]                           ;39.2
        push      OFFSET FLAT: _2__STRING.1.0.1                 ;39.2
        call      DWORD PTR [__imp__printf]                     ;39.2
                                ; LOE ebx esi edi
.B1.10:                         ; Preds .B1.5
        add       esp, 16                                       ;39.2
                                ; LOE ebx esi edi
.B1.6:                          ; Preds .B1.10
$LN27:
        xor       eax, eax                                      ;40.9
        mov       esp, ebp                                      ;40.9
        pop       ebp                                           ;40.9
        ret                                                     ;40.9
        ALIGN     16
                                ; LOE
; mark_end;
[/cpp]

Here B1.2 represents the loop:
[cpp]	for (j = 0, i = 0; j < MAX; j++) {
		num[j] = 0;
		isort[j] = j;
[/cpp]
unrolled by a factor of 4.
And B1.4 is the non-temporal writes of the inner loop (multiplied by the outer loop):
[cpp]		for (i = 0; i < MAX; i++) {
			cluster[j][i] = 0;
[/cpp]


- hard to imagine how to make it faster.



srimks
New Contributor II

Hi Vladimir/Jim/Tim.

Thanks for your kind support & time.

~BR
jimdempseyatthecove
Honored Contributor III

Vladimir,

Apparently the /O3 pulled the out-of-line call to the fast memset-to-0 inline. Good optimization work by the compiler writers. The only questionable addition to aid speed-up might be to interleave the vectorized isort[j] = j with the zeroing of the other arrays (the small array and the front portion of the larger array). This way you would not have to wait for the paddd xmm1, xmm2 to complete before scheduling the write. But then, depending on the internals of the write-to-memory scheduler, it may be more efficient not to interleave. Also, the .B1.4 loop could probably have been unrolled a bit like the top loop's wipe.

Comments on unrolling in general:

If the compiler were to unroll these loops completely, the code would run _slower_. The reason is that the instruction cache would never get re-used, and the memory fetches for instructions would slow down the memory writes.

Jim Dempsey
srimks
New Contributor II

Hello Jim.

Your suggestions are really very helpful. In one of your messages you suggested: "Your example program will not run faster on multiple cores due to the fact that the loop has virtually no computation and is saturating the memory bus capacity."

It's true that BUS UTILIZATION has increased, which in turn increases memory latency, as one can see in the asm below with its many MOV operations.

Could you expand on this for the above for loop: how can one lessen the saturation of memory bus capacity that you observed?

I am really forcing the compiler with the -O3 option; also, it seems "-fp-model fast=2" or "-fp-model strict" spoils performance.

But if I take the original code and compile it with "icpc -fno-builtin test.cpp -S", I observe the following:
---
#define MAX 1024

int main()
{
int i, j;
int num[MAX], isort[MAX], cluster[MAX][MAX];

#pragma distribute point
for (j = 0; j < MAX; j++) {
    num[j] = 0;
    isort[j] = j;
    for (i = 0; i < MAX; i++) {
        cluster[j][i] = 0;
    }
}
printf("%d %d %d\n", num[64], isort[78], cluster[384][74]);
return 0;
}
---

whose asm is -

--
main:
..B1.1: # Preds ..B1.0
..___tag_value_main.1: #7.1
pushq %rbp #7.1
..___tag_value_main.2: #
movq %rsp, %rbp #7.1
..___tag_value_main.3: #
andq $-128, %rsp #7.1
subq $4202496, %rsp #7.1
movl $3, %edi #7.1
..___tag_value_main.5: #7.1
call __intel_new_proc_init #7.1
..___tag_value_main.6: #
# LOE rbx r12 r13 r14 r15
..B1.11: # Preds ..B1.1
stmxcsr (%rsp) #7.1
orl $32832, (%rsp) #7.1
ldmxcsr (%rsp) #7.1
xorl %eax, %eax #12.2
pxor %xmm0, %xmm0 #13.20
# LOE rax rbx r12 r13 r14 r15 xmm0
..B1.2: # Preds ..B1.2 ..B1.11
movdqa %xmm0, (%rsp,%rax,4) #13.11
movdqa %xmm0, 16(%rsp,%rax,4) #13.11
movdqa %xmm0, 32(%rsp,%rax,4) #13.11
movdqa %xmm0, 48(%rsp,%rax,4) #13.11
addq $16, %rax #12.2
cmpq $1024, %rax #12.2
jl ..B1.2 # Prob 99% #12.2
# LOE rax rbx r12 r13 r14 r15 xmm0
..B1.3: # Preds ..B1.2
movdqa _2il0floatpacket.1(%rip), %xmm2 #12.2
movdqa _2il0floatpacket.2(%rip), %xmm1 #12.2
xorl %eax, %eax #12.2
# LOE rax rbx r12 r13 r14 r15 xmm0 xmm1 xmm2
..B1.4: # Preds ..B1.4 ..B1.3
movdqa %xmm1, 4096(%rsp,%rax,4) #14.11
paddd %xmm2, %xmm1 #14.11
movdqa %xmm1, 4112(%rsp,%rax,4) #14.11
paddd %xmm2, %xmm1 #14.11
movdqa %xmm1, 4128(%rsp,%rax,4) #14.11
paddd %xmm2, %xmm1 #14.11
movdqa %xmm1, 4144(%rsp,%rax,4) #14.11
paddd %xmm2, %xmm1 #14.11
addq $16, %rax #12.2
cmpq $1024, %rax #12.2
jl ..B1.4 # Prob 99% #12.2
# LOE rax rbx r12 r13 r14 r15 xmm0 xmm1 xmm2
..B1.5: # Preds ..B1.4
xorl %eax, %eax #12.2
# LOE rax rbx r12 r13 r14 r15 xmm0
..B1.6: # Preds ..B1.6 ..B1.5
movntdq %xmm0, 8192(%rsp,%rax,4) #16.11
addq $4, %rax #12.2
cmpq $1048576, %rax #12.2
jl ..B1.6 # Prob 99% #12.2
# LOE rax rbx r12 r13 r14 r15 xmm0
..B1.7: # Preds ..B1.6
movl 256(%rsp), %esi #19.2
movl 4408(%rsp), %edx #19.2
movl 1581352(%rsp), %ecx #19.2
movl $_2__STRING.0.0, %edi #19.2
xorl %eax, %eax #19.2
..___tag_value_main.7: #19.2
call printf #19.2
--

In this asm, all three loops (num[j] = 0, isort[j] = j, and cluster[j][i] = 0) are partially vectorized. The respective loops behave as follows:

(a) num[j] = 0;

..B1.2: # Preds ..B1.2 ..B1.11
movdqa %xmm0, (%rsp,%rax,4) #13.11
movdqa %xmm0, 16(%rsp,%rax,4) #13.11
movdqa %xmm0, 32(%rsp,%rax,4) #13.11
movdqa %xmm0, 48(%rsp,%rax,4) #13.11
addq $16, %rax #12.2
cmpq $1024, %rax #12.2
jl ..B1.2 # Prob 99% #12.2


(b) isort[j] = j;
..B1.4: # Preds ..B1.4 ..B1.3
movdqa %xmm1, 4096(%rsp,%rax,4) #14.11
paddd %xmm2, %xmm1 #14.11
movdqa %xmm1, 4112(%rsp,%rax,4) #14.11
paddd %xmm2, %xmm1 #14.11
movdqa %xmm1, 4128(%rsp,%rax,4) #14.11
paddd %xmm2, %xmm1 #14.11
movdqa %xmm1, 4144(%rsp,%rax,4) #14.11
paddd %xmm2, %xmm1 #14.11
addq $16, %rax #12.2
cmpq $1024, %rax #12.2
jl ..B1.4 # Prob 99% #12.2

(c) cluster[j][i] = 0;
..B1.6: # Preds ..B1.6 ..B1.5
movntdq %xmm0, 8192(%rsp,%rax,4) #16.11
addq $4, %rax #12.2
cmpq $1048576, %rax #12.2
jl ..B1.6 # Prob 99% #12.2

Query -

(i) Looking at the asm in (c), we have MOVNTDQ. Do you feel the data set here is too large to fit into the cache?

(ii) Normally, use of MOVNTDQ speeds up code, but does the use of MOVNTDQ here actually improve performance?

(iii) Probably (a) & (b), but most often (b), will result in high BUS UTILIZATION. Any suggestions to lessen it?

~BR
0 Kudos
jimdempseyatthecove
Honored Contributor III
874 Views

The sample program you provided has a fixed number of bytes to write. The only way to relieve memory bus saturation is to complete the writes in as few write cycles as possible. The processor (most new processors) can perform write combining. If you can arrange for your arrays to be aligned on 64- or 128-byte boundaries, this plus vector writes will reduce memory bus activity. Without alignment you may not even have 16-byte-aligned data. With 64-byte-aligned data and the four-in-a-row SSE register stores, memory alignment is favorable. As to the number of memory bus cycles this will take... that depends on the processor architecture. In your store of 0 and store of j, the fp-model option will not be a factor.
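To illustrate the alignment point: here is a minimal sketch (not Jim's exact code) of the original program's loop nest with the arrays forced onto 64-byte cache-line boundaries, so the compiler can emit aligned vector stores and the hardware can write-combine full lines. The GCC-style attribute is used for brevity; the names and layout are just illustrative.

```cpp
// Sketch, assuming a GCC-compatible compiler (icpc on Linux accepts this
// attribute; __declspec(align(64)) is the Windows-style equivalent).
#define MAX 1024

// Each array starts exactly on a 64-byte cache-line boundary.
__attribute__((aligned(64))) int num[MAX];
__attribute__((aligned(64))) int isort[MAX];
__attribute__((aligned(64))) int cluster[MAX][MAX];

void init_arrays(void) {
    for (int j = 0; j < MAX; j++) {
        num[j] = 0;
        isort[j] = j;
        for (int i = 0; i < MAX; i++)
            cluster[j][i] = 0;   // row start is also 64-byte aligned
    }
}
```

With the arrays aligned like this, every vector store lands on a 16-byte (indeed 64-byte) boundary, which is the precondition for the movdqa/movntdq stores in the listings above.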

Jim Dempsey
0 Kudos
srimks
New Contributor II
874 Views

Jim.

Could you check (i), (ii) & (iii)? Looking forward to your valuable insights, as before.

I am doing all these simply to learn & explore.

TX.

~BR
0 Kudos
jimdempseyatthecove
Honored Contributor III
878 Views

BR,

Thanks for your confidence in my insights. Remember that these are insights, not specific declarations of fact for a particular processor architecture.

(i) movntdq is advantageous under some circumstances and not under others. In the case where one thread wipes a buffer and another thread (not sharing the same cache) then uses the buffer for Read or Read/Modify/Write, you avoid depleting the wiping thread's cache (and that of the threads sharing its cache), so this would be a good use of movntdq. It also has benefits when the wiping thread (or its cache-sharing partner) uses the buffer some time later. However, if the wiping thread (or its cache-sharing partner) uses the buffer immediately, it would be disadvantageous to use movntdq.

(ii) Without knowing how/when/by whom the buffer is used after the wipe/initialization, it would be premature to declare the use of movntdq good or bad. See (i) for additional comments on this.

(iii) What you are (or should be) most interested in is best memory bus utilization. High bus utilization is a good thing when the bus is used effectively, meaning memory does not stall waiting for the next write and the fewest possible writes are performed to complete the task. (a) will be/(should be) capable of write combining and thus provide very effective bus utilization. (b), on the other hand, has to defer each movdqa until the preceding paddd completes. Depending on the pipeline of the CPU/hardware architecture, this may or may not interfere with write combining and/or stall the memory bus. If this were a problem, it could be fixed in the compiler by using 3 more xmm registers and performing all 4 paddd's after all the movdqa's.
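The register-scheduling fix described in (iii) can be sketched with SSE2 intrinsics. This is a hypothetical hand-written version, not the compiler's actual output: four xmm registers each carry an independent run of indices, so all four 16-byte stores can issue back-to-back without waiting on a fresh paddd.

```cpp
#include <emmintrin.h>  // SSE2: _mm_set_epi32, _mm_add_epi32, _mm_store_si128

// Demo buffer; _mm_store_si128 requires 16-byte alignment.
__attribute__((aligned(64))) int isort_demo[1024];

// Fill isort[i] = i using 4 independent vector counters (n % 16 == 0).
void fill_isort(int *isort, int n) {
    __m128i v0 = _mm_set_epi32(3, 2, 1, 0);     // lanes hold 0,1,2,3
    __m128i v1 = _mm_set_epi32(7, 6, 5, 4);
    __m128i v2 = _mm_set_epi32(11, 10, 9, 8);
    __m128i v3 = _mm_set_epi32(15, 14, 13, 12);
    const __m128i step = _mm_set1_epi32(16);    // each counter advances by 16
    for (int i = 0; i < n; i += 16) {
        // four back-to-back stores, none waiting on a paddd from this iteration
        _mm_store_si128((__m128i *)(isort + i),      v0);
        _mm_store_si128((__m128i *)(isort + i + 4),  v1);
        _mm_store_si128((__m128i *)(isort + i + 8),  v2);
        _mm_store_si128((__m128i *)(isort + i + 12), v3);
        // the adds only feed the *next* iteration, so they hide behind the stores
        v0 = _mm_add_epi32(v0, step);
        v1 = _mm_add_epi32(v1, step);
        v2 = _mm_add_epi32(v2, step);
        v3 = _mm_add_epi32(v3, step);
    }
}
```

Whether this beats the compiler's paddd/movdqa interleaving depends on the store buffer and write-combining behavior of the particular CPU, as Jim notes.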

Jim Dempsey


0 Kudos
TimP
Honored Contributor III
764 Views
The choice of non-temporal store, when made automatically, is based on an estimate of the data size. It is clearly beneficial to use non-temporal stores if the next use will be at the beginning of the data stream, but the stream is so big that the beginning would have to be evicted from cache.
Only the latest CPUs perform backward copy efficiently enough that such a strategy might be considered, so as to leave the end of the stream (which will be needed next) in cache. I don't know any way to implement it other than writing the loop with in-line intrinsics.
memmove(), if optimized for the newer CPUs, uses a backward move when necessary to deal with source/destination overlap. It should also choose non-temporal stores according to the stream length when there is no overlap.
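For reference, an explicit non-temporal wipe with intrinsics looks like the following. This is a sketch of what the compiler generated for the cluster[][] loop (movntdq), not library code; the function name and buffer are illustrative. It only pays off when the buffer is too large for cache or will next be touched by another core.

```cpp
#include <emmintrin.h>  // SSE2: _mm_stream_si128 (movntdq), _mm_sfence

// Demo buffer; streaming stores require 16-byte alignment.
__attribute__((aligned(64))) int big_buf[4096];

// Zero a buffer with cache-bypassing stores (n % 4 == 0).
void stream_zero(int *buf, int n) {
    const __m128i zero = _mm_setzero_si128();
    for (int i = 0; i < n; i += 4)
        _mm_stream_si128((__m128i *)(buf + i), zero);  // movntdq: no cache fill
    _mm_sfence();  // make the streamed stores visible before subsequent accesses
}
```

The sfence at the end matters: non-temporal stores are weakly ordered, so without it a consumer on another core could observe stale data.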
0 Kudos