(a)
I applied this pragma to the code below:
#pragma distribute point
for (j = 0, i = 0; j < MAX; j++) {
    apple
    orange
    fruits
}
The for loop above starts at line 149 of X.cc, and I get these messages:
X.cc(149): (col. 5) remark: LOOP WAS VECTORIZED.
X.cc(149): (col. 5) remark: loop was not vectorized: vectorization possible but seems inefficient.
Query: Why does it first say "LOOP WAS VECTORIZED" but then, for the same line, say "loop was not vectorized: vectorization possible but seems inefficient"?
(b)
Also, for the part of the code below:
--
for (ia=0; (ia
}
--
I get the message:
X.cc(266): (col. 13) remark: loop was not vectorized: nonstandard loop is not a vectorization candidate.
Query: I am not able to interpret this message, specifically "nonstandard loop is not a vectorization candidate".
--
I am using ICC v11.0 on 64-bit Linux (x86_64) with C++ code.
~BR
#pragma distribute point
for (j = 0, i = 0; j < MAX; j++) {
apple
orange
fruits
}
X.cc(149): (col. 5) remark: LOOP WAS VECTORIZED.
X.cc(149): (col. 5) remark: loop was not vectorized: vectorization possible but seems inefficient.
(b)
Also, for below part of code -
--
for (ia=0; (ia
}
X.cc(266): (col. 13) remark: loop was not vectorized: nonstandard loop is not a vectorization candidate.
I suppose your first loop is "distributed" (split) into a vectorizable (stride-1) loop and a non-vectorizable one (I would guess variable stride, but you don't show it). #pragma distribute point has no effect where you put it, outside the loop. If you put it inside, at the top of the loop body, so as to stop distribution, you would presumably prevent even the partial vectorization.
For the second loop, evidently little can be said without seeing the macro or function. If it is not inlined, there is no chance of vectorization.
b) Looks like a non-library, non-inlined function call within the loop body.
Your fruits loop could likely be split up.
Two or three loops might work better.
Jim
b) Looks like non-library/non-inlined function call within loop body.
The original code was:
---
#include <stdio.h>
#include <stdlib.h>
#define MAX 1024
int main()
{
    int i, j;
    int num[MAX], isort[MAX], cluster[MAX][MAX];
    for (j = 0; j < MAX; j++) {
        num[j] = 0;
        isort[j] = j;
        for (i = 0; i < MAX; i++) {
            cluster[j][i] = 0;
        }
    }
    printf("%d %d %d\n", num[64], isort[78], cluster[384][74]);
    return 0;
}
---
which doesn't vectorize with the command "icpc test.cpp -S".
When this code has "#pragma distribute point" on the outer for loop, its asm is:
--
main:
..B1.1: # Preds ..B1.0
..___tag_value_main.1: #7.1
pushq %rbp #7.1
..___tag_value_main.2: #
movq %rsp, %rbp #7.1
..___tag_value_main.3: #
andq $-128, %rsp #7.1
subq $4202496, %rsp #7.1
movl $3, %edi #7.1
..___tag_value_main.5: #7.1
call __intel_new_proc_init #7.1
..___tag_value_main.6: #
# LOE rbx r12 r13 r14 r15
..B1.9: # Preds ..B1.1
stmxcsr (%rsp) #7.1
orl $32832, (%rsp) #7.1
ldmxcsr (%rsp) #7.1
lea 4194304(%rsp), %rdi #13.11
xorl %esi, %esi #13.11
movl $4096, %edx #13.11
call _intel_fast_memset #13.11
# LOE rbx r12 r13 r14 r15
..B1.2: # Preds ..B1.9
movdqa _2il0floatpacket.1(%rip), %xmm1 #12.2
movdqa _2il0floatpacket.2(%rip), %xmm0 #12.2
xorl %eax, %eax #12.2
# LOE rax rbx r12 r13 r14 r15 xmm0 xmm1
..B1.3: # Preds ..B1.3 ..B1.2
movdqa %xmm0, 4198400(%rsp,%rax,4) #14.11
paddd %xmm1, %xmm0 #14.11
movdqa %xmm0, 4198416(%rsp,%rax,4) #14.11
paddd %xmm1, %xmm0 #14.11
movdqa %xmm0, 4198432(%rsp,%rax,4) #14.11
paddd %xmm1, %xmm0 #14.11
movdqa %xmm0, 4198448(%rsp,%rax,4) #14.11
paddd %xmm1, %xmm0 #14.11
addq $16, %rax #12.2
cmpq $1024, %rax #12.2
jl ..B1.3 # Prob 99% #12.2
# LOE rax rbx r12 r13 r14 r15 xmm0 xmm1
..B1.4: # Preds ..B1.3
lea (%rsp), %rdi #17.11
movl $4194304, %edx #17.11
xorl %esi, %esi #17.11
call _intel_fast_memset #17.11
# LOE rbx r12 r13 r14 r15
..B1.5: # Preds ..B1.4
movl 4194560(%rsp), %esi #20.2
movl 4198712(%rsp), %edx #20.2
movl 1573160(%rsp), %ecx #20.2
movl $_2__STRING.0.0, %edi #20.2
xorl %eax, %eax #20.2
..___tag_value_main.7: #20.2
call printf #20.2
..___tag_value_main.8: #
# LOE rbx r12 r13 r14 r15
..B1.6: # Preds ..B1.5
xorl %eax, %eax #21.9
movq %rbp, %rsp #21.9
popq %rbp #21.9
..___tag_value_main.9: #
ret #21.9
.align 16,0x90
..___tag_value_main.11: #
# LOE
# mark_end;
.type main,@function
.size main,.-main
.data
# -- End main
----
Query
(c) By observing the asm above, how do I know whether the OUTER for loop or the INNER for loop has been PARTIALLY vectorized?
(d) I understand "movdqa followed by paddd": the data is moved, then the packed data are added. But apart from this use of vector registers, we only see 4 vectorized move and 4 vectorized add operations in sequence in the asm above. These operations are performed on a Quad-Core 5300 series machine (i.e., 2 dies with 4 cores on each die, for a total of 8 cores on a single node). The query is: does this vectorized movdqa-followed-by-paddd operation happen on one core, on 4 cores, or on all 8 cores of the node?
(e) But when I use "#pragma distribute point" only on the INNER for loop, the asm is the same as above, including its size. Any insights?
(f) If the complete for loop above is modified, with "#pragma distribute point", as below:
#pragma distribute point
for (j = 0, i = 0; j < MAX; j++) {
    num[j] = 0;
    isort[j] = j;
    cluster[j][i] = 0;
}
it is PARTIALLY VECTORIZED with the asm below:
--
main:
..B1.1: # Preds ..B1.0
..___tag_value_main.1: #7.1
pushq %rbp #7.1
..___tag_value_main.2: #
movq %rsp, %rbp #7.1
..___tag_value_main.3: #
andq $-128, %rsp #7.1
subq $4202496, %rsp #7.1
movl $3, %edi #7.1
..___tag_value_main.5: #7.1
call __intel_new_proc_init #7.1
..___tag_value_main.6: #
# LOE rbx r12 r13 r14 r15
..B1.9: # Preds ..B1.1
stmxcsr (%rsp) #7.1
orl $32832, (%rsp) #7.1
ldmxcsr (%rsp) #7.1
lea 4194304(%rsp), %rdi #13.11
xorl %esi, %esi #13.11
movl $4096, %edx #13.11
call _intel_fast_memset #13.11
# LOE rbx r12 r13 r14 r15
..B1.2: # Preds ..B1.9
movdqa _2il0floatpacket.1(%rip), %xmm1 #12.2
movdqa _2il0floatpacket.2(%rip), %xmm0 #12.2
xorl %eax, %eax #12.2
# LOE rax rbx r12 r13 r14 r15 xmm0 xmm1
..B1.3: # Preds ..B1.3 ..B1.2
movdqa %xmm0, 4198400(%rsp,%rax,4) #14.11
paddd %xmm1, %xmm0 #14.11
movdqa %xmm0, 4198416(%rsp,%rax,4) #14.11
paddd %xmm1, %xmm0 #14.11
movdqa %xmm0, 4198432(%rsp,%rax,4) #14.11
paddd %xmm1, %xmm0 #14.11
movdqa %xmm0, 4198448(%rsp,%rax,4) #14.11
paddd %xmm1, %xmm0 #14.11
addq $16, %rax #12.2
cmpq $1024, %rax #12.2
jl ..B1.3 # Prob 99% #12.2
# LOE rax rbx r12 r13 r14 r15 xmm0 xmm1
..B1.4: # Preds ..B1.3
lea (%rsp), %rdi #17.11
movl $4194304, %edx #17.11
xorl %esi, %esi #17.11
call _intel_fast_memset #17.11
# LOE rbx r12 r13 r14 r15
..B1.5: # Preds ..B1.4
movl 4194560(%rsp), %esi #20.2
movl 4198712(%rsp), %edx #20.2
movl 1573160(%rsp), %ecx #20.2
movl $_2__STRING.0.0, %edi #20.2
xorl %eax, %eax #20.2
..___tag_value_main.7: #20.2
call printf #20.2
..___tag_value_main.8: #
# LOE rbx r12 r13 r14 r15
..B1.6: # Preds ..B1.5
xorl %eax, %eax #21.9
movq %rbp, %rsp #21.9
popq %rbp #21.9
..___tag_value_main.9: #
ret #21.9
.align 16,0x90
..___tag_value_main.11: #
# LOE
# mark_end;
.type main,@function
.size main,.-main
.data
# -- End main
----
How do I judge which form of the for loop would be better auto-vectorized?
(g) With which "pragma" could the original or the modified for loop above be completely vectorized?
Thanks for the kind support.
~BR
Hi,
You have provided either incorrect or insufficient information. E.g., the asm listings in your previous post appear to be the same, although you claim they were generated from completely different loops. That makes it hard for me to answer.
As for question (d): vectorization means data-level parallelism on a single core, i.e., multiple data elements processed simultaneously in one execution unit. It has nothing to do with multiple cores.
(f) Which options give better vectorization depends on what "better" means to you. For me it is when the loop body executes in the minimal number of CPU clocks, which results in better program performance. I'd recommend taking a look at the Compiler User and Reference Guide on loop vectorization — you can find great examples there of how simple loops can be vectorized and of what, specifically, prevents the compiler from vectorizing a loop.
The code above seems very small, for both the original for loop and the modified one. I have used ICC v11.0 on Linux.
Out of curiosity: could you try this at your end and advise which loop (OUTER, INNER, or both) can really be vectorized, and share the resulting asm?
I would appreciate it.
~BR
Examining the first example's asm code and comparing it to your C++ code (assuming the supplied asm was generated from the supplied C++ code), we find:
a) The num[j] = 0 assignments were hoisted out of the loop and performed by a single call to _intel_fast_memset.
b) The inner loop zeroing cluster was also hoisted out of the loop, its count multiplied by the outer loop iteration count, and performed by a single call to _intel_fast_memset, which uses SSE3 instructions (vectorization) — good when the iteration count is high, and it is.
c) This leaves the isort[j] = j assignments, which were vectorized in the movdqa/paddd loop.
Optimization of the first example was very good.
Vectorization is limited to a single hardware thread performing small vector operations, such as 4 floats, 2 doubles, or 128 bits' worth of bytes, shorts, int32 or int64 integers.
To spread processing across multiple cores you need to add parallel programming to your application. Auto-parallelization is one method; OpenMP is better; and there are other methods such as TBB.
Your example program will not run faster on multiple cores because the loop has virtually no computation and is already saturating the memory bus.
Jim Dempsey
You quote: "(c) This leaves the isort[j] = j assignments, which were vectorized."
Both of the examples above were taken from a section of code in one file of an application that has multiple files. That section at least PARTIALLY VECTORIZES when tested separately as a test case, but when I do the same thing on that section within the multi-file application it fails — there could be some other dependency that makes it fail across multiple files. Any suggestions that could speed up performance for the multi-file case? I added "-O3 -fp-model fast=2 -ffunction-sections -fomit-frame-pointer" for better speed. These C++ files use many double and float data types, so I thought of using -fp-model fast=2 or -fp-model strict for better floating-point speed.
It's becoming tiresome to perform auto-vectorization on such a large set of C++ files. If at least the code above had worked properly, it would have paved the way for the other C++ files to work successfully.
Any clue?
I repeat the code that I wish to have VECTORIZED — the for loop below:
---
#include <stdio.h>
#include <stdlib.h>
#define MAX 1024
int main()
{
    int i, j;
    int num[MAX], isort[MAX], cluster[MAX][MAX];
    for (j = 0; j < MAX; j++) {
        num[j] = 0;
        isort[j] = j;
        for (i = 0; i < MAX; i++) {
            cluster[j][i] = 0;
        }
    }
    printf("%d %d %d\n", num[64], isort[78], cluster[384][74]);
    return 0;
}
---
I am hoping you can use the above as an example of achieving the best auto-vectorization.
Thanks for your time.
~BR
As explained in my earlier post, the num[j] and cluster zeroing is already handled by the vectorized _intel_fast_memset.
I suggest you place a printf or some innocuous function call (e.g., time()) in front of the loop, set a breakpoint on the innocuous call, step over it, then open a disassembly window and step into the fast memset routine; observe what it is doing. True, this could be inlined. It is relatively easy to write your own zero-out routine using the small vector type variables the compiler supports.
Jim
If you wanted the outer-loop assignments vectorized, you could arrange that, e.g. with #pragma distribute point ahead of the inner loop, but that would hurt both performance and code size.
If those arrays aren't aligned automatically, you could improve things with alignment directives. Then, ideally, it would make no difference whether you also used #pragma vector aligned.
Jim was right; in the first example the asm code is nicely vectorized. (Jim explained the code in detail, so there is no need to repeat it.) But if you make the compiler a bit more aggressive with loops (the -O3 option), it will combine the isort[j] = j and num[j] = 0 loops and use non-temporal stores for cluster. I compiled:
[cpp]
#pragma distribute point
for (j = 0, i = 0; j < MAX; j++) {
    num[j] = 0;
    isort[j] = j;
    for (i = 0; i < MAX; i++) {
        cluster[j][i] = 0;
    }
}
[/cpp]
As result the compiler generated the following code with two loops:
[cpp]
.B1.1:                         ; Preds .B1.0
$LN1:
        push      ebp                    ;13.1
        mov       ebp, esp               ;13.1
        and       esp, -128              ;13.1
        mov       eax, 4202496           ;13.1
        call      __chkstk               ;13.1
        push      3                      ;13.1
        call      ___intel_new_proc_init_P ;13.1
                                         ; LOE ebx esi edi
.B1.9:                         ; Preds .B1.1
$LN3:
        movdqa    xmm2, XMMWORD PTR [_2il0floatpacket.1] ;26.2
        movdqa    xmm1, XMMWORD PTR [_2il0floatpacket.2] ;26.2
$LN5:
        add       esp, 4                 ;13.1
        xor       eax, eax               ;
$LN7:
        pxor      xmm0, xmm0             ;27.12
$LN9:
        stmxcsr   DWORD PTR [esp]        ;13.1
        or        DWORD PTR [esp], 32768 ;13.1
        ldmxcsr   DWORD PTR [esp]        ;13.1
                                         ; LOE eax ebx esi edi xmm0 xmm1 xmm2
.B1.2:                         ; Preds .B1.2 .B1.9
$LN11:
        movdqa    XMMWORD PTR [esp+eax*4], xmm0      ;27.3
$LN13:
        movdqa    XMMWORD PTR [4096+esp+eax*4], xmm1 ;28.3
$LN15:
        movdqa    XMMWORD PTR [16+esp+eax*4], xmm0   ;27.3
        movdqa    XMMWORD PTR [32+esp+eax*4], xmm0   ;27.3
        movdqa    XMMWORD PTR [48+esp+eax*4], xmm0   ;27.3
$LN17:
        paddd     xmm1, xmm2             ;28.3
        movdqa    XMMWORD PTR [4112+esp+eax*4], xmm1 ;28.3
        paddd     xmm1, xmm2             ;28.3
        movdqa    XMMWORD PTR [4128+esp+eax*4], xmm1 ;28.3
        paddd     xmm1, xmm2             ;28.3
        movdqa    XMMWORD PTR [4144+esp+eax*4], xmm1 ;28.3
        paddd     xmm1, xmm2             ;28.3
$LN19:
        add       eax, 16                ;26.2
        cmp       eax, 1024              ;26.2
        jb        .B1.2                  ; Prob 99% ;26.2
                                         ; LOE eax ebx esi edi xmm0 xmm1 xmm2
.B1.3:                         ; Preds .B1.2
        xor       eax, eax               ;
                                         ; LOE eax ebx esi edi xmm0
.B1.4:                         ; Preds .B1.4 .B1.3
$LN21:
        movntdq   XMMWORD PTR [8192+esp+eax*4], xmm0 ;31.4
$LN23:
        add       eax, 4                 ;26.2
        cmp       eax, 1048576           ;26.2
        jb        .B1.4                  ; Prob 99% ;26.2
                                         ; LOE eax ebx esi edi xmm0
.B1.5:                         ; Preds .B1.4
$LN25:
        push      DWORD PTR [1581352+esp] ;39.2
        push      DWORD PTR [4412+esp]   ;39.2
        push      DWORD PTR [264+esp]    ;39.2
        push      OFFSET FLAT: _2__STRING.1.0.1 ;39.2
        call      DWORD PTR [__imp__printf] ;39.2
                                         ; LOE ebx esi edi
.B1.10:                        ; Preds .B1.5
        add       esp, 16                ;39.2
                                         ; LOE ebx esi edi
.B1.6:                         ; Preds .B1.10
$LN27:
        xor       eax, eax               ;40.9
        mov       esp, ebp               ;40.9
        pop       ebp                    ;40.9
        ret                              ;40.9
        ALIGN     16
                                         ; LOE
; mark_end;
[/cpp]
Here B1.2 represents the loop:
[cpp]
for (j = 0, i = 0; j < MAX; j++) {
    num[j] = 0;
    isort[j] = j;
[/cpp]
unrolled with a factor of 4.
And B1.4 is the non-temporal writes of the inner loop (multiplied by the outer loop):
[cpp]
for (i = 0; i < MAX; i++) {
    cluster[j][i] = 0;
}
[/cpp]
It's hard to imagine how to make it faster.
Hi Vladimir/Jim/Tim.
Thanks for your kind support & time.
~BR
Vladimir,
Apparently /O3 pulled the out-of-line call to the fast memset-to-zero inline. Good optimization work by the compiler writers. The only questionable addition to aid speed-up might be the interleaving of the vectorized isort[j] = j stores with the num[j] = 0 stores.
Comments on unrolling in general:
If the compiler were to unroll these loops completely, the code would run _slower_. The reason is that the instruction cache would never get re-used, and the memory fetches for instructions would slow down the memory writes.
Jim Dempsey
Your suggestions are really very helpful. In one of your messages you suggested: "Your example program will not run faster on multiple cores due to the fact that the loop has virtually no computation and is saturating the memory bus capacity."
It's true that bus utilization has increased, which in turn increases memory latency, as one can see from the many MOV operations in the asm below.
Could you expand on the for loop above: how can one lessen the saturation of memory bus capacity that you observed?
I am really forcing the compiler with the -O3 option; it also seems that "-fp-model fast=2" or "-fp-model strict" spoils performance.
But if I take the original code and compile it with "icpc -fno-builtin test.cpp -S", I observe the following:
---
#define MAX 1024
int main()
{
    int i, j;
    int num[MAX], isort[MAX], cluster[MAX][MAX];
#pragma distribute point
    for (j = 0; j < MAX; j++) {
        num[j] = 0;
        isort[j] = j;
        for (i = 0; i < MAX; i++) {
            cluster[j][i] = 0;
        }
    }
    printf("%d %d %d\n", num[64], isort[78], cluster[384][74]);
    return 0;
}
---
whose asm is -
--
main:
..B1.1: # Preds ..B1.0
..___tag_value_main.1: #7.1
pushq %rbp #7.1
..___tag_value_main.2: #
movq %rsp, %rbp #7.1
..___tag_value_main.3: #
andq $-128, %rsp #7.1
subq $4202496, %rsp #7.1
movl $3, %edi #7.1
..___tag_value_main.5: #7.1
call __intel_new_proc_init #7.1
..___tag_value_main.6: #
# LOE rbx r12 r13 r14 r15
..B1.11: # Preds ..B1.1
stmxcsr (%rsp) #7.1
orl $32832, (%rsp) #7.1
ldmxcsr (%rsp) #7.1
xorl %eax, %eax #12.2
pxor %xmm0, %xmm0 #13.20
# LOE rax rbx r12 r13 r14 r15 xmm0
..B1.2: # Preds ..B1.2 ..B1.11
movdqa %xmm0, (%rsp,%rax,4) #13.11
movdqa %xmm0, 16(%rsp,%rax,4) #13.11
movdqa %xmm0, 32(%rsp,%rax,4) #13.11
movdqa %xmm0, 48(%rsp,%rax,4) #13.11
addq $16, %rax #12.2
cmpq $1024, %rax #12.2
jl ..B1.2 # Prob 99% #12.2
# LOE rax rbx r12 r13 r14 r15 xmm0
..B1.3: # Preds ..B1.2
movdqa _2il0floatpacket.1(%rip), %xmm2 #12.2
movdqa _2il0floatpacket.2(%rip), %xmm1 #12.2
xorl %eax, %eax #12.2
# LOE rax rbx r12 r13 r14 r15 xmm0 xmm1 xmm2
..B1.4: # Preds ..B1.4 ..B1.3
movdqa %xmm1, 4096(%rsp,%rax,4) #14.11
paddd %xmm2, %xmm1 #14.11
movdqa %xmm1, 4112(%rsp,%rax,4) #14.11
paddd %xmm2, %xmm1 #14.11
movdqa %xmm1, 4128(%rsp,%rax,4) #14.11
paddd %xmm2, %xmm1 #14.11
movdqa %xmm1, 4144(%rsp,%rax,4) #14.11
paddd %xmm2, %xmm1 #14.11
addq $16, %rax #12.2
cmpq $1024, %rax #12.2
jl ..B1.4 # Prob 99% #12.2
# LOE rax rbx r12 r13 r14 r15 xmm0 xmm1 xmm2
..B1.5: # Preds ..B1.4
xorl %eax, %eax #12.2
# LOE rax rbx r12 r13 r14 r15 xmm0
..B1.6: # Preds ..B1.6 ..B1.5
movntdq %xmm0, 8192(%rsp,%rax,4) #16.11
addq $4, %rax #12.2
cmpq $1048576, %rax #12.2
jl ..B1.6 # Prob 99% #12.2
# LOE rax rbx r12 r13 r14 r15 xmm0
..B1.7: # Preds ..B1.6
movl 256(%rsp), %esi #19.2
movl 4408(%rsp), %edx #19.2
movl 1581352(%rsp), %ecx #19.2
movl $_2__STRING.0.0, %edi #19.2
xorl %eax, %eax #19.2
..___tag_value_main.7: #19.2
call printf #19.2
--
Here in this asm, all three assignments (num[j], isort[j], cluster[j][i]) have been vectorized separately:
(a) num[j] = 0:
..B1.2: # Preds ..B1.2 ..B1.11
movdqa %xmm0, (%rsp,%rax,4) #13.11
movdqa %xmm0, 16(%rsp,%rax,4) #13.11
movdqa %xmm0, 32(%rsp,%rax,4) #13.11
movdqa %xmm0, 48(%rsp,%rax,4) #13.11
addq $16, %rax #12.2
cmpq $1024, %rax #12.2
jl ..B1.2 # Prob 99% #12.2
(b) isort[j] = j:
..B1.4: # Preds ..B1.4 ..B1.3
movdqa %xmm1, 4096(%rsp,%rax,4) #14.11
paddd %xmm2, %xmm1 #14.11
movdqa %xmm1, 4112(%rsp,%rax,4) #14.11
paddd %xmm2, %xmm1 #14.11
movdqa %xmm1, 4128(%rsp,%rax,4) #14.11
paddd %xmm2, %xmm1 #14.11
movdqa %xmm1, 4144(%rsp,%rax,4) #14.11
paddd %xmm2, %xmm1 #14.11
addq $16, %rax #12.2
cmpq $1024, %rax #12.2
jl ..B1.4 # Prob 99% #12.2
(c) cluster[j][i] = 0:
..B1.6: # Preds ..B1.6 ..B1.5
movntdq %xmm0, 8192(%rsp,%rax,4) #16.11
addq $4, %rax #12.2
cmpq $1048576, %rax #12.2
jl ..B1.6 # Prob 99% #12.2
Query -
(i) If you check the asm in (c), we have MOVNTDQ. Do you feel the data set is too large to fit into the cache?
(ii) Normally the use of MOVNTDQ speeds up code, but does the use of MOVNTDQ here actually speed up performance?
(iii) Probably (a) and (b), but most often (b), will result in high BUS UTILIZATION. Any suggestions to lessen it?
~BR
The sample program you provided has a fixed number of bytes to write. The only way to relieve memory bus saturation is to write that number of bytes in as few write cycles as possible. The processor (most new processors) can perform write combining. If you can arrange for your arrays to be aligned on 64- or 128-byte boundaries, this, together with vector writes, will reduce memory bus activity. Without alignment directives you may not even have 16-byte-aligned data. With 64-byte-aligned data and the 4-in-a-row stores from the SSE3 register, memory alignment is favorable. As to the number of memory bus cycles this will take — that depends on the processor architecture. In your store of 0 and store of j, the fp-model option will not be a factor.
Jim Dempsey
Could you check (i), (ii) & (iii)? Looking for your valuable insights, as before.
I am doing all this simply to learn and explore.
TX.
~BR
BR,
Thanks for your confidence in my insights. Remember that these are insights, not specific declarations of fact for a particular processor architecture.
(i) The movntdq is advantageous under some circumstances and not under others. In the case where one thread wipes a buffer and another thread (not sharing the same cache) uses the buffer for read or read/modify/write, you avoid depleting the wiping thread's cache (and that of the threads sharing its cache), and this would be a good use of movntdq. It also has benefits when the wiping thread (or its cache-sharing partner) uses the buffer some time later. However, if the wiping thread (or its cache-sharing partner) uses the buffer immediately, then using movntdq would be disadvantageous.
(ii) Without knowing how/when/by whom the buffer is used after the wipe/initialization, it would be premature to declare the use of movntdq good or bad. See (i) for additional comments on this.
(iii) What you are (or should be) most interested in is the best memory bus utilization. High bus utilization is a good thing when the bus is used effectively — effective meaning memory does not stall waiting for the next write, and the fewest possible writes are performed to complete the task. (a) will be (should be) capable of write combining and thus provide very effective bus utilization; (b), on the other hand, has to defer each store until the preceding paddd completes. Depending on the pipeline in the CPU hardware architecture, this may or may not interfere with write combining and/or stall the memory bus. If this were a problem, it could be fixed in the compiler by using 3 more xmm registers and performing all 4 paddd's after all the movdqa's.
Jim Dempsey
Only the latest CPUs perform backward copy efficiently enough that such a strategy might be considered, so as to leave the end of the stream — which will be needed next — in cache. I don't know any way to implement it other than writing the loop with inline intrinsics.
memmove(), if optimized for the newer CPUs, will use a backward move when necessary to deal with source-destination overlap. It should also choose non-temporal stores according to the stream length when there is no overlap.