Dear Intel developers,
I'm using AVX with the Intel 15.0.1 compiler. I need to load some floats shifted by one element in order to do some AVX operations:
sample[j] = traces.r[j] + traces.r[j+1];
At the moment, I do two distinct AVX loads:
intr_1 = _mm256_load_ps(&traces.r[j]);
intr_2 = _mm256_load_ps(&traces.r[j+1]);
In this way, I compute sample in two distinct vectorized phases, the first part for j and the second part for j+1. It works, but it is quite slow. In fact, with the second load I reload seven elements I already have and fetch just one new element. So maybe I can do a single load, then a left shift, and finally a second load of one element. What is the best strategy? Thanks in advance.
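A rough sketch of what I have in mind (untested; the in-register shift assumes AVX2 is available, which I believe it is on my target), compared with simply issuing a second unaligned load:

#include <immintrin.h>

/* Sketch only: two ways to obtain traces.r[j+1 .. j+8] once
   traces.r[j .. j+7] is already in a register (assumes j+8 in bounds). */
__m256 intr_1 = _mm256_loadu_ps(&traces.r[j]);       /* j .. j+7 */

/* (a) plain second unaligned load */
__m256 intr_2 = _mm256_loadu_ps(&traces.r[j+1]);     /* j+1 .. j+8 */

/* (b) AVX2 in-register shift: rotate the lanes left by one,
   then replace the top lane with the single new element */
__m256i rot = _mm256_setr_epi32(1, 2, 3, 4, 5, 6, 7, 0);
__m256  sh  = _mm256_permutevar8x32_ps(intr_1, rot); /* j+1..j+7, j */
intr_2 = _mm256_blend_ps(sh, _mm256_set1_ps(traces.r[j+8]), 0x80);

I do not know whether the shuffle variant (b) can actually beat the second load (a), hence the question.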
jimdempseyatthecove wrote:
FWIW, the IA32 instruction set (as well as Intel64) has SIB (Scale, Index, Base) addressing.
This means traces_rp[k+1] requires only one instruction to fetch, to store, or to use as an operand of an add/sub/div instruction.
In the above, the instruction would contain "0x04(%rBase,%rIndex,4)" as either the source or the destination operand.
By using pointer++ (4 times), you are complicating the issue.
Jim Dempsey
Hi Jim,
using your code, the application obtained no further performance gain. For reference, the change was roughly the one sketched below.
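Both forms here are sketches with shortened loop bodies, not the exact project code:

/* pointer-increment form (what I had): four pointers advance every
   iteration, each one consuming its own induction register */
for (k = 0; k < i_iter; k++, num_rp++, num_ip++, traces_rp++, traces_ip++) {
    *num_rp += (1.0f - up_interp) * traces_rp[0] + up_interp * traces_rp[1];
    *num_ip += (1.0f - up_interp) * traces_ip[0] + up_interp * traces_ip[1];
}

/* indexed form (your suggestion): one index register; SIB addressing
   such as "0x04(%rBase,%rIndex,4)" reaches element k+1 directly */
for (k = 0; k < i_iter; k++) {
    num_rp[k] += (1.0f - up_interp) * traces_rp[k] + up_interp * traces_rp[k+1];
    num_ip[k] += (1.0f - up_interp) * traces_ip[k] + up_interp * traces_ip[k+1];
}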
If the revised code is still generating something like the disassembly shown in your prior post, then something unexpected is going on. (Note to others: his .jpg screenshots are sorted in source line number order, not in address order.)
What are your command line options when you build?
The reason I ask is that the compiler appears not to have moved the loop-invariant code out of the loop:
(1.0f-up_interp)
The above appears to be recalculated within the loop.
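Hoisted by hand, it would look something like this (a sketch reusing your variable names; down_interp is a name I just made up):

const float down_interp = 1.0f - up_interp;  /* computed once, before the loop */
for (k = 0; k < i_iter; k++) {
    num_rp[k] += down_interp * traces_rp[k] + up_interp * traces_rp[k+1];
    num_ip[k] += down_interp * traces_ip[k] + up_interp * traces_ip[k+1];
}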
And the restrict pointers appear to be reloaded and stored within the loop (these are the movq instructions).
Can you upload a little reproducer (small main and function call)?
Jim Dempsey
jimdempseyatthecove wrote:
If the revised code is still generating something like the disassembly shown in your prior post, then something unexpected is going on. (Note to others: his .jpg screenshots are sorted in source line number order, not in address order.)
What are your command line options when you build?
The reason I ask is that the compiler appears not to have moved the loop-invariant code out of the loop:
(1.0f-up_interp)
The above appears to be recalculated within the loop.
And the restrict pointers appear to be reloaded and stored within the loop (these are the movq instructions).
Can you upload a little reproducer (small main and function call)?
Jim Dempsey
Hi Jim, it is quite hard to isolate that loop and give you a small main with a function call, because the code I posted is part of a big project. Anyway, my compilation flags are: "-xHost -O3 -restrict".
Then can you post the VTune assembly of that loop, sorted not by line number but by address? This will help isolate the loop peel from the main body of the loop.
Note, you can select, mark, and copy the text of the Assembly view, then use the {...} code button to paste it as plain text.
Address      Source Line  Assembly                               CPU Time: Total  CPU Time: Self
0x140001ba3  152          cmp rcx, 0x5f5e100                     0.010s           0.010s
0x140001baa  152          jle 0x140001b31 <Block 85>
0x140001bdc  153          call 0x14000aa80 <omp_get_wtime>
0x140001be1  153          movaps xmm11, xmm0
0x140001be5  153          movsd qword ptr [rip+0x10e052], xmm11
0x140001c32  154          subsd xmm11, qword ptr [rip+0x10dfdd]
0x140001c3b  154          movsd qword ptr [rip+0x10e004], xmm11
0x140001bee  155          mov rsi, rsp
0x140001bf1  155          lea r9, ptr [rip+0x9854]
0x140001bf8  155          lea rcx, ptr [rsp+0x1b0]
0x140001c00  155          mov edx, 0xffffffff
0x140001c05  155          mov r8, 0x1208384ff00
0x140001c0f  155          lea rax, ptr [rip+0x983e]
0x140001c16  155          lea rbp, ptr [rsp+0x60]
0x140001c1b  155          mov qword ptr [rcx], 0x0
0x140001c22  155          mov qword ptr [rbp], 0x19
0x140001c2a  155          mov qword ptr [rbp+0x8], rax
0x140001c2e  155          mov qword ptr [rsi+0x20], rbp
The above was a (mouse) select, copy (Ctrl-C), and paste into the {...} code dialog of IDZ.
Jim Dempsey
Hi Jim, I made a terrible mistake a few posts ago. I posted the wrong code, missing an if condition that could be the explanation for the lack of vectorization. This is the right code.
for(n = *begin_ntrace; n < *end_ntrace; n++) {
    r_idx = tt[n] * inv_sampling + 1.0f;
    i_idx1 = (int32)r_idx;
    i_idx2 = i_idx1 + dms - low;
    if(i_idx1 > low && i_idx2 < ns) {
        ntr++;
        up_interp = r_idx - (float32)i_idx1;
        i_idx1 -= low + 1;
        {
            float * restrict num_rp = &num_r[0];
            float * restrict num_ip = &num_i[0];
            float * restrict traces_rp = &traces[n].r[i_idx1];
            float * restrict traces_ip = &traces[n].i[i_idx1];
            int i_iter = i_idx2 - i_idx1 + 1; // # iterations
            // *** use loop control variable and index that is scoped inside the for loop
            int k;
            for(k = 0; k < i_iter; k++) {
                num_rp[k] = num_rp[k] + (1.0f-up_interp)*traces_rp[k] + up_interp*traces_rp[k+1];
                num_ip[k] = num_ip[k] + (1.0f-up_interp)*traces_ip[k] + up_interp*traces_ip[k+1];
            }
        }
        denom += traces[n].r[i_idx2] * traces[n].r[i_idx2]
               + traces[n].i[i_idx2] * traces[n].i[i_idx2];
    }
}
And this is the Assembly generated:
Address  Source Line  Assembly
0x3eee                Block 12:
0x3eee   1,385        movsxdl (%rcx), %r10
0x3ef1   1,385        xor %ecx, %ecx
0x3ef3   1,385        movsxdl (%rsi), %rsi
0x3ef6   1,385        xor %eax, %eax
0x3ef8   1,385        cmp %r10, %rsi
0x3efb   1,385        jnl 0x433a <Block 48>
0x3f01                Block 13:
0x3f01   1,391        movl 0x204234(%rip), %r13d
0x3f08   1,386        lea (%r9,%rsi,4), %r15
0x3f0c   1,389        movl 0x204221(%rip), %r11d
0x3f13   1,389        mov %r12d, %r14d
0x3f16   1,391        movl %r13d, -0x60(%rbp)
0x3f1a   1,425        mov %r8, %r9
0x3f1d   1,396        movsxd %r11d, %r13
0x3f20   1,385        sub %rsi, %r10
0x3f23   1,403        shl $0x4, %rsi
0x3f27   1,389        sub %r11d, %r14d
0x3f2a   1,425        sub %r13, %r9
0x3f2d   1,386        vmovssl 0x2041f7(%rip), %xmm1
0x3f35   1,386        vmovssl 0x1ba7(%rip), %xmm0
0x3f3d   1,403        addq 0x2041bc(%rip), %rsi
0x3f44   1,425        movq %r8, -0xc8(%rbp)
0x3f4b   1,425        movq %r9, -0x80(%rbp)
0x3f4f   1,425        movq %r13, -0x70(%rbp)
0x3f53   1,425        movl %r14d, -0x58(%rbp)
0x3f57   1,425        movq %r10, -0x50(%rbp)
0x3f5b   1,425        movq %r15, -0x40(%rbp)
0x3f5f   1,425        movl %r11d, -0x48(%rbp)
0x3f63   1,425        movq %rdx, -0x90(%rbp)
0x3f6a   1,425        movq %rdi, -0x98(%rbp)
0x3f71   1,425        movl %r12d, -0xd0(%rbp)
0x3f78                Block 14:
0x3f78   1,386        vmovaps %xmm0, %xmm4
0x3f7c   1,386        movq -0x40(%rbp), %rdx
0x3f80   1,386        vfmadd231ssl (%rdx,%rcx,4), %xmm1, %xmm4
0x3f86   1,388        vcvttss2si %xmm4, %r13d    0.4%
0x3f8a   1,391        cmpl -0x48(%rbp), %r13d
0x3f8e   1,391        jle 0x430d <Block 46>
0x3f94                Block 15:
0x3f94   1,389        movl -0x58(%rbp), %edx    0.1%
0x3f97   1,389        lea (%rdx,%r13,1), %r15d
0x3f9b   1,391        cmpl -0x60(%rbp), %r15d
0x3f9f   1,391        jnl 0x430d <Block 46>
0x3fa5                Block 16:
0x3fa5   1,394        vxorps %xmm5, %xmm5, %xmm5
0x3fa9   1,394        vcvtsi2ss %r13d, %xmm5, %xmm5
0x3fae   1,396        movsxd %r13d, %r13
0x3fb1   1,394        vsubss %xmm5, %xmm4, %xmm6
0x3fb5   1,396        mov %r13, %r14
0x3fb8   1,396        subq -0x70(%rbp), %r14
0x3fbc   1,403        movq (%rax,%rsi,1), %rdx
0x3fc0   1,404        movq 0x8(%rax,%rsi,1), %r12
0x3fc5   1,392        incl -0x78(%rbp)
0x3fc8   1,396        lea -0x1(%r14), %r11
0x3fcc   1,405        mov %r11d, %r9d    0.1%
0x3fcf   1,403        lea -0x4(%rdx,%r14,4), %r10
0x3fd4   1,405        neg %r9d    0.1%
0x3fd7   1,404        lea -0x4(%r12,%r14,4), %r8
0x3fdc   1,405        add %r15d, %r9d
0x3fdf   1,403        movq %rdx, -0x68(%rbp)
0x3fe3   1,408        test %r9d, %r9d
0x3fe6   1,408        jle 0x40fe <Block 25>
0x3fec                Block 17:
0x3fec   1,408        movsxd %r9d, %rdi
0x3fef   1,408        cmp $0x8, %rdi
0x3ff3   1,408        jl 0x4501 <Block 69>
0x3ff9                Block 18:
0x3ff9   1,408        mov %r9d, %edx
0x3ffc   1,409        vsubss %xmm6, %xmm0, %xmm5
0x4000   1,368        vbroadcastss %xmm6, %ymm4
0x4005   1,408        movq $0x0, -0x88(%rbp)
0x4010   1,408        and $0xfffffff8, %edx
0x4013   1,368        movq %rsi, -0xb0(%rbp)
0x401a   1,368        movq %rax, -0xa8(%rbp)
0x4021   1,368        movq %rcx, -0xa0(%rbp)
0x4028   1,408        movsxd %edx, %rdx
0x402b   1,409        vbroadcastss %xmm5, %ymm5
0x4030   1,368        movq -0x88(%rbp), %rax
0x4037   1,368        movq -0x90(%rbp), %rcx
0x403e   1,368        movq -0x98(%rbp), %rsi
0x4045                Block 19:
0x4045   1,409        vmovupsy (%r10,%rax,4), %ymm7
0x404b   1,410        vmovupsy (%r8,%rax,4), %ymm8
0x4051   1,409        vfmadd213psy (%rsi,%rax,4), %ymm5, %ymm7
0x4057   1,410        vfmadd213psy (%rcx,%rax,4), %ymm5, %ymm8
0x405d   1,409        vfmadd231psy 0x4(%r10,%rax,4), %ymm4, %ymm7
0x4064   1,410        vfmadd231psy 0x4(%r8,%rax,4), %ymm4, %ymm8
0x406b   1,409        vmovupsy %ymm7, (%rsi,%rax,4)
0x4070   1,410        vmovupsy %ymm8, (%rcx,%rax,4)
0x4075   1,408        add $0x8, %rax
0x4079   1,408        cmp %rdx, %rax
0x407c   1,408        jb 0x4045 <Block 19>
0x407e                Block 20:
0x407e   1,408        movq -0xb0(%rbp), %rsi
0x4085   1,408        movq -0xa8(%rbp), %rax
0x408c   1,408        movq -0xa0(%rbp), %rcx
0x4093                Block 21:
0x4093   1,408        cmp %rdi, %rdx
0x4096   1,408        jnb 0x40fe <Block 25>
0x4098                Block 22:
0x4098   1,409        movq %rax, -0xa8(%rbp)
0x409f   1,409        vsubss %xmm6, %xmm0, %xmm4
0x40a3   1,409        movq %rcx, -0xa0(%rbp)
0x40aa   1,409        movq -0x90(%rbp), %rax
0x40b1   1,409        movq -0x98(%rbp), %rcx
0x40b8                Block 23:
0x40b8   1,409        vmovssl (%r10,%rdx,4), %xmm5
0x40be   1,410        vmovssl (%r8,%rdx,4), %xmm7
0x40c4   1,409        vfmadd213ssl (%rcx,%rdx,4), %xmm4, %xmm5
0x40ca   1,410        vfmadd213ssl (%rax,%rdx,4), %xmm4, %xmm7
0x40d0   1,409        vfmadd231ssl 0x4(%r10,%rdx,4), %xmm6, %xmm5
0x40d7   1,410        vfmadd231ssl 0x4(%r8,%rdx,4), %xmm6, %xmm7
0x40de   1,409        vmovssl %xmm5, (%rcx,%rdx,4)
0x40e3   1,410        vmovssl %xmm7, (%rax,%rdx,4)
0x40e8   1,408        inc %rdx
0x40eb   1,408        cmp %rdi, %rdx
0x40ee   1,408        jb 0x40b8 <Block 23>
0x40f0                Block 24:
0x40f0   1,408        movq -0xa8(%rbp), %rax
0x40f7   1,408        movq -0xa0(%rbp), %rcx
Sorry for the error; I have two very similar pieces of code and I confused the two sources. My apologies.
In the above assembly listing, the loop in Block 19 (vector) is the loop that runs _after_ the code figures out the data is aligned. Block 23 (scalar) runs if the data is not aligned.
Align the allocations for num_r and num_i, and then use __assume_aligned on the two pointers.
It does not look like you can control i_idx1; therefore, traces_rp and traces_ip must not be "assumed aligned" (unless you can assert that i_idx1 produces an aligned index).
What this can possibly do is eliminate some of the code that determines which path to take.
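For example, a sketch in Intel compiler syntax (64 bytes gives cache line alignment, which also satisfies the 32-byte ymm loads):

__declspec(align(64)) float32 num_r[dms];
__declspec(align(64)) float32 num_i[dms];
/* ... inside the inner scope ... */
float * restrict num_rp = &num_r[0];
float * restrict num_ip = &num_i[0];
__assume_aligned(num_rp, 64);
__assume_aligned(num_ip, 64);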
Jim Dempsey
jimdempseyatthecove wrote:
In the above assembly listing, the loop in Block 19 (vector) is the loop that runs _after_ the code figures out the data is aligned. Block 23 (scalar) runs if the data is not aligned.
Align the allocations for num_r and num_i, and then use __assume_aligned on the two pointers.
It does not look like you can control i_idx1; therefore, traces_rp and traces_ip must not be "assumed aligned" (unless you can assert that i_idx1 produces an aligned index).
What this can possibly do is eliminate some of the code that determines which path to take.
Jim Dempsey
Is it right as is?
__declspec(align(16)) float32 num_r[dms];
__declspec(align(16)) float32 num_i[dms];

for(n = *begin_ntrace; n < *end_ntrace; n++) {
    r_idx = tt[n] * inv_sampling + 1.0f;
    i_idx1 = (int32)r_idx;
    i_idx2 = i_idx1 + dms - low;
    if(i_idx1 > low && i_idx2 < ns) {
        ntr++;
        up_interp = r_idx - (float32)i_idx1;
        i_idx1 -= low + 1;
        {
            float * restrict num_rp = &num_r[0];
            float * restrict num_ip = &num_i[0];
            __assume_aligned(num_rp, 16);
            __assume_aligned(num_ip, 16);
            float * restrict traces_rp = &traces[n].r[i_idx1];
            float * restrict traces_ip = &traces[n].i[i_idx1];
            int i_iter = i_idx2 - i_idx1 + 1; // # iterations
            // *** use loop control variable and index that is scoped inside the for loop
            int k;
            for(k = 0; k < i_iter; k++) {
                num_rp[k] = num_rp[k] + (1.0f-up_interp)*traces_rp[k] + up_interp*traces_rp[k+1];
                num_ip[k] = num_ip[k] + (1.0f-up_interp)*traces_ip[k] + up_interp*traces_ip[k+1];
            }
        }
        denom += traces[n].r[i_idx2] * traces[n].r[i_idx2]
               + traces[n].i[i_idx2] * traces[n].i[i_idx2];
    }
}
If yes, the code is a little slower. And, from what you said, may the assembly contain more than one version of a block, with the most appropriate one chosen at run time?
Adding the alignment should not have affected the performance.
What likely happened is that the code alignment changed such that an additional instruction cache line was crossed, and/or the top of the predominant loop was no longer favorably aligned. In some programs I've seen a 3%-5% difference for a specific loop from this. The compiler does not have a #pragma that tells it to pad the code to a cache line boundary.
What does VTune show as the predominant path?
Does the for(n= loop produce overlapping traces?
How many times does the for(n= loop iterate (on average)?
Is there sufficient potential return to warrant additional coding effort on this code?
How often are traces_rp and traces_ip read versus written?
Can you safely accumulate values into num_rp and num_ip that precede i_idx1?
IOW, can you back i_idx1 up to a cache line boundary? (Same with running i_idx2 out further to end on a cache line.)
Or, must they be excluded?
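To illustrate what I mean by backing up to a cache line, a sketch (16 floats per 64-byte line; i_lo and i_hi are names I just made up):

/* Widen [i_idx1, i_idx2] to whole 64-byte cache lines. Only safe if
   the extra accumulated elements cannot corrupt your results. */
int i_lo = i_idx1 & ~15;          /* round the start down to a line boundary */
int i_hi = (i_idx2 + 15) & ~15;   /* round the end up to a line boundary */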
Jim Dempsey
jimdempseyatthecove wrote:
Adding the alignment should not have affected the performance.
What likely happened is that the code alignment changed such that an additional instruction cache line was crossed, and/or the top of the predominant loop was no longer favorably aligned. In some programs I've seen a 3%-5% difference for a specific loop from this. The compiler does not have a #pragma that tells it to pad the code to a cache line boundary.
What does VTune show as the predominant path?
Does the for(n= loop produce overlapping traces?
How many times does the for(n= loop iterate (on average)?
Is there sufficient potential return to warrant additional coding effort on this code?
How often are traces_rp and traces_ip read versus written?
Can you safely accumulate values into num_rp and num_ip that precede i_idx1?
IOW, can you back i_idx1 up to a cache line boundary? (Same with running i_idx2 out further to end on a cache line.)
Or must they be excluded?
Jim Dempsey
Hi Jim, thanks very much for your help. I'll try to reply to your questions. The program spends about 50% of its total time in this piece of code, so it is very important to optimize it. The traces are written just once, in the initialization phase; after that they are only read. The external loop runs about 3,000 iterations, the internal one just 20 or so. The if condition is true sometimes for 90% of the traces and sometimes for just 10%.
I think I can't safely accumulate values into num_rp and num_ip that precede i_idx1, but I'm not totally sure. Finally, you ask: "can you back up i_idx1 to a cache line?" I don't understand this question. Unfortunately, the physics behind this piece of code requires very scattered access to the traces, so it is not possible to optimize from the algorithmic point of view.
Here is a suggestion:
1) Make two copies of each trace array: traces_rp, traces_rp1, traces_ip, traces_ip1, where the current traces_rp and traces_ip are cache line aligned, and traces_rp1 and traces_ip1 are copies of traces_rp and traces_ip *** but starting at index [1]. IOW, for each array (rp/ip) you have two aligned arrays.
2) Set your aligned restrict pointers to the [0]'th index of each array, and replace the for(k=0 with for(k=i_idx1. This way the compiler knows that the bases of the arrays are aligned, and then, regardless of the starting index, it can generate a scalar peel loop that iterates up to the alignment boundary. The ...rp and ...rp1 arrays will always align at the same k, thus permitting the code to shift to the vector loads.
3) Caution: in 32-bit programming you have a very small number of general purpose registers. Examine the generated code as before. You may need to break the inner loop into two loops (the compiler may do this for you; if not, do it by hand).
If the inner loop runs ~20 iterations and your vector width is 8, then on average the scalar peel loop will iterate half the vector width, the full vector loop will run the interior, and you may have a remainder to run in scalar. Note, if your instruction set is AVX2 (I believe it is), then the masked variants of the instructions can be used for the peel and remainder, potentially reducing the loop to three or four operations.
The only disadvantage is the ...rp and ...ip (and ...1) arrays now consume 2x the cache.
It is hard to tell without making test runs whether the increased vectorization will outweigh the loss of some cache.
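Setting up the copies for one trace might look like this (a sketch; _mm_malloc provides the 64-byte alignment, and using ns as the trace length is an assumption):

#include <string.h>      /* memcpy */
#include <xmmintrin.h>   /* _mm_malloc */

float *traces_rp  = (float*)_mm_malloc(ns * sizeof(float), 64);
float *traces_rp1 = (float*)_mm_malloc(ns * sizeof(float), 64);
memcpy(traces_rp,  &traces[n].r[0], ns * sizeof(float));        /* elements 0..ns-1 */
memcpy(traces_rp1, &traces[n].r[1], (ns - 1) * sizeof(float));  /* elements 1..ns-1 */
/* ...likewise for the imaginary parts, traces_ip and traces_ip1 */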
Jim Dempsey
Could you check whether I understood your suggestion correctly?
float * restrict num_rp = &num_r[0];
float * restrict num_ip = &num_i[0];
float * restrict traces_rp  = &traces.r[0];
float * restrict traces_rp1 = &traces_copy.r[1];
float * restrict traces_ip  = &traces.i[0];
float * restrict traces_ip1 = &traces_copy.i[1];
int i_iter = i_idx2 - i_idx1; // # iterations
// *** use loop control variable and index that is scoped inside the for loop
int k;
for(k = i_idx1; k < i_idx2; k++) {
    num_rp[k - i_idx1] = num_rp[k - i_idx1] + (1.0f-up_interp)*traces_rp[k] + up_interp*traces_rp1[k];
    num_ip[k - i_idx1] = num_ip[k - i_idx1] + (1.0f-up_interp)*traces_ip[k] + up_interp*traces_ip1[k];
}
No,
// num_r is an array of floats that is cache line aligned
// num_i is an array of floats that is cache line aligned
// traces_r is an array of floats that is cache line aligned
// traces_r1 is an array of floats that is cache line aligned...
//   and contains the floats traces_r[1] through traces_r[last_r] (whatever last_r is)
// traces_i is an array of floats that is cache line aligned
// traces_i1 is an array of floats that is cache line aligned...
//   and contains the floats traces_i[1] through traces_i[last_i] (whatever last_i is)
for(int k = i_idx1; k <= i_idx2; k++) {
    num_r[k] = num_r[k] + (1.0f-up_interp)*traces_r[k] + up_interp*traces_r1[k];
    num_i[k] = num_i[k] + (1.0f-up_interp)*traces_i[k] + up_interp*traces_i1[k];
}
The above will require at least 7 GP registers; therefore you may need to perform it in two loops (in 32-bit mode; 64-bit should be OK):
for(int k = i_idx1; k <= i_idx2; k++) {
    num_r[k] = num_r[k] + (1.0f-up_interp)*traces_r[k] + up_interp*traces_r1[k];
}
for(int k = i_idx1; k <= i_idx2; k++) {
    num_i[k] = num_i[k] + (1.0f-up_interp)*traces_i[k] + up_interp*traces_i1[k];
}
Jim Dempsey