Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Shifted load

unrue

Dear Intel developers,

 

I'm using AVX with the Intel 15.0.1 compiler. I need to load some floats shifted by one in order to do some AVX operations:

 

sample[j] = traces.r[j] + traces.r[j+1];

 

 

At the moment, I do two distinct AVX loads:

 

intr_1 = _mm256_load_ps(&traces.r[j]);
intr_2 = _mm256_load_ps(&traces.r[j+1]);

In this way, I calculate sample in two distinct vectorized phases, the first part for j and the second part for j+1. It works, but it is quite slow. In fact, with the second load I load seven elements I already have and just one new element. So maybe I could do a load, then a left shift, and finally a second load of one element. What is the best strategy? Thanks in advance.
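
To make the question concrete, here is a simplified, self-contained sketch of the kind of loop I mean (the names are invented, and the j+1 load is shown as an unaligned load, since &traces.r[j+1] cannot be 32-byte aligned):

#include <immintrin.h>

/* Sketch only: sample[j] = traces_r[j] + traces_r[j+1] for j = 0..n-1.
   Assumes traces_r holds at least n+8 floats, is 32-byte aligned,
   and n is a multiple of 8. */
void shifted_sum(const float *traces_r, float *sample, int n)
{
    for (int j = 0; j < n; j += 8) {
        __m256 intr_1 = _mm256_load_ps(&traces_r[j]);       /* elements j .. j+7   */
        __m256 intr_2 = _mm256_loadu_ps(&traces_r[j + 1]);  /* elements j+1 .. j+8 */
        _mm256_storeu_ps(&sample[j], _mm256_add_ps(intr_1, intr_2));
    }
}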

unrue

jimdempseyatthecove wrote:

FWIW, the IA32 instruction set (as well as Intel64), has the capability of SIB (Scale, Index, Base)

Meaning that traces_rp[k+1] requires only one instruction to fetch, to store, or to use as an operand of the add/sub/div instructions.

In the above, the instruction would contain "0x04(%rBase, %rIndex, 4)" as either the source or destination of an instruction.

By using the pointer++ (4 times), you are complicating the issue.

Jim Dempsey

Hi Jim,

using your code, the application obtained no further performance gain.

jimdempseyatthecove

If the revised code is still generating something like the disassembly shown in your prior post, then something unexpected is going on. (note to others, his .jpg screenshots are sorted in source line number order, not in address order).

What are your command line options when you build?

The reason I ask is that the compiler optimization appears not to have moved loop-invariant code out of the loop:

(1.0f-up_interp)

The above appears to be recalculated within the loop.

And the restrict pointers appear to be reloaded and stored within the loop (these are the movq instructions).
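
For illustration, hand-hoisting the invariant and keeping the pointers in locals would look roughly like this (a sketch only, reusing the names from your loop, not your actual code):

/* Sketch: the weights are computed once, outside the loop, and the
   restrict-qualified parameters let the compiler keep the pointers in
   registers instead of reloading them on every iteration. */
void accumulate(float * restrict num_rp, float * restrict num_ip,
                const float * restrict traces_rp, const float * restrict traces_ip,
                float up_interp, int i_iter)
{
    const float w0 = 1.0f - up_interp;   /* loop invariant */
    const float w1 = up_interp;          /* loop invariant */
    for (int k = 0; k < i_iter; k++) {
        num_rp[k] += w0 * traces_rp[k] + w1 * traces_rp[k + 1];
        num_ip[k] += w0 * traces_ip[k] + w1 * traces_ip[k + 1];
    }
}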

Can you upload a little reproducer (a small main and the function call)?

Jim Dempsey

 

unrue

jimdempseyatthecove wrote:

If the revised code is still generating something like the disassembly shown in your prior post, then something unexpected is going on. (note to others, his .jpg screenshots are sorted in source line number order, not in address order).

What are your command line options when you build?

The reason I ask is that the compiler optimization appears not to have moved loop-invariant code out of the loop:

(1.0f-up_interp)

The above appears to be recalculated within the loop.

And the restrict pointers appear to be reloaded and stored within the loop (these are the movq instructions).

Can you upload a little reproducer (a small main and the function call)?

Jim Dempsey

 

Hi Jim, it is quite hard to isolate that loop and give you a small main with a function call, because the code I posted is part of a big project. Anyway, my compilation flags are: "-xHost -O3 -restrict".

jimdempseyatthecove

Then can you post the VTune assembly of that loop, sorted not by line number but by address? This will help isolate the loop peel from the main body of the loop.

Note: you can select, mark, and copy the text of the assembly, then use the {...} code button to paste it as plain text.

Address Source Line Assembly CPU Time: Total CPU Time: Self
0x140001ba3 152 cmp rcx, 0x5f5e100 0.010s 0.010s
0x140001baa 152 jle 0x140001b31 <Block 85>  
0x140001bdc 153 call 0x14000aa80 <omp_get_wtime>  
0x140001be1 153 movaps xmm11, xmm0  
0x140001be5 153 movsd qword ptr [rip+0x10e052], xmm11  
0x140001c32 154 subsd xmm11, qword ptr [rip+0x10dfdd]  
0x140001c3b 154 movsd qword ptr [rip+0x10e004], xmm11  
0x140001bee 155 mov rsi, rsp  
0x140001bf1 155 lea r9, ptr [rip+0x9854]  
0x140001bf8 155 lea rcx, ptr [rsp+0x1b0]  
0x140001c00 155 mov edx, 0xffffffff  
0x140001c05 155 mov r8, 0x1208384ff00  
0x140001c0f 155 lea rax, ptr [rip+0x983e]  
0x140001c16 155 lea rbp, ptr [rsp+0x60]  
0x140001c1b 155 mov qword ptr [rcx], 0x0  
0x140001c22 155 mov qword ptr [rbp], 0x19  
0x140001c2a 155 mov qword ptr [rbp+0x8], rax  
0x140001c2e 155 mov qword ptr [rsi+0x20], rbp  

The above was a (mouse) select, copy (Ctrl-C), and paste into the {...} code dialog of IDZ.

Jim Dempsey

unrue

Hi Jim, I made a terrible mistake a few posts ago. I posted the wrong code, without an if condition that could explain the lack of vectorization. This is the right code.

 

 

for(n = *begin_ntrace; n < *end_ntrace; n++) {
      r_idx   = tt * inv_sampling + 1.0f;

      i_idx1  = (int32)r_idx;
      i_idx2  = i_idx1 + dms - low;

      if(i_idx1 > low && i_idx2 < ns) {
          ntr++;

          up_interp = r_idx  - (float32)i_idx1;

          i_idx1  -= low + 1;
 
          {
  
              float * restrict num_rp = &num_r[0];
              float * restrict num_ip = &num_i[0];
              float * restrict traces_rp = &traces.r[i_idx1];
              float * restrict traces_ip = &traces.i[i_idx1];
              int i_iter = i_idx2 - i_idx1 + 1; // # iterations
              // *** use loop control variable and index that is scoped inside the for loop
              int k;
              for(k = 0; k < i_iter; k++) {
                  num_rp[k] = num_rp[k] + (1.0f-up_interp)*(traces_rp[k]) + up_interp*traces_rp[k+1];
                  num_ip[k] = num_ip[k] + (1.0f-up_interp)*(traces_ip[k]) + up_interp*traces_ip[k+1];
              }

          }

          denom += traces.r[i_idx2] * traces.r[i_idx2] + traces.i[i_idx2] * traces.i[i_idx2];
 
      }

}

 

And this is the generated assembly:

 

Address	Source Line	Assembly	
0x3eee		Block 12:				
0x3eee	1,385	movsxdl  (%rcx), %r10			
0x3ef1	1,385	xor %ecx, %ecx				
0x3ef3	1,385	movsxdl  (%rsi), %rsi				
0x3ef6	1,385	xor %eax, %eax			
0x3ef8	1,385	cmp %r10, %rsi			
0x3efb	1,385	jnl 0x433a <Block 48>				
0x3f01		Block 13:				
0x3f01	1,391	movl  0x204234(%rip), %r13d				
0x3f08	1,386	lea (%r9,%rsi,4), %r15				
0x3f0c	1,389	movl  0x204221(%rip), %r11d				
0x3f13	1,389	mov %r12d, %r14d				
0x3f16	1,391	movl  %r13d, -0x60(%rbp)				
0x3f1a	1,425	mov %r8, %r9				
0x3f1d	1,396	movsxd %r11d, %r13				
0x3f20	1,385	sub %rsi, %r10				
0x3f23	1,403	shl $0x4, %rsi				
0x3f27	1,389	sub %r11d, %r14d				
0x3f2a	1,425	sub %r13, %r9				
0x3f2d	1,386	vmovssl  0x2041f7(%rip), %xmm1				
0x3f35	1,386	vmovssl  0x1ba7(%rip), %xmm0				
0x3f3d	1,403	addq  0x2041bc(%rip), %rsi				
0x3f44	1,425	movq  %r8, -0xc8(%rbp)				
0x3f4b	1,425	movq  %r9, -0x80(%rbp)				
0x3f4f	1,425	movq  %r13, -0x70(%rbp)				
0x3f53	1,425	movl  %r14d, -0x58(%rbp)				
0x3f57	1,425	movq  %r10, -0x50(%rbp)				
0x3f5b	1,425	movq  %r15, -0x40(%rbp)				
0x3f5f	1,425	movl  %r11d, -0x48(%rbp)				
0x3f63	1,425	movq  %rdx, -0x90(%rbp)				
0x3f6a	1,425	movq  %rdi, -0x98(%rbp)				
0x3f71	1,425	movl  %r12d, -0xd0(%rbp)				
0x3f78		Block 14:				
0x3f78	1,386	vmovaps %xmm0, %xmm4	
0x3f7c	1,386	movq  -0x40(%rbp), %rdx	
0x3f80	1,386	vfmadd231ssl  (%rdx,%rcx,4), %xmm1, %xmm4		
0x3f86	1,388	vcvttss2si %xmm4, %r13d	0.4%	
0x3f8a	1,391	cmpl  -0x48(%rbp), %r13d	
0x3f8e	1,391	jle 0x430d <Block 46>				
0x3f94		Block 15:				
0x3f94	1,389	movl  -0x58(%rbp), %edx	0.1%	
0x3f97	1,389	lea (%rdx,%r13,1), %r15d	
0x3f9b	1,391	cmpl  -0x60(%rbp), %r15d	
0x3f9f	1,391	jnl 0x430d <Block 46>				
0x3fa5		Block 16:				
0x3fa5	1,394	vxorps %xmm5, %xmm5, %xmm5	
0x3fa9	1,394	vcvtsi2ss %r13d, %xmm5, %xmm5		
0x3fae	1,396	movsxd %r13d, %r13	
0x3fb1	1,394	vsubss %xmm5, %xmm4, %xmm6		
0x3fb5	1,396	mov %r13, %r14	
0x3fb8	1,396	subq  -0x70(%rbp), %r14		
0x3fbc	1,403	movq  (%rax,%rsi,1), %rdx	
0x3fc0	1,404	movq  0x8(%rax,%rsi,1), %r12	
0x3fc5	1,392	incl  -0x78(%rbp)	
0x3fc8	1,396	lea -0x1(%r14), %r11	
0x3fcc	1,405	mov %r11d, %r9d	0.1%	
0x3fcf	1,403	lea -0x4(%rdx,%r14,4), %r10				
0x3fd4	1,405	neg %r9d	0.1%	
0x3fd7	1,404	lea -0x4(%r12,%r14,4), %r8	
0x3fdc	1,405	add %r15d, %r9d		
0x3fdf	1,403	movq  %rdx, -0x68(%rbp)		
0x3fe3	1,408	test %r9d, %r9d		
0x3fe6	1,408	jle 0x40fe <Block 25>				
0x3fec		Block 17:				
0x3fec	1,408	movsxd %r9d, %rdi		
0x3fef	1,408	cmp $0x8, %rdi		
0x3ff3	1,408	jl 0x4501 <Block 69>				
0x3ff9		Block 18:				
0x3ff9	1,408	mov %r9d, %edx		
0x3ffc	1,409	vsubss %xmm6, %xmm0, %xmm5	
0x4000	1,368	vbroadcastss %xmm6, %ymm4		
0x4005	1,408	movq  $0x0, -0x88(%rbp)		
0x4010	1,408	and $0xfffffff8, %edx		
0x4013	1,368	movq  %rsi, -0xb0(%rbp)	
0x401a	1,368	movq  %rax, -0xa8(%rbp)		
0x4021	1,368	movq  %rcx, -0xa0(%rbp)		
0x4028	1,408	movsxd %edx, %rdx		
0x402b	1,409	vbroadcastss %xmm5, %ymm5	
0x4030	1,368	movq  -0x88(%rbp), %rax		
0x4037	1,368	movq  -0x90(%rbp), %rcx		
0x403e	1,368	movq  -0x98(%rbp), %rsi		
0x4045		Block 19:				
0x4045	1,409	vmovupsy  (%r10,%rax,4), %ymm7	
0x404b	1,410	vmovupsy  (%r8,%rax,4), %ymm8	
0x4051	1,409	vfmadd213psy  (%rsi,%rax,4), %ymm5, %ymm7	
0x4057	1,410	vfmadd213psy  (%rcx,%rax,4), %ymm5, %ymm8	
0x405d	1,409	vfmadd231psy  0x4(%r10,%rax,4), %ymm4, %ymm7	
0x4064	1,410	vfmadd231psy  0x4(%r8,%rax,4), %ymm4, %ymm8	
0x406b	1,409	vmovupsy  %ymm7, (%rsi,%rax,4)	
0x4070	1,410	vmovupsy  %ymm8, (%rcx,%rax,4)		
0x4075	1,408	add $0x8, %rax	
0x4079	1,408	cmp %rdx, %rax		
0x407c	1,408	jb 0x4045 <Block 19>				
0x407e		Block 20:				
0x407e	1,408	movq  -0xb0(%rbp), %rsi	
0x4085	1,408	movq  -0xa8(%rbp), %rax		
0x408c	1,408	movq  -0xa0(%rbp), %rcx	
0x4093		Block 21:				
0x4093	1,408	cmp %rdi, %rdx		
0x4096	1,408	jnb 0x40fe <Block 25>				
0x4098		Block 22:				
0x4098	1,409	movq  %rax, -0xa8(%rbp)		
0x409f	1,409	vsubss %xmm6, %xmm0, %xmm4		
0x40a3	1,409	movq  %rcx, -0xa0(%rbp)	
0x40aa	1,409	movq  -0x90(%rbp), %rax				
0x40b1	1,409	movq  -0x98(%rbp), %rcx	
0x40b8		Block 23:				
0x40b8	1,409	vmovssl  (%r10,%rdx,4), %xmm5	
0x40be	1,410	vmovssl  (%r8,%rdx,4), %xmm7	
0x40c4	1,409	vfmadd213ssl  (%rcx,%rdx,4), %xmm4, %xmm5		
0x40ca	1,410	vfmadd213ssl  (%rax,%rdx,4), %xmm4, %xmm7	
0x40d0	1,409	vfmadd231ssl  0x4(%r10,%rdx,4), %xmm6, %xmm5	
0x40d7	1,410	vfmadd231ssl  0x4(%r8,%rdx,4), %xmm6, %xmm7	
0x40de	1,409	vmovssl  %xmm5, (%rcx,%rdx,4)	
0x40e3	1,410	vmovssl  %xmm7, (%rax,%rdx,4)	
0x40e8	1,408	inc %rdx	
0x40eb	1,408	cmp %rdi, %rdx	
0x40ee	1,408	jb 0x40b8 <Block 23>				
0x40f0		Block 24:				
0x40f0	1,408	movq  -0xa8(%rbp), %rax		
0x40f7	1,408	movq  -0xa0(%rbp), %rcx	

Sorry for the error, but I have two very similar pieces of code and I confused the two sources. My apologies.

jimdempseyatthecove

In the above assembly listing, the loop block 19 (vector) is the loop that runs _after_ it figures out the data is aligned. Block 23 (scalar) runs if the data is not aligned.

Align the allocations for num_r and num_i, and then use assume aligned on the two pointers.
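
For example, something along these lines (a sketch only; 64 bytes is one cache line, and _mm_malloc is one way to get the alignment if the arrays cannot be static):

#include <immintrin.h>   /* _mm_malloc / _mm_free on the Intel compiler */

/* Sketch: cache-line (64-byte) aligned accumulators, plus alignment
   assertions so the compiler can drop the runtime alignment checks.
   Allocate once, outside the hot loops, if possible. */
static void example(int dms)
{
    float * restrict num_rp = (float *)_mm_malloc(dms * sizeof(float), 64);
    float * restrict num_ip = (float *)_mm_malloc(dms * sizeof(float), 64);
    __assume_aligned(num_rp, 64);
    __assume_aligned(num_ip, 64);

    /* ... the k loop accumulating into num_rp[k] / num_ip[k] goes here ... */

    _mm_free(num_rp);
    _mm_free(num_ip);
}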

It does not look like you can control i_idx1; therefore, traces_rp and traces_ip must not be "assumed aligned" (unless you can assert that i_idx1 produces an aligned index).

What this can possibly do is eliminate some of the code that determines which path to take.

Jim Dempsey

unrue

jimdempseyatthecove wrote:

In the above assembly listing, the loop block 19 (vector) is the loop that runs _after_ it figures out the data is aligned. Block 23 (scalar) runs if the data is not aligned.

Align the allocations for num_r and num_i, and then use assume aligned on the two pointers.

It does not look like you can control i_idx1; therefore, traces_rp and traces_ip must not be "assumed aligned" (unless you can assert that i_idx1 produces an aligned index).

What this can possibly do is eliminate some of the code that determines which path to take.

Jim Dempsey

Is it right as is? 

 

  __declspec(align(16)) float32 num_r[dms];
  __declspec(align(16)) float32 num_i[dms];


for(n = *begin_ntrace; n < *end_ntrace; n++) {
      r_idx   = tt * inv_sampling + 1.0f;

      i_idx1  = (int32)r_idx;
      i_idx2  = i_idx1 + dms - low;

      if(i_idx1 > low && i_idx2 < ns) {
          ntr++;

          up_interp = r_idx  - (float32)i_idx1;

          i_idx1  -= low + 1;
 
          {
  
              float * restrict num_rp = &num_r[0];
              float * restrict num_ip = &num_i[0];
         
              __assume_aligned(num_rp, 16);
              __assume_aligned(num_ip, 16); 

              float * restrict traces_rp = &traces.r[i_idx1];
              float * restrict traces_ip = &traces.i[i_idx1];
              int i_iter = i_idx2 - i_idx1 + 1; // # iterations
              // *** use loop control variable and index that is scoped inside the for loop
              int k;
              for(k = 0; k < i_iter; k++) {
                  num_rp[k] = num_rp[k] + (1.0f-up_interp)*(traces_rp[k]) + up_interp*traces_rp[k+1];
                  num_ip[k] = num_ip[k] + (1.0f-up_interp)*(traces_ip[k]) + up_interp*traces_ip[k+1];
              }

          }

          denom += traces.r[i_idx2] * traces.r[i_idx2] + traces.i[i_idx2] * traces.i[i_idx2];
 
      }

}

If yes, the code is a little slower. And, from what you said, may the assembly contain several versions of a block, with the most appropriate one being chosen?

jimdempseyatthecove

Adding the alignment should not have affected the performance.

What likely happened is that the code alignment may have changed such that an additional instruction cache line was crossed, and/or the top of the predominant loop was not favorably aligned. In some programs I've seen a 3%-5% difference for a specific loop. The compiler does not have a #pragma that tells it to pad the code to a cache line.

What does VTune show as the predominant path?

Does the for(n=...) loop produce overlapping traces?

How many times does the for(n=...) loop iterate (on average)?

Is there sufficient potential return on this code to warrant additional coding effort?

How often are traces_rp and traces_ip read versus written?

Can you safely accumulate values into num_rp and num_ip that precede i_idx1?
IOW, can you back up i_idx1 to a cache line boundary? (And similarly, run i_idx2 out further to end on a cache line?)
Or, must they be excluded?
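
By "back up", I mean something like this sketch (16 floats = one 64-byte cache line; only usable if the extra accumulations before i_idx1 and after i_idx2 are harmless or can be discarded):

/* Sketch only: widen [i_idx1, i_idx2] outward to 16-float boundaries so the
   inner loop starts and ends on a cache line. */
static inline void widen_to_cache_line(int i_idx1, int i_idx2,
                                       int *first, int *last)
{
    *first = i_idx1 & ~15;          /* round down to a multiple of 16 */
    *last  = (i_idx2 + 15) & ~15;   /* round up to a multiple of 16   */
}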

Jim Dempsey

 

unrue

jimdempseyatthecove wrote:

Adding the alignment should not have affected the performance.

What likely happened is that the code alignment may have changed such that an additional instruction cache line was crossed, and/or the top of the predominant loop was not favorably aligned. In some programs I've seen a 3%-5% difference for a specific loop. The compiler does not have a #pragma that tells it to pad the code to a cache line.

What does VTune show as the predominant path?

Does the for(n=...) loop produce overlapping traces?

How many times does the for(n=...) loop iterate (on average)?

Is there sufficient potential return on this code to warrant additional coding effort?

How often are traces_rp and traces_ip read versus written?

Can you safely accumulate values into num_rp and num_ip that precede i_idx1?
IOW, can you back up i_idx1 to a cache line boundary? (And similarly, run i_idx2 out further to end on a cache line?)
Or, must they be excluded?

Jim Dempsey

 

Hi Jim, thanks very much for your help. I'll try to reply to your questions. The program spends about 50% of its total time in this piece of code, so it is very important to optimize it. The traces are written just once, in the initialization phase; after that they are only read. The external loop runs about 3,000 iterations, the internal one about 20. The if condition is sometimes true for 90% of the traces, and sometimes for just 10%.

I think I can't safely accumulate values into num_rp and num_ip that precede i_idx1, but I'm not totally sure. Finally, you ask: "can you back up i_idx1 to a cache line?" I don't understand this question. Unfortunately, the physics behind that piece of code requires very scattered access to the traces, so it is not possible to optimize it from the algorithmic point of view.

jimdempseyatthecove

Here is a suggestion:

1) Make two copies of each trace array: traces_rp, traces_rp1, traces_ip, traces_ip1, where the current traces_rp and traces_ip are cache-line aligned, and traces_rp1 and traces_ip1 are copies of traces_rp and traces_ip *** but starting at index [1]. IOW, for each (rp/ip) you have two aligned arrays (a sketch of this setup follows at the end of this post).

2) Set your aligned restrict pointers to the [0]'th index of each array, and replace the for(k=0 with for(k=i_idx1; ... This way the compiler knows that the bases of the arrays are aligned, and then, regardless of the starting index, it can generate a scalar peel loop that iterates up to the alignment index. The ...rp and ...rp1 arrays will always align at the same k, thus permitting the code to shift to the vector loads.

3) Caution: in 32-bit programming you have a very small number of general-purpose registers. Examine the code as before. You may need to break the inner loop into two loops (the compiler may do this for you; if not, do it by hand).

If the inner loop is ~20 iterations and your vector width is 8, then on average the scalar peel loop will iterate half the vector width, the full-vector loop will run the interior, and you may have a remainder to run in scalar. Note: if your instruction set is AVX2 (I believe it is), then the masked variants of the instructions can be used for the peel and remainder, potentially reducing the loop to three or four operations.

The only disadvantage is the ...rp and ...ip (and ...1) arrays now consume 2x the cache.

It is hard to tell without making test runs whether the increased vectorization will outweigh the loss of some cache.
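
Here is a sketch of the setup in point 1) (illustrative only; ns stands for the trace length, and the allocation and copy are done once, outside the hot loops):

#include <string.h>
#include <immintrin.h>   /* _mm_malloc / _mm_free */

/* Sketch: build cache-line aligned copies of a trace array, one starting at
   element 0 and one starting at element 1, so that traces_r[k] and
   traces_r1[k] (== original r[k+1]) are both reachable from aligned bases. */
static void make_shifted_copies(const float *src_r, const float *src_i, int ns,
                                float **traces_r, float **traces_r1,
                                float **traces_i, float **traces_i1)
{
    *traces_r  = (float *)_mm_malloc(ns * sizeof(float), 64);
    *traces_r1 = (float *)_mm_malloc(ns * sizeof(float), 64);
    *traces_i  = (float *)_mm_malloc(ns * sizeof(float), 64);
    *traces_i1 = (float *)_mm_malloc(ns * sizeof(float), 64);

    memcpy(*traces_r,  src_r,     ns * sizeof(float));
    memcpy(*traces_r1, src_r + 1, (ns - 1) * sizeof(float));   /* shifted by one */
    memcpy(*traces_i,  src_i,     ns * sizeof(float));
    memcpy(*traces_i1, src_i + 1, (ns - 1) * sizeof(float));   /* shifted by one */
}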

Jim Dempsey

unrue

Could you check whether I understood your suggestion correctly?

              float * restrict num_rp = &num_r[0];
              float * restrict num_ip = &num_i[0];

              float * restrict traces_rp  = &traces.r[0];
              float * restrict traces_rp1 = &traces_copy.r[1];
              float * restrict traces_ip  = &traces.i[0];
              float * restrict traces_ip1 = &traces_copy.i[1];
              
              int i_iter = i_idx2 - i_idx1; // # iterations
              // *** use loop control variable and index that is scoped inside the for loop
              int k;
              for(k = i_idx1; k < i_idx2; k++) {
                  num_rp[k - i_idx1] = num_rp[k - i_idx1] + (1.0f-up_interp)*(traces_rp[k]) + up_interp*traces_rp1[k];
                  num_ip[k - i_idx1] = num_ip[k - i_idx1] + (1.0f-up_interp)*(traces_ip[k]) + up_interp*traces_ip1[k];
              }
jimdempseyatthecove

No,

// num_r is an array of floats that is cache-line aligned
// num_i is an array of floats that is cache-line aligned
// traces_r is an array of floats that is cache-line aligned
// traces_r1 is an array of floats that is cache-line aligned...
//        and contains the floats traces_r[1] through traces_r[last_r] (whatever last_r is)
// traces_i is an array of floats that is cache-line aligned
// traces_i1 is an array of floats that is cache-line aligned...
//        and contains the floats traces_i[1] through traces_i[last_i] (whatever last_i is)

for(int k = i_idx1; k <= i_idx2; k++) {
    num_r[k] = num_r[k] + (1.0f-up_interp)*(traces_r[k]) + up_interp*traces_r1[k];
    num_i[k] = num_i[k] + (1.0f-up_interp)*(traces_i[k]) + up_interp*traces_i1[k];
}

The above will require at least 7 GP registers, so you may need to perform it in two loops (when in 32-bit mode; 64-bit should be OK):

for(int k = i_idx1; k <= i_idx2; k++) {
    num_r[k] = num_r[k] + (1.0f-up_interp)*(traces_r[k]) + up_interp*traces_r1[k];
}
for(int k = i_idx1; k <= i_idx2; k++) {
    num_i[k] = num_i[k] + (1.0f-up_interp)*(traces_i[k]) + up_interp*traces_i1[k];
}

Jim Dempsey
