topic Hi Jim, i did a terrible in Intel® Moderncode for Parallel Architectures

Shifted load

unrue — Fri, 11 Mar 2016 10:29:35 GMT

Deat Intel developers,

I'm using AVX whit Intel 15.0.1 compiler. I need to load some float shifted by one in order to do some AVX operations:

sample = traces.r + traces.r[j+1];

At the moment, I do two distinct AVX load:

intr_1= _mm256_load_ps(&traces.r);
intr_2= _mm256_load_ps(&traces.r[j+1]);

In this way, I calculate sample in two distinct vectorized phase, first part for J and second part for J+1. It works, but it is quiet slow. In fact, using the second load, I load seven elements I already have and just one new element. So, maybe I can do a sort of load, and left shift and finally a second load of one elements. What is the best strategy? Thanks in advance.

Your method with unaligned

TimP — Fri, 11 Mar 2016 11:06:01 GMT

Your method with unaligned loads is probably satisfactory for current CPU architecture (Haswell or newer). For Sandy Bridge, unaligned load will be intolerably slow, so you may want to consider permitting your compiler to choose the method, if you don't like to write in _mm128 loads to be combined by _mm256_insertf128_ps. AVX2 permutes appear better suited to your suggestion about shifts, but then you shouldn't be seeing such poor performance of unaligned load.

Ivy Bridge was designed specifically to alleviate the penalty incurred by unaligned mm256 loads.

The question seems more topical for https://software.intel.com/en-us/forums/intel-isa-extensions, but there is limited appeal in struggling with low level intrinsics code to optimize for an old ISA. After all, there is "moderncode" in the title of this forum.

Hi Tim,

unrue — Fri, 11 Mar 2016 11:42:39 GMT

Hi Tim,

I'm using Sandy Bridge andy my load are aligned (traces is 32 bytes aligned by using mm_malloc). But I've noted no differences between aligned or unaligned load.

If traces[n].r[0] is aligned

jimdempseyatthecove — Fri, 11 Mar 2016 17:54:36 GMT

If traces.r[0] is aligned to 32-bytes, then

_mm256_load_ps(&traces.r[0]) is aligned load of r[0], r[1],... r[7] and
_mm256_load_ps(&traces.r[1]) is unaligned, a split load of r[1],... r[7] and then r[8].

Jim Dempsey

It is almost always better to

McCalpinJohn — Fri, 11 Mar 2016 18:08:38 GMT

It is almost always better to reload the data than to use the permute functions, even with the cache-line-crossing and page-crossing penalties on Sandy Bridge. It is possible to make the page-crossing cases run faster by using permute, but it is a lot of effort to generate special versions for all possible alignments.

With avx load, performance

TimP — Fri, 11 Mar 2016 19:30:30 GMT

With avx load, performance varies with data alignment but not whether aligned or unaligned intrinsic is issued. mm128 loads may not lose performance with misalignment.

Quote:jimdempseyatthecove

unrue — Sat, 12 Mar 2016 08:59:00 GMT

jimdempseyatthecove wrote:

If traces.r[0] is aligned to 32-bytes, then

_mm256_load_ps(&traces.r[0]) is aligned load of r[0], r[1],... r[7] and
_mm256_load_ps(&traces.r[1]) is unaligned, a split load of r[1],... r[7] and then r[8].

Jim Dempsey

Hi Jim, this is true, in fact, I do a manual unroll of step 8:

for(j,........, j+=8) {

_mm256_load_ps(&traces.r) 
_mm256_load_ps(&traces.r[j+1]) 

}

I have for each iteration one load aligned and one not. How can I do to have all load aligned?

I think it would be more

jimdempseyatthecove — Sat, 12 Mar 2016 13:13:38 GMT

I think it would be more productive for you and the responders to see a larger picture of what you are trying to do.

From the little of what you have shown, you last post contains too little of what you are trying to do for any of us to offer productive advice. We do not see the outputs. IOW, are you producing a vector, one element shorter than the input where each output elements has each element the sum of the adjacent input vector? Or are you intending to perform a sum reduction (sum of all elements in the input vector)? Or something entirely different. If the first case, is the output vector the input vector, if so, can this be made to different output vector?

To answer your question, sketch out what you are asking:

cache lines |0123456789ABCDEF|0123456789ABCDEF|0123456789ABCDEF|...
j=0 _mm256   01234567        |                |
j+1 _mm256    01234567       |                |
j=8 _mm256           01234567|                |
j+1 _mm256            0123456|7               |
j=16 _mm256                  |01234567        |
j+1  _mm256                  | 01234567       |
...

Where | is the cache line interval. When mod(j,16)==0 then j+1 is within the same cache line as j, however, when mod(j,16)==8, then j+1 will cross cache line.

Good advice for the narrow scope is likely to provide bad advice for the larger scope.

Jim Dempsey

Hi Jim, thanks for the

unrue — Mon, 14 Mar 2016 08:08:46 GMT

Hi Jim,

thanks for the explanations. I post more source code to explain better my goal:

for(n = *begin_ntrace; n < *end_ntrace; n++) {

    r_idx   = tt;
    i_idx1  = (int)r_idx;
    i_idx2  = i_idx1 + dms;

    for(j = i_idx1, k = 0; j < i_idx2; j++, k++) {
        sample.r = traces.r + traces.r[j+1];
        sample.i = traces.i + traces.i[j+1];
   
        num.r += sample.r;
        num.i += sample.i;

    }

}

where appears j and j+1 memory access.

Now you have changed the

TimP — Mon, 14 Mar 2016 14:12:00 GMT

Sorry, I misread the posted code as nested loops, seemingly implying a sum reduction. I'm concerned about writing source code with multiple variables in the for() field not only due to this possibility of misreading, but also an account of needing to always read compiler reports to see the effect. If you happen to be compiling in 32-bit mode, it's of utmost importance that the compiler combine multiple pointer references into a single indexing register with offsets. A confirmation of vectorization would be sufficient to indicate this has happened, but use of simd intrinsics may obscure it.

As you have mentioned the hope that keeping the duplicate memory references at register level might boost performance, I would like to mention that this seems most likely to work for data in L1 cache, when you might be coming up against the Sandy Bridge limit of 2 128-bit reads and 1 128-bit store to L1 cache per clock cycle. If your data runs are long enough to go to memory, the combining effect of last level cache and read stream buffers should prevent repeated memory references from absorbing memory bandwidth.

TimP,

jimdempseyatthecove — Mon, 14 Mar 2016 15:27:26 GMT

TimP,

This isn't a sum reduction, it is more like a raster (or bin) accumulation. The output isn't a scalar sum, rather it is a vector of sums of vectors (a vector and its +1 neighbor).

What is a representative value for dms? IOW what is a representative trip count for the inner loop?

Do you have a pattern for i_idx1 for each n? IOW is it always a multiple of some number or is it somewhat random?

A major inhibitor of vectorization for the above code is the output array num as being a structure. It would be better if you made two arrays, and accumulated as

for(n = *begin_ntrace; n < *end_ntrace; n++) {
    r_idx   = tt;
    i_idx1  = (int)r_idx;
    i_idx2  = i_idx1 + dms;
    for(j = i_idx1, k = 0; j < i_idx2; j++, k++) {
        num_r += traces.r + traces.r[j+1]; // vector of r to vector of r
        num_i += traces.i + traces.i[j+1]; // vector of i to vector of i
    }
}

Jim Dempsey

And if need be, follow the

jimdempseyatthecove — Mon, 14 Mar 2016 15:32:16 GMT

And if need be, follow the above with:

for(k=0; k<dms; ++k) {
  num.r = num_r;
  num.i = num_i;
}

Jim Dempsey

Quote:jimdempseyatthecove]

unrue — Mon, 14 Mar 2016 15:33:00 GMT

jimdempseyatthecove wrote:

TimP,

This isn't a sum reduction, it is more like a raster (or bin) accumulation. The output isn't a scalar sum, rather it is a vector of sums of vectors (a vector and its +1 neighbor).

jimdempseyatthecove wrote:

What is a representative value for dms? IOW what is a representative trip count for the inner loop?

Typical value of dms is from 10 to 30 and around this.

jimdempseyatthecove wrote:

Do you have a pattern for i_idx1 for each n? IOW is it always a multiple of some number or is it somewhat random?

Unfortunately not, no pattern are associated to i_idx1. Typical trip count is about 20. It is very small loop called many and many times inside a function.
jimdempseyatthecove wrote:

A major inhibitor of vectorization for the above code is the output array num as being a structure. It would be better if you made two arrays, and accumulated as
for(n = *begin_ntrace; n < *end_ntrace; n++) {
    r_idx   = tt;
    i_idx1  = (int)r_idx;
    i_idx2  = i_idx1 + dms;
    for(j = i_idx1, k = 0; j < i_idx2; j++, k++) {
        num_r += traces.r + traces.r[j+1]; // vector of r to vector of r
        num_i += traces.i + traces.i[j+1]; // vector of i to vector of i
    }
}
Jim Dempsey
Thanks Jim, I'll try it.

I don't have all of the

McCalpinJohn — Mon, 14 Mar 2016 19:29:46 GMT

I don't have all of the numbers handy, but if I recall correctly the Sandy Bridge core pays a 3 cycle penalty for 256-bit loads that cross a cache line boundary and a 1-cycle penalty for 128-bit loads that cross a cache line boundary. So you definitely don't want 256-bit loads in this case.

It is not obvious to me whether you will be better off with (128-bit loads/stores and 128-bit arithmetic) or with (128-bit loads/stores and 256-bit arithmetic). There will probably not be a lot of difference in this case -- the execution time should be limited by the loads and stores, with plenty of time to fully overlap the arithmetic instruction execution.

Quote:jimdempseyatthecove

unrue — Tue, 15 Mar 2016 12:56:31 GMT

jimdempseyatthecove wrote:

And if need be, follow the above with:
for(k=0; k<dms; ++k) {
  num.r = num_r;
  num.i = num_i;
}
Jim Dempsey

Hi Jim,

using your suggestion, code is about 15% faster.

That is a good start.

jimdempseyatthecove — Tue, 15 Mar 2016 15:33:23 GMT

That is a good start.

Next adventure for you is to run the release code under VTune, not necessarily to find the bottlenecks (at this time). Instead, use Bottom Up listing, find the function, then open the Assembly window.

You do not need to fully understand the Assembly instruction syntax. You just need to understand some simple rules:

a) In the source window, highlight the lines of interest (e.g. the for(j= loop in post #11, select/highlight lines 5:8)
b) Locate the highlighted lines in the Assembly window (you may need to sort by source line number by clicking on the "Line" header in the Assembly window).
c) Assess the number of assembly instructions, after a few of these exercizes you will get a feel of "too many" or "about right". One of the issues that can arrise is loop invarient code not getting moved out of the loop. For this loop, you would expect the compiler to locate "num_r", "num_i", "traces.r" and "traces.i" into registers. If you see memory fetches (those with "ptr [...]" then these addresses are not registerized. If the inner loop does not have the array addresses registerized, then you may need to add within the "for(n=" loop, scoped restrict pointers to those arrays, code to use them, VTune and examine Assembly again.
d) While one can use the vectorization report, I prefer to see the Assembly to see if there is "efficient" vectorization. In this case you are looking for the xmm or ymm instructions (xmm is 128 bit, ymm is 256 bit). Note, John McCalpin pointed out that on some CPU's 128 bit can be faster. And more importantly the instructions end in ...ps or ...pd for floating point (packed single, packed double), or are of the form v... or p... for integer.

After you get the instruction count down to "about right", then you can look at the metrics offered by VTune to see if there is something else you can learn. For example, if you are building x32 applications, you have a limited number of registers available. If the for(j= loop runs out of registers, it may have to load the array addresses from memory. In this situation, making two loops, one for r and one for i, may be faster. On x64, the register usage on the for(j= should not be an issue (verify this assumption by looking at the assembly code).

You do not need to the the Assembly checkout for all code, just your few hot spots.

Last note:

It has been my experience that when you manage to wipe out a hot spot (making the loops run multiples of times faster), that you almost always fine a now new hot spot that didn't seem all that important before. Therefore, expect to iterate working away at a series of new hot spots.

Jim Dempsey

BTW for

jimdempseyatthecove — Tue, 15 Mar 2016 15:41:21 GMT

BTW for

    for(j = i_idx1, k = 0; j < i_idx2; j++, k++) {
        num_r += traces.r + traces.r[j+1]; // vector of r to vector of r
        num_i += traces.i + traces.i[j+1]; // vector of i to vector of i
    }

You would expect to see 6 "... ptr[...]" instructions (4 reads, 2 writes). Though you may see double this if the compiler unrolled it once (or multiples with higher unroll count).

Jim Dempsey

Quote:jimdempseyatthecove

unrue — Wed, 16 Mar 2016 08:52:49 GMT

jimdempseyatthecove wrote:

That is a good start.

Next adventure for you is to run the release code under VTune, not necessarily to find the bottlenecks (at this time). Instead, use Bottom Up listing, find the function, then open the Assembly window.

You do not need to fully understand the Assembly instruction syntax. You just need to understand some simple rules:

a) In the source window, highlight the lines of interest (e.g. the for(j= loop in post #11, select/highlight lines 5:8)
b) Locate the highlighted lines in the Assembly window (you may need to sort by source line number by clicking on the "Line" header in the Assembly window).
c) Assess the number of assembly instructions, after a few of these exercizes you will get a feel of "too many" or "about right". One of the issues that can arrise is loop invarient code not getting moved out of the loop. For this loop, you would expect the compiler to locate "num_r", "num_i", "traces.r" and "traces.i" into registers. If you see memory fetches (those with "ptr [...]" then these addresses are not registerized. If the inner loop does not have the array addresses registerized, then you may need to add within the "for(n=" loop, scoped restrict pointers to those arrays, code to use them, VTune and examine Assembly again.
d) While one can use the vectorization report, I prefer to see the Assembly to see if there is "efficient" vectorization. In this case you are looking for the xmm or ymm instructions (xmm is 128 bit, ymm is 256 bit). Note, John McCalpin pointed out that on some CPU's 128 bit can be faster. And more importantly the instructions end in ...ps or ...pd for floating point (packed single, packed double), or are of the form v... or p... for integer.

After you get the instruction count down to "about right", then you can look at the metrics offered by VTune to see if there is something else you can learn. For example, if you are building x32 applications, you have a limited number of registers available. If the for(j= loop runs out of registers, it may have to load the array addresses from memory. In this situation, making two loops, one for r and one for i, may be faster. On x64, the register usage on the for(j= should not be an issue (verify this assumption by looking at the assembly code).

You do not need to the the Assembly checkout for all code, just your few hot spots.

Last note:

It has been my experience that when you manage to wipe out a hot spot (making the loops run multiples of times faster), that you almost always fine a now new hot spot that didn't seem all that important before. Therefore, expect to iterate working away at a series of new hot spots.

Jim Dempsey

Hi Jim

I attach the assembly from VTune. From what I understand, there are some values not registerized so I tried this code by using restrict ponters:

         float * restrict num_rp = &num_r[0];
         float * restrict num_ip = &num_i[0];
         float * restrict traces_rp = &traces.r[i_idx1];
         float * restrict traces_ip = &traces.i[i_idx1];

         for(j = i_idx1, k = 0; j < i_idx2; j++, k++) {
             *num_rp = *num_rp + (1.0f-up_interp)*(*traces_rp) + up_interp*(*(traces_rp+1));
             *num_ip = *num_ip + (1.0f-up_interp)*(*traces_ip) + up_interp*(*(traces_ip+1));
              num_rp++;
              num_ip++;
              traces_rp++;
              traces_ip++;
         }

but I obtained no performance gain.

It's hard to read from your

TimP — Wed, 16 Mar 2016 17:15:00 GMT

It's hard to read from your screen shots; I would think that saving asm code and finding the hot loops might be easier. From what I can see, it does seem that the compiler may not have combined your pointers into loop carried register variables; as you appear to need 6 integer or pointer registers, failure to perform such combination would prevent optimization for 32-bit mode, where at most 3 such registers are available. Avoiding this trap appears to be among the advantages, besides readability, of array indexing notation num_rp ....

Intel C++ has a habit anyway of failing to optimize where there are too many of those post-increments at the bottom of the loop.

The compiler usually

jimdempseyatthecove — Wed, 16 Mar 2016 19:59:00 GMT

The compiler usually generates much better code using subscripted restrict pointes as opposed to pointer++, besides, incrementing one index register is faster than indexing 4 pointers. Use this:

float * restrict num_rp = &num_r[0];
float * restrict num_ip = &num_i[0];
float * restrict traces_rp = &traces.r[i_idx1];
float * restrict traces_ip = &traces.i[i_idx1];
int i_iter = i_idx2 - i_idx1 + 1; // # iterations
// *** use loop control variable and index that is scoped inside the for loop
for(int k = 0; k < i_itr; k++) {
    num_rp = num_rp + (1.0f-up_interp)*(traces_rp) + up_interp*traces_rp[k+1];
    num_ip = num_ip + (1.0f-up_interp)*(traces_ip) + up_interp*traces_ip[k+1];
}

On 32-bit system, the above should require 4 registers for your float*'s and 1 register each for i_iter and k for a total of 6 GP registers.

If the above does not fully registerize (usually due to the compiler thinking something outside the scope of what shown above is more important) then enclose the above in {...}. Yes, this do would appear to be meaningless, however, the meaning is that the 4 float*'s and the variable i_iter are disposable (not used outside the scope). This can improve register usage, especially on x32 builds.

Jim Dempsey