Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Shifted load

unrue
Beginner
3,267 Views

Dear Intel developers,

 

I'm using AVX with the Intel 15.0.1 compiler. I need to load some floats shifted by one in order to do some AVX operations:

 

sample = traces.r[j] + traces.r[j+1];

 

 

At the moment, I do two distinct AVX load:

 

intr_1 = _mm256_load_ps(&traces.r[j]);
intr_2 = _mm256_load_ps(&traces.r[j+1]);

In this way, I calculate sample in two distinct vectorized phases, the first part for j and the second part for j+1. It works, but it is quite slow. In fact, with the second load I reload seven elements I already have and just one new element. So maybe I could do a single load, a left shift by one element, and finally a load of just the one new element. What is the best strategy? Thanks in advance.

0 Kudos
32 Replies
TimP
Honored Contributor III
2,144 Views

Your method with unaligned loads is probably satisfactory for current CPU architectures (Haswell or newer).  For Sandy Bridge, unaligned 256-bit loads can be intolerably slow, so you may want to let your compiler choose the method, unless you want to write 128-bit loads yourself and combine them with _mm256_insertf128_ps.  AVX2 permutes appear better suited to your suggestion about shifts, but on AVX2 hardware you shouldn't be seeing such poor performance from unaligned loads anyway.
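
To illustrate, here is a minimal, untested sketch of combining two 128-bit loads (the helper name is my own, not from your code):

#include <immintrin.h>

/* Assemble a __m256 from two 128-bit halves; on Sandy Bridge this is often
   cheaper than a single unaligned 256-bit load that splits across cache lines. */
static inline __m256 load_2x128(const float *p)
{
    __m128 lo = _mm_loadu_ps(p);      /* elements p[0..3] */
    __m128 hi = _mm_loadu_ps(p + 4);  /* elements p[4..7] */
    return _mm256_insertf128_ps(_mm256_castps128_ps256(lo), hi, 1);
}

A compiler targeting AVX will often emit essentially this sequence on its own for unaligned accesses, which is why letting it choose can be the simpler route.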

Ivy Bridge was designed specifically to alleviate the penalty incurred by unaligned mm256 loads.

The question seems more topical for https://software.intel.com/en-us/forums/intel-isa-extensions, but there is limited appeal in struggling with low level intrinsics code to optimize for an old ISA.  After all, there is "moderncode" in the title of this forum.

0 Kudos
unrue
Beginner
2,144 Views

Hi Tim,

I'm using Sandy Bridge and my loads are aligned (traces is 32-byte aligned by using _mm_malloc). But I've noticed no difference between aligned and unaligned loads.

0 Kudos
jimdempseyatthecove
Honored Contributor III
2,144 Views

If traces.r[0] is aligned to 32-bytes, then

_mm256_load_ps(&traces.r[0]) is aligned load of r[0], r[1],... r[7] and
_mm256_load_ps(&traces.r[1]) is unaligned, a split load of r[1],... r[7] and then r[8].
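
A side note (an assumption about your build, not something visible from the snippet): if an address may not be 32-byte aligned, the unaligned intrinsic is the safe spelling, for example

__m256 intr_2 = _mm256_loadu_ps(&traces.r[j+1]); /* unaligned/split load; _mm256_load_ps on an unaligned address may fault */

The hardware still performs the split when the access crosses a cache line, but the intrinsic at least documents the intent.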

Jim Dempsey

0 Kudos
McCalpinJohn
Honored Contributor III
2,144 Views

It is almost always better to reload the data than to use the permute functions, even with the cache-line-crossing and page-crossing penalties on Sandy Bridge.  It is possible to make the page-crossing cases run faster by using permutes, but it is a lot of effort to generate special versions for all possible alignments.
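
In other words, for this case the straightforward reload is roughly the following (a sketch using the variable names from the original post):

__m256 a = _mm256_load_ps(&traces.r[j]);    /* aligned when j is a multiple of 8 */
__m256 b = _mm256_loadu_ps(&traces.r[j+1]); /* overlapping reload, possibly split across lines */
__m256 s = _mm256_add_ps(a, b);             /* sample for j..j+7 */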

0 Kudos
TimP
Honored Contributor III
2,144 Views

With AVX loads, performance varies with the data alignment but not with whether the aligned or unaligned intrinsic is issued. 128-bit loads may not lose performance with misalignment.

0 Kudos
unrue
Beginner
2,144 Views

jimdempseyatthecove wrote:

If traces.r[0] is aligned to 32-bytes, then

_mm256_load_ps(&traces.r[0]) is aligned load of r[0], r[1],... r[7] and
_mm256_load_ps(&traces.r[1]) is unaligned, a split load of r[1],... r[7] and then r[8].

Jim Dempsey

Hi Jim, that's true; in fact, I do a manual unroll with a step of 8:

for(j,........, j+=8) {

_mm256_load_ps(&traces.r[j]) 
_mm256_load_ps(&traces.r[j+1]) 

}

For each iteration I have one aligned load and one unaligned load. How can I make all loads aligned?

0 Kudos
jimdempseyatthecove
Honored Contributor III
2,144 Views

I think it would be more productive for you and the responders to see a larger picture of what you are trying to do.

From the little you have shown, your last post contains too little of what you are trying to do for any of us to offer productive advice. We do not see the outputs. IOW, are you producing a vector, one element shorter than the input, where each output element is the sum of adjacent input elements? Or are you intending to perform a sum reduction (sum of all elements in the input vector)? Or something entirely different? If the first case, is the output vector the same as the input vector, and if so, can it be made a different output vector?

To answer your question, sketch out what you are asking:

cache lines |0123456789ABCDEF|0123456789ABCDEF|0123456789ABCDEF|...
j=0 _mm256   01234567        |                |
j+1 _mm256    01234567       |                |
j=8 _mm256           01234567|                |
j+1 _mm256            0123456|7               |
j=16 _mm256                  |01234567        |
j+1  _mm256                  | 01234567       |
...

Where | marks a cache line boundary. When mod(j,16)==0, then j+1 is within the same cache line as j; however, when mod(j,16)==8, the j+1 load will cross a cache line boundary.
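
Expressed as a quick check (a sketch; it assumes 64-byte cache lines and a 32-byte load, with uintptr_t from <stdint.h>):

/* a 32-byte (8-float) load starting at this address crosses a 64-byte cache
   line exactly when its byte offset within the line is greater than 32 */
int crosses_line = (((uintptr_t)&traces.r[j+1]) & 63) > 32;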

Good advice for the narrow scope is likely to provide bad advice for the larger scope.

Jim Dempsey

0 Kudos
unrue
Beginner
2,144 Views

Hi Jim,

thanks for the explanations. I'm posting more source code to better explain my goal:

 

for(n = *begin_ntrace; n < *end_ntrace; n++) {

    r_idx   = tt[n];
    i_idx1  = (int)r_idx;
    i_idx2  = i_idx1 + dms;

    for(j = i_idx1, k = 0; j < i_idx2; j++, k++) {
        sample.r = traces.r[j] + traces.r[j+1];
        sample.i = traces.i[j] + traces.i[j+1];
   
        num.r[k] += sample.r;
        num.i[k] += sample.i;

    }

}

where the j and j+1 memory accesses appear.

0 Kudos
TimP
Honored Contributor III
2,144 Views

Sorry, I misread the posted code as nested loops, seemingly implying a sum reduction.  I'm concerned about writing source code with multiple variables in the for() field not only due to this possibility of misreading, but also on account of needing to always read compiler reports to see the effect.  If you happen to be compiling in 32-bit mode, it's of utmost importance that the compiler combine multiple pointer references into a single indexing register with offsets.  A confirmation of vectorization would be sufficient to indicate this has happened, but the use of SIMD intrinsics may obscure it.

As you have mentioned the hope that keeping the duplicate memory references at register level might boost performance, I would like to mention that this seems most likely to work for data in L1 cache, when you might be coming up against the Sandy Bridge limit of 2 128-bit reads and 1 128-bit store to L1 cache per clock cycle.  If your data runs are long enough to go to memory, the combining effect of last level cache and read stream buffers should prevent repeated memory references from absorbing memory bandwidth.

0 Kudos
jimdempseyatthecove
Honored Contributor III
2,144 Views

TimP,

This isn't a sum reduction, it is more like a raster (or bin) accumulation. The output isn't a scalar sum, rather it is a vector of sums of vectors (a vector and its +1 neighbor).

What is a representative value for dms? IOW what is a representative trip count for the inner loop?

Do you have a pattern for i_idx1 for each n? IOW is it always a multiple of some number or is it somewhat random?

A major inhibitor of vectorization for the above code is that the output num is a structure. It would be better if you made two plain arrays and accumulated as

for(n = *begin_ntrace; n < *end_ntrace; n++) {
    r_idx   = tt[n];
    i_idx1  = (int)r_idx;
    i_idx2  = i_idx1 + dms;
    for(j = i_idx1, k = 0; j < i_idx2; j++, k++) {
        num_r[k] += traces.r[j] + traces.r[j+1]; // vector of r to vector of r
        num_i[k] += traces.i[j] + traces.i[j+1]; // vector of i to vector of i
    }
}
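
If num_r and num_i do not already exist, something along these lines would do (a sketch; dms_max is a hypothetical upper bound on dms):

#include <string.h>  /* memset */

float *num_r = (float *)_mm_malloc(dms_max * sizeof(float), 32); /* 32-byte aligned for AVX */
float *num_i = (float *)_mm_malloc(dms_max * sizeof(float), 32);
memset(num_r, 0, dms_max * sizeof(float));
memset(num_i, 0, dms_max * sizeof(float));
/* ... run the loops above, copy back into num if needed, then _mm_free(num_r); _mm_free(num_i); */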

Jim Dempsey

0 Kudos
jimdempseyatthecove
Honored Contributor III
2,144 Views

And if need be, follow the above with:

for(k=0; k<dms; ++k) {
  num.r[k] = num_r[k];
  num.i[k] = num_i[k];
}

Jim Dempsey

0 Kudos
unrue
Beginner
2,144 Views

jimdempseyatthecove wrote:

TimP,

This isn't a sum reduction, it is more like a raster (or bin) accumulation. The output isn't a scalar sum, rather it is a vector of sums of vectors (a vector and its +1 neighbor).

jimdempseyatthecove wrote:

What is a representative value for dms? IOW what is a representative trip count for the inner loop?

 

Typical values of dms are from 10 to 30, or thereabouts.

jimdempseyatthecove wrote:

Do you have a pattern for i_idx1 for each n? IOW is it always a multiple of some number or is it somewhat random?

Unfortunately not, no pattern is associated with i_idx1.  The typical trip count is about 20. It is a very small loop called many, many times inside a function.

 

jimdempseyatthecove wrote:

A major inhibitor of vectorization for the above code is the output array num as being a structure. It would be better if you made two arrays, and accumulated as

for(n = *begin_ntrace; n < *end_ntrace; n++) {
    r_idx   = tt[n];
    i_idx1  = (int)r_idx;
    i_idx2  = i_idx1 + dms;
    for(j = i_idx1, k = 0; j < i_idx2; j++, k++) {
        num_r[k] += traces.r[j] + traces.r[j+1]; // vector of r to vector of r
        num_i[k] += traces.i[j] + traces.i[j+1]; // vector of i to vector of i
    }
}

Jim Dempsey

Thanks Jim, I'll try it.

0 Kudos
McCalpinJohn
Honored Contributor III
2,142 Views

I don't have all of the numbers handy, but if I recall correctly the Sandy Bridge core pays a 3 cycle penalty for 256-bit loads that cross a cache line boundary and a 1-cycle penalty for 128-bit loads that cross a cache line boundary.  So you definitely don't want 256-bit loads in this case.

It is not obvious to me whether you will be better off with (128-bit loads/stores and 128-bit arithmetic) or with (128-bit loads/stores and 256-bit arithmetic).  There will probably not be a lot of difference in this case -- the execution time should be limited by the loads and stores, with plenty of time to fully overlap the arithmetic instruction execution.
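
To make the second option concrete, 128-bit loads feeding 256-bit arithmetic would look something like this (a rough sketch, my variable names):

/* assemble the j stream from two 128-bit loads */
__m256 a = _mm256_insertf128_ps(_mm256_castps128_ps256(_mm_loadu_ps(&traces.r[j])),
                                _mm_loadu_ps(&traces.r[j+4]), 1);
/* assemble the j+1 stream the same way */
__m256 b = _mm256_insertf128_ps(_mm256_castps128_ps256(_mm_loadu_ps(&traces.r[j+1])),
                                _mm_loadu_ps(&traces.r[j+5]), 1);
__m256 s = _mm256_add_ps(a, b);  /* one 256-bit add covers both halves */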

0 Kudos
unrue
Beginner
2,144 Views

jimdempseyatthecove wrote:

And if need be, follow the above with:

for(k=0; k<dms; ++k) {
  num.r[k] = num_r[k];
  num.i[k] = num_i[k];
}

Jim Dempsey

 

Hi Jim,

using your suggestion, the code is about 15% faster. 

0 Kudos
jimdempseyatthecove
Honored Contributor III
2,144 Views

That is a good start.

Next adventure for you is to run the release code under VTune, not necessarily to find the bottlenecks (at this time). Instead, use Bottom Up listing, find the function, then open the Assembly window.

You do not need to fully understand the Assembly instruction syntax. You just need to understand some simple rules:

a) In the source window, highlight the lines of interest (e.g. the for(j= loop in post #11, select/highlight lines 5:8)
b) Locate the highlighted lines in the Assembly window (you may need to sort by source line number by clicking on the "Line" header in the Assembly window).
c) Assess the number of assembly instructions; after a few of these exercises you will get a feel for "too many" or "about right". One of the issues that can arise is loop-invariant code not getting moved out of the loop. For this loop, you would expect the compiler to locate "num_r", "num_i", "traces.r" and "traces.i" in registers. If you see memory fetches (those with "ptr [...]"), then these addresses are not registerized. If the inner loop does not have the array addresses registerized, then you may need to add, within the "for(n=" loop, scoped restrict pointers to those arrays, rewrite the code to use them, then run VTune and examine the Assembly again.
d) While one can use the vectorization report, I prefer to see the Assembly to judge whether there is "efficient" vectorization. In this case you are looking for the xmm or ymm instructions (xmm is 128-bit, ymm is 256-bit). Note, John McCalpin pointed out that on some CPUs 128-bit can be faster. More importantly, the instructions end in ...ps or ...pd for floating point (packed single, packed double), or are of the form v... or p... for integer.

After you get the instruction count down to "about right", then you can look at the metrics offered by VTune to see if there is something else you can learn. For example, if you are building x32 applications, you have a limited number of registers available. If the for(j= loop runs out of registers, it may have to load the array addresses from memory. In this situation, making two loops, one for r and one for i, may be faster. On x64, the register usage on the for(j= should not be an issue (verify this assumption by looking at the assembly code).

You do not need to do the Assembly checkout for all code, just your few hot spots.

Last note:

It has been my experience that when you manage to wipe out a hot spot (making the loops run multiples of times faster), you almost always find a new hot spot that didn't seem all that important before. Therefore, expect to iterate, working away at a series of new hot spots.

Jim Dempsey

0 Kudos
jimdempseyatthecove
Honored Contributor III
2,144 Views

BTW for

    for(j = i_idx1, k = 0; j < i_idx2; j++, k++) {
        num_r[k] += traces.r[j] + traces.r[j+1]; // vector of r to vector of r
        num_i[k] += traces.i[j] + traces.i[j+1]; // vector of i to vector of i
    }

You would expect to see 6 "... ptr[...]" instructions (4 reads, 2 writes). Though you may see double this if the compiler unrolled it once (or multiples with higher unroll count).

Jim Dempsey

0 Kudos
unrue
Beginner
2,144 Views

jimdempseyatthecove wrote:

That is a good start.

Next adventure for you is to run the release code under VTune, not necessarily to find the bottlenecks (at this time). Instead, use Bottom Up listing, find the function, then open the Assembly window.

You do not need to fully understand the Assembly instruction syntax. You just need to understand some simple rules:

a) In the source window, highlight the lines of interest (e.g. the for(j= loop in post #11, select/highlight lines 5:8)
b) Locate the highlighted lines in the Assembly window (you may need to sort by source line number by clicking on the "Line" header in the Assembly window).
c) Assess the number of assembly instructions; after a few of these exercises you will get a feel for "too many" or "about right". One of the issues that can arise is loop-invariant code not getting moved out of the loop. For this loop, you would expect the compiler to locate "num_r", "num_i", "traces.r" and "traces.i" in registers. If you see memory fetches (those with "ptr [...]"), then these addresses are not registerized. If the inner loop does not have the array addresses registerized, then you may need to add, within the "for(n=" loop, scoped restrict pointers to those arrays, rewrite the code to use them, then run VTune and examine the Assembly again.
d) While one can use the vectorization report, I prefer to see the Assembly to judge whether there is "efficient" vectorization. In this case you are looking for the xmm or ymm instructions (xmm is 128-bit, ymm is 256-bit). Note, John McCalpin pointed out that on some CPUs 128-bit can be faster. More importantly, the instructions end in ...ps or ...pd for floating point (packed single, packed double), or are of the form v... or p... for integer.

After you get the instruction count down to "about right", then you can look at the metrics offered by VTune to see if there is something else you can learn. For example, if you are building x32 applications, you have a limited number of registers available. If the for(j= loop runs out of registers, it may have to load the array addresses from memory. In this situation, making two loops, one for r and one for i, may be faster. On x64, the register usage on the for(j= should not be an issue (verify this assumption by looking at the assembly code).

You do not need to do the Assembly checkout for all code, just your few hot spots.

Last note:

It has been my experience that when you manage to wipe out a hot spot (making the loops run multiples of times faster), you almost always find a new hot spot that didn't seem all that important before. Therefore, expect to iterate, working away at a series of new hot spots.

Jim Dempsey

Hi Jim

I attach the assembly from VTune. From what I understand, there are some values not registerized, so I tried this code using restrict pointers:

         float * restrict num_rp = &num_r[0];
         float * restrict num_ip = &num_i[0];
         float * restrict traces_rp = &traces.r[i_idx1];
         float * restrict traces_ip = &traces.i[i_idx1];

         for(j = i_idx1, k = 0; j < i_idx2; j++, k++) {
             *num_rp = *num_rp + (1.0f-up_interp)*(*traces_rp) + up_interp*(*(traces_rp+1));
             *num_ip = *num_ip + (1.0f-up_interp)*(*traces_ip) + up_interp*(*(traces_ip+1));
              num_rp++;
              num_ip++;
              traces_rp++;
              traces_ip++;
         }
         

but I obtained no performance gain.

 

0 Kudos
TimP
Honored Contributor III
2,144 Views

It's hard to read from your screen shots; I would think that saving the asm code and finding the hot loops might be easier.  From what I can see, it does seem that the compiler may not have combined your pointers into loop-carried register variables; as you appear to need 6 integer or pointer registers, failure to perform such combination would prevent optimization for 32-bit mode, where at most 3 such registers are available.  Avoiding this trap appears to be among the advantages, besides readability, of array indexing notation such as num_rp[k].

Intel C++ has a habit anyway of failing to optimize where there are too many of those post-increments at the bottom of the loop.

0 Kudos
jimdempseyatthecove
Honored Contributor III
2,144 Views

The compiler usually generates much better code using subscripted restrict pointers as opposed to pointer++; besides, incrementing one index register is faster than incrementing 4 pointers. Use this:

float * restrict num_rp = &num_r[0];
float * restrict num_ip = &num_i[0];
float * restrict traces_rp = &traces.r[i_idx1];
float * restrict traces_ip = &traces.i[i_idx1];
int i_iter = i_idx2 - i_idx1; // # iterations
// *** use loop control variable and index that is scoped inside the for loop
for(int k = 0; k < i_iter; k++) {
    num_rp[k] = num_rp[k] + (1.0f-up_interp)*traces_rp[k] + up_interp*traces_rp[k+1];
    num_ip[k] = num_ip[k] + (1.0f-up_interp)*traces_ip[k] + up_interp*traces_ip[k+1];
}

On a 32-bit system, the above should require 4 registers for your float*'s and 1 register each for i_iter and k, for a total of 6 GP registers.

If the above does not fully registerize (usually because the compiler thinks something outside the scope of what is shown above is more important), then enclose the above in {...}. Yes, this would appear to be meaningless; however, the meaning is that the 4 float*'s and the variable i_iter are disposable (not used outside the scope). This can improve register usage, especially on x32 builds.
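
For example (generic shape only, a sketch using the names already introduced in this thread):

{   /* the braces limit the lifetime of these locals to this block */
    float * restrict p = &num_r[0];
    float * restrict q = &traces.r[i_idx1];
    int i_iter = i_idx2 - i_idx1;
    for(int k = 0; k < i_iter; k++)
        p[k] += (1.0f - up_interp) * q[k] + up_interp * q[k+1];
}   /* p, q and i_iter are dead here, so their registers can be reused */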

Jim Dempsey

0 Kudos
jimdempseyatthecove
Honored Contributor III
2,035 Views

FWIW, the IA32 instruction set (as well as Intel64) has SIB (Scale, Index, Base) addressing.

Meaning that traces_rp[k+1] requires only 1 instruction to fetch or store, or to use as a memory operand of an add/sub/div instruction.

In the above, the instruction would contain "0x04(%rBase,%rIndex,4)" as either the source or destination of the instruction.

By using the pointer++ (4 times), you are complicating the issue.

Jim Dempsey

0 Kudos