Why both Vectorized (Remainder) && Remainder ?

velvia · ‎01-28-2016

Hi,

In Intel Advisor 2016, one of my loop generate the following report

- Vectorized (Body), Self time = 0.104s

- Vectorized (Remainder), Self time = 0.076s

- Remainder, Self time = 0.040 s

I have managed to artificially increase the size of the arrays so the code does not spend any time in the remainder loop. But I am still quite puzzled by this report. I understand that remainder loops are usually serial and that they might be vectorized (my guess is that they use masked operations). But I don't understand at all why the code can spend time in both a "vectorized" remainder and a "serial" remainder for the same loop.

Thanks for your lights on that matter.

TimP · ‎01-29-2016

Depending on your target ISA and compiler cost analysis, the vectorized remainder may take care only of vector length sized array sections, with the smaller remainder still using serial execution. Masked stores, even if available, may be slower for some targets than the serial code. Also, use of masked stores may depend on setting the safe-padding option.

velvia · ‎01-29-2016

Hi Tim,

I think that I got it. Correct me if I am wrong. There are 2 kinds of remainders :

- One which is due to the fact that the vectorized body loop is unrolled.

- One which is due to the fact the the number of iterations of the loop is not a multiple of a vector length

The first one is usually vectorized whereas the second one is usually not.

Zakhar_M_Intel1 · ‎01-29-2016

Velvia,

Let's forget about unrolling for a moment, because vectorized remainders could easily appear in cases when no unrolling took place. When there is no unrolling (and no peeling/alignment-issues) total number of iterations in Remainder(s) is always equal to

Rem_Iterations = Original_Number_Iterations modulo Vector_Length.

If Rem_Iterations is know to be small (let's say known in advance to be 1 or 2) and/or expected vectorization speed-up is very small (let's say 1.1x) , then Compiler will normally choose to generate single scalar remainder.
If Rem_Iterations is big enough and/or vectorization speed-up is expected to be high even for vector length of 2, then Compiler will often try to further "split" remainder into 2 or more parts, for example so that Rem_Iterations = Vectorized_Remainder_Iterations*[Vector_Length/2] + Scalar_Remainder_Iterations. In other words, compiler will generate several remainders (in kind of "waterfall" manner) to make at least some of them exploiting vectorization.

One more way to think of that. Imagine loop with even number of iterations, e.g. 30 . Theoretically nothing prevents to execute all iterations of loop in vectorized body or remainders with vector legnth== 2. Furthermore you could do smarter and divide loop into 4 parts with 16, 8, 4 and 2 iterations respectively (16 + 8 +4 +2 =30). Now you can vectorize each loop with individual vector lengths of 16, 8, 4 and 2 respectively. However this approach will not always pay off, because the vectorization speed-up will often be insufficient to amortize "overhead" from vectorization for loops with too small vector length. That's why Compiler don't always use remainders vectorization technique.

But for loops with huge vectorization speed-ups and huge vector lengths it's pretty much always profitable to try to vectorize remainders with half- or quarter- vector-length of main loop body.

Last note: for AVX-512 (coming in 2nd generation Xeon Phi and future Intel Xeon) - given considerations will become slightly incorrect, but still generally applicable. The difference of AVX-512 is availability of mask registers which impacts Remainders topic. Whatever the case, you will often see vectorized and scalar remainders in AVX and AVX2 codes, and presumably you will continue seeing them in AVX-512 codes.

Intel Advisor Recommendations take in mind all considerations described above (for SSE,AVX,AVX2 and very soon for AVX-512), enrich compiler opt-report data with dynamic hotspots and dynamic trip counts knowledge, and automatically emits suggestions on how to improve performance of given code. Quality of corresponding Advisor Recommendations has been significantly improved after Intel Advisor 2016 Update1, so I'd recommend you to use Update2 (released in December) /Update3 (to be released very soon) versions.

velvia · ‎02-01-2016

Hi Zakhar,

Thanks for your explanation. But there a few things that puzzle me.

In order to be concrete, suppose that we a current generation Xeon that supports AVX2. The vector width is 256 bits. Then, for a "float" whose size is 32 bits, the vector length is 8 (as 256 = 8 * 32). The way I understand it, is that the vector length is fixed once you have the platform and the type. In that case, the vector length is 8. But you seem to indicate otherwise:

One more way to think of that. Imagine loop with even number of iterations, e.g. 30 . Theoretically nothing prevents to execute all iterations of loop in vectorized body or remainders with vector legnth== 2. Furthermore you could do smarter and divide loop into 4 parts with 16, 8, 4 and 2 iterations respectively (16 + 8 +4 +2 =30). Now you can vectorize each loop with individual vector lengths of 16, 8, 4 and 2 respectively.

The way I understand it is that you use multiple vector length for the same data type, which I don't understand. Could you please elaborate on that?

Maybe once you have a "natural" vector length (here 8), you would be able to use any vector length of 8 / 2^i such as 8, 4, 2. Is that correct? Do you have any concrete example where vectorization of the remainder works in a "waterfall" way?

Best regards.

Zakhar_M_Intel1 · ‎02-03-2016

Velvia,

Loop Vector Length is basically equal to number of scalar iterations corresponding to single vector iteration. Compilers can easily use Loop Vector Length bigger or smaller than "Natural Vector Length" for given combination of ISA and data type.

When VL_acual < VL_natural , then Compiler just uses part of vector registers (let's say half register or quarter register). When VL_actual > VL_natural, then Compiler uses techniques like "multi-pumping" (something relatively similar to unroll, where the same instruction is repeated few times).

Of course, in ideal world, VL_actual should be equal to VL_natural, because this is what delivers the best performance and efficiency. But by multiple reasons it could be required/more profitable to use VL_actual != VL_natural. Some of these reasons are:

- Usage of data types of different size in the same computational loops. In such case VL_natural is simply undefined for the loop as a whole.

- Data dependence between some (but not all) iterations. This is where #pragma omp simd safelen is involved (and therefore oftenl VL_natural > VL_actual).

- User' requested to use alternate vector length (just users wanted to do this way by some reason): #pragma omp simd simdlen.

- Not enough iterations to use natural vector length. This is our case.

On AVX-512 you will mostly always see scalar + vectorized remainder. On AVX2 you will see it with vectorized loops where trip count is, e.g, 30, while VL_natural = 16. On AVX(1) and SSE you will not see it .

One typical code of this nature (DL_MESO) is described at https://software.intel.com/en-us/articles/get-a-helping-hand-from-the-vectorization-advisor (there is also chapter in Pearls-2 book; you can also download some variant of this code from Pearls book web site). While article focuses mostly on AVX variant, the AVX2 vectorized version will often have combination of scalar and vectorized remainders. This is definitely not the simplest example of that kind, I just talked about it for reference.

Pure "waterfall" (i.e. sequence of many vectorized and scalar remainders) are very rarely seen, patially because having TOO much remainders has another performance drawbacks. But seeing 2 or maximum 3 remainders (where last one is scalar) is pretty common for modern ISAs.

velvia · ‎02-03-2016

Zakhar,

Thanks for your time and your lights on the subject.