Solved: Vectorization Potential Speedup Calculation

Sharath_K_1 · ‎06-19-2015

Architecture: x86_64 (ivybridge with 8 cores)
Compiler Version: icc 15.0

How does the compiler calculate the estimated potential speedup for a loop in the vectorization report? How can I find the cache sizes? And how do aligned and unaligned accesses affect the potential speedup value? Can you please explain with reference to the report below?

for(index=0;index<SIZE;index++)
{
     array_A[index]=array_B[index]+array_C[index];
}

LOOP BEGIN at vector1.c(14,2)
   remark #15388: vectorization support: reference array_A has aligned access   [ vector1.c(16,3) ]
   remark #15388: vectorization support: reference array_B has aligned access   [ vector1.c(16,3) ]
   remark #15388: vectorization support: reference array_C has aligned access   [ vector1.c(16,3) ]
   remark #15399: vectorization support: unroll factor set to 4
   remark #15300: LOOP WAS VECTORIZED
   remark #15448: unmasked aligned unit stride loads: 2 
   remark #15449: unmasked aligned unit stride stores: 1 
   remark #15475: --- begin vector loop cost summary ---
   remark #15476: scalar loop cost: 6 
   remark #15477: vector loop cost: 5.000 
   remark #15478: estimated potential speedup: 4.800 
   remark #15479: lightweight vector operations: 5 
   remark #15488: --- end vector loop cost summary ---
LOOP END

Thanks in advance

Hideki_I_Intel · ‎06-19-2015

Please note that, within the compiler, the vectorizer runs pretty early on.

As a result, the vectorizer's cost modeling cannot be as accurate as, for example, the cost models of the register allocator and instruction scheduler, since the vectorizer does not see what will happen downstream. The vectorizer's cost model is therefore a very rough estimate, aimed at making vectorization decisions (vectorize or not, how many elements per vector, how much to unroll further, etc.) reasonable across a broad range of code bases. We don't try to be perfect on any particular code.

Regarding your example: there is a bug related to the estimated speedup and the unroll factor. The correct value to print is 1.2 (= 6 / 5). Because of the bug, it was multiplied by the unroll factor (4) and became 4.8.
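The arithmetic can be checked with a few lines of C. This is only a sketch of the ratio and the unroll-factor multiplication described above, not the compiler's actual cost model; the function names `expected_speedup` and `buggy_reported_speedup` are made up for illustration.

```c
#include <assert.h>

/* Ratio the report is meant to print: scalar loop cost / vector loop cost.
   Hypothetical helper; the real cost model is not public. */
double expected_speedup(double scalar_cost, double vector_cost)
{
    assert(vector_cost > 0.0);
    return scalar_cost / vector_cost;
}

/* Value printed by the affected compiler, following the explanation above:
   the ratio is additionally multiplied by the unroll factor. */
double buggy_reported_speedup(double scalar_cost, double vector_cost,
                              int unroll_factor)
{
    return expected_speedup(scalar_cost, vector_cost) * unroll_factor;
}
```

With the values from the report (scalar cost 6, vector cost 5, unroll factor 4), the first function gives 1.2 and the second gives 4.8, which matches remark #15478.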

Now, how do we compute the scalar and vector costs? The details are part of our intellectual property, so I will not describe them. You can try to reverse engineer them by modifying the code a little at a time and observing how the scalar and vector cost values change.

About cache size: the vectorizer's use of cache size is limited to non-temporal store generation. With today's multicore CPUs, different CPU SKUs, possibly parallel application code, and other apps running on the same CPU, it is difficult for the vectorizer to estimate the relative cache footprint at the time the loop is running (for example, another app may be filling up the last-level cache entirely). If the vectorizer makes a mistake (I'd hope not too often), you can use "#pragma vector [non]temporal" to control it. If you need to query the cache size of the CPU you are using, you can use the Intel Processor Identification Utility or other similar tools.
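As a rough illustration of both points, the sketch below applies the non-temporal hint to the loop from the question and queries the last-level cache size programmatically. Several things here are assumptions rather than anything stated in this thread: the arrays are taken to be `double`, the function names are invented, and the cache query uses `sysconf` with `_SC_LEVEL3_CACHE_SIZE`, which is a glibc extension on Linux and a different route from the Intel Processor Identification Utility. The pragma is guarded so that other compilers simply build a plain loop.

```c
#include <stddef.h>
#include <unistd.h>

/* Element-wise add with a streaming-store hint. The pragma is specific to
   the Intel compiler; elsewhere it is skipped and the loop is unchanged. */
void add_streaming(double *a, const double *b, const double *c, size_t n)
{
    size_t i;
#if defined(__INTEL_COMPILER)
#pragma vector nontemporal
#endif
    for (i = 0; i < n; i++) {
        a[i] = b[i] + c[i];
    }
}

/* L3 cache size in bytes, or -1 if it cannot be determined.
   _SC_LEVEL3_CACHE_SIZE is a glibc extension, not part of POSIX. */
long l3_cache_bytes(void)
{
#ifdef _SC_LEVEL3_CACHE_SIZE
    long size = sysconf(_SC_LEVEL3_CACHE_SIZE);
    return size > 0 ? size : -1;
#else
    return -1;
#endif
}
```

Replacing `nontemporal` with `temporal` asks for ordinary cached stores instead. Comparing the array footprint against `l3_cache_bytes()` is one way to judge which hint is plausible, keeping in mind that other cores and applications share that cache.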

Hope this is helpful enough.

TimP · ‎06-19-2015

If you allow the instruction set target to fall back to an older one, the vector speedup calculation will not be at all accurate for a recent CPU. In particular, as you mention, the penalties for misalignment were much larger on past CPUs.

In some cases, there may be little penalty for unaligned loads, maybe even so little that the compiler will know not to peel for alignment.

I suppose the calculation has to be done with an assumed loop count (probably a large one, if no information is available at compile time).

You could find out yourself how the calculation is affected, for example by setting #pragma vector unaligned and placing the arrays at various alignments, as well as trying various -m settings. I wouldn't be totally surprised if the compiler were aware of the different AVX misalignment penalties of Sandy Bridge and Ivy Bridge, even though the code generation may be the same between those architecture options. There may be no reason to use the Ivy Bridge option, which could fail on Sandy Bridge, except to hope for an updated assessment of misalignment.
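One possible setup for that experiment is sketched below. It assumes `double` arrays and 64-byte aligned backing buffers that are shifted by a chosen number of elements; `SIZE`, `MAX_OFFSET`, the buffer names and the helper functions are all hypothetical, and the alignment attribute is the GCC-style extension that icc also accepts. The pragma is guarded so the file still builds with other compilers.

```c
#include <assert.h>
#include <stddef.h>

#define SIZE 1024
#define MAX_OFFSET 8   /* spare elements so the arrays can be shifted */

/* 64-byte aligned backing storage for the three arrays. */
static double buf_a[SIZE + MAX_OFFSET] __attribute__((aligned(64)));
static double buf_b[SIZE + MAX_OFFSET] __attribute__((aligned(64)));
static double buf_c[SIZE + MAX_OFFSET] __attribute__((aligned(64)));

/* Bytes between a shifted array start and the previous 32-byte (AVX)
   boundary; valid because the buffers themselves are 64-byte aligned. */
size_t misalignment_bytes(size_t offset_elems)
{
    return (offset_elems * sizeof(double)) % 32;
}

/* Loop under study, with the compiler told not to assume alignment. */
void add_arrays(double *a, const double *b, const double *c, size_t n)
{
    size_t i;
#if defined(__INTEL_COMPILER)
#pragma vector unaligned
#endif
    for (i = 0; i < n; i++) {
        a[i] = b[i] + c[i];
    }
}

/* Run the loop with all three arrays shifted by offset_elems doubles.
   Returns the last result so the work is observable. */
double run_with_offset(size_t offset_elems)
{
    double *a = buf_a + offset_elems;
    double *b = buf_b + offset_elems;
    double *c = buf_c + offset_elems;
    size_t i;

    assert(offset_elems <= MAX_OFFSET);
    for (i = 0; i < SIZE; i++) {
        b[i] = (double)i;
        c[i] = 2.0 * (double)i;
    }
    add_arrays(a, b, c, SIZE);
    return a[SIZE - 1];
}
```

Calling `run_with_offset` with offsets 0 through 3 covers every misalignment of a `double` array relative to a 32-byte boundary. Rebuilding with a vectorization report enabled (for example `-qopt-report=5 -qopt-report-phase=vec` in icc 15) and with different target options should show how the scalar and vector cost remarks change.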

I don't know whether you could infer the calculations by a detailed study of the architecture guides.

Recent compilers have become good at figuring out whether vectorization "seems inefficient" (unless you assert simd, vector always, or use Cilk(tm) Plus notation, which override the compiler's decision on whether vectorization is desirable).

I can't answer your specific questions about the report you show. It seems to indicate that the arrays are known to be aligned. Possibly you are using AVX with a 64-bit data type, as it appears to use 4-wide SIMD in the speedup calculation (unrolled to get 4 results per loop iteration in scalar mode and 16 in vector mode). I suppose it's nearly load/store limited.
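The 4 and 16 in that guess come from simple lane arithmetic, sketched here under the assumption of 256-bit AVX registers and 64-bit elements. The helper names are invented for illustration and say nothing about how the compiler itself counts.

```c
#include <assert.h>

/* Number of elements that fit in one SIMD register. */
unsigned simd_lanes(unsigned register_bits, unsigned element_bits)
{
    assert(element_bits > 0 && register_bits % element_bits == 0);
    return register_bits / element_bits;
}

/* Results produced per loop iteration after unrolling; a scalar loop
   corresponds to lanes == 1. */
unsigned elements_per_iteration(unsigned lanes, unsigned unroll_factor)
{
    return lanes * unroll_factor;
}
```

For a 256-bit register and 64-bit data, `simd_lanes` gives 4. With the unroll factor of 4 from remark #15399, that is 4 results per iteration in scalar mode and 16 in vector mode.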

KitturGanesh · ‎06-19-2015

Thanks Hideki! Sharath, BTW Hideki is our internal vectorization expert :)
_Kittur