Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.
7944 Discussions

AVX2 vectorization: estimated potential speedup 4.5 for double precision

xwuupb
Novice

Hi everyone,

Here is a very simple for loop:

double a[8192];
for (int i = 0; i < 8192; ++i) {
    a[i] = i * 32.0 + 8.0;
}

 The compilation command is:

icc -O3 -xCORE-AVX2 -qopt-report-phase=all -qopt-report=5

The following is part of the optimization report:

remark #15305: vectorization support: vector length 4
remark #15300: LOOP WAS VECTORIZED
remark #15475: --- begin vector cost summary ---
remark #15476: scalar cost: 9
remark #15477: vector cost: 2.000
remark #15478: estimated potential speedup: 4.490

I understand that the vector length is 4 for double-precision floating-point numbers in AVX2, so I expect a potential speedup of 4. Why is the potential speedup here estimated at 4.490? Or, if the scalar cost is 9, I would expect a vector cost of 9 / 4 = 2.25, not 2.000. Why is the vector cost 2.000?

Thanks in advance!


4 Replies
MRajesh_intel
Moderator

Hi,


The potential speedup depends on various factors such as the micro-architecture, aligned versus unaligned data access, etc.


Calculation of scalar and vector costs also depends on multiple factors that cannot be disclosed.


You can refer to this documentation for further details on optimization reports:


https://software.intel.com/content/www/us/en/develop/articles/getting-the-most-out-of-your-intel-compiler-with-the-new-optimization-reports.html


Regards

Rajesh.


MRajesh_intel
Moderator

Hi,

Thanks for accepting this as a solution.

As this issue has been resolved, we will no longer respond to this thread. If you require any additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.

Have a good day.

Regards

Rajesh.

jimdempseyatthecove
Honored Contributor III

You might find it instructional to compare the run times between a "generic" non-optimized build and your optimized build.

Then compare the code generated. This might provide some insight as to both the accuracy of this estimate and as to what was done by the compiler.

Caution: make sure your runtime test makes use of the output (in array a) after the loop, else the optimizer may remove the loop as if it were "dead code". Also, the runtime tests should be iterated a few times after program startup, discarding the first iteration as a warm-up.

Jim Dempsey

xwuupb
Novice

Thanks for the info. Recently I have learned a lot about vectorization. For my code I unrolled the short inner loops and used

#pragma omp simd

to explicitly vectorize the outer loop (which becomes the innermost loop once all the short inner loops are unrolled). That gave a really fantastic speedup. I think both vectorization and loop unrolling contribute to the performance.
