Beginner

## AVX2 vectorization: estimated potential speedup 4.5 for double precision

Hi everyone,

here is a very simple for-loop:

```c
double a[8192];
for (i = 0; i < 8192; ++i) {
    a[i] = i * 32.0 + 8.0;
}
```

The compilation command is:

`icc -O3 -xCORE-AVX2 -qopt-report-phase=all -qopt-report=5`

The following is part of the optimization report:

```
remark #15305: vectorization support: vector length 4
remark #15300: LOOP WAS VECTORIZED
remark #15475: --- begin vector cost summary ---
remark #15476: scalar cost: 9
remark #15477: vector cost: 2.000
remark #15478: estimated potential speedup: 4.490
```

I understand that the vector length is 4 for double-precision floating-point numbers in AVX2, so I expected the potential speedup to be 4. Why is the potential speedup estimated here as 4.49? Also, if the scalar cost is 9, I would expect the vector cost to be 9 / 4 = 2.25, not 2.0. Why is the vector cost 2.0?

1 Solution
Moderator

Hi,

The potential speedup depends on various factors such as the micro-architecture, aligned versus unaligned data access, and so on.

The calculation of the scalar and vector costs also depends on multiple factors that cannot be disclosed.

You can refer to this documentation for further details on optimization reports:

https://software.intel.com/content/www/us/en/develop/articles/getting-the-most-out-of-your-intel-com...

Regards

Rajesh.

4 Replies

Moderator

Hi,

Thanks for the confirmation!

As this issue has been resolved, we will no longer respond to this thread. If you require any additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.

Have a good day.

Regards

Rajesh.

Black Belt

You might find it instructive to compare the run times of a "generic" non-optimized build and your optimized build.

Then compare the generated code. This can provide some insight into both the accuracy of this estimate and what the compiler actually did.

Caution: make sure your runtime test uses the output (in array a) after the loop, or else the optimizer may remove the loop as "dead code". Also, iterate the runtime test a few times after program startup and discard the first iteration.

Jim Dempsey

Beginner

Thanks for the info. Recently I have learned a lot about vectorization. For my code I unrolled the short inner loops and used

`#pragma omp simd`

to explicitly vectorize the outer loop (which becomes the innermost loop once all the short inner loops are unrolled). That gives a really fantastic speedup. I think both vectorization and loop unrolling contribute to the performance.