xwuupb
Beginner

AVX2 vectorization: estimated potential speedup 4.5 for double precision


Hi everyone,

Here is a very simple for loop:

double a[8192];
for (int i = 0; i < 8192; ++i) {
    a[i] = i * 32.0 + 8.0;
}

The compilation command is:

icc -O3 -xCORE-AVX2 -qopt-report-phase=all -qopt-report=5

The following is part of the optimization report:

remark #15305: vectorization support: vector length 4
remark #15300: LOOP WAS VECTORIZED
remark #15475: --- begin vector cost summary ---
remark #15476: scalar cost: 9
remark #15477: vector cost: 2.000
remark #15478: estimated potential speedup: 4.490

I understand that the vector length is 4 for double-precision floating-point numbers in AVX2, so I expect a potential speedup of 4. Why is the potential speedup here estimated as 4.49? Similarly, if the scalar cost is 9, I would expect a vector cost of 9 / 4 = 2.25, not 2.0. Why is the vector cost 2.0?

Thanks in advance!

1 Solution
MRajesh_intel
Moderator

Hi,

The potential speedup depends on various factors, such as the micro-architecture and aligned versus unaligned data access.

The calculation of the scalar and vector costs also depends on multiple factors that cannot be disclosed.

You can refer to this documentation for further details on optimization reports:

https://software.intel.com/content/www/us/en/develop/articles/getting-the-most-out-of-your-intel-com...

Regards,
Rajesh



4 Replies

MRajesh_intel
Moderator

Hi,

Thanks for the confirmation!

As this issue has been resolved, we will no longer respond to this thread. If you require any additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.

Have a good day.

Regards,
Rajesh


jimdempseyatthecove
Black Belt

You might find it instructional to compare the run times between a "generic" non-optimized build and your optimized build.

Then compare the code generated. This might provide some insight as to both the accuracy of this estimate and as to what was done by the compiler.

Caution: make sure your runtime test uses the output (in array a) after the loop; otherwise the optimizer may remove the loop as dead code. Also, the runtime test should be repeated a few times after program startup, discarding the first iteration.
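A minimal timing harness along these lines might look like the following sketch. It uses standard C clock() for portability; N matches the loop in the question, while the repetition count and the function name are illustrative choices, not anything from the original post. The returned checksum is what keeps the loop alive.

```c
/* Sketch of a timing harness: repeat the loop, discard the first
   (warm-up) repetition, and consume the output so the optimizer
   cannot treat the loop as dead code. */
#include <stdio.h>
#include <time.h>

#define N 8192
#define REPS 5

double timed_fill(void) {
    static double a[N];
    double checksum = 0.0;

    for (int rep = 0; rep < REPS; ++rep) {
        clock_t t0 = clock();
        for (int i = 0; i < N; ++i)
            a[i] = i * 32.0 + 8.0;
        clock_t t1 = clock();

        if (rep > 0)  /* discard the first repetition as warm-up */
            printf("rep %d: %ld clocks\n", rep, (long)(t1 - t0));
    }

    /* use the output after the loop so it is not optimized away */
    for (int i = 0; i < N; ++i)
        checksum += a[i];
    return checksum;
}
```

Comparing the reported clocks between a plain `-O0` build and the `-xCORE-AVX2` build then gives a measured speedup to hold against the compiler's estimate.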

Jim Dempsey

xwuupb
Beginner

Thanks for the info. I have recently learned a lot about vectorization. In my code I unrolled the short inner loops and used

#pragma omp simd

to explicitly vectorize the then-outermost loop (which becomes the innermost loop once all the short inner loops are unrolled). This gives a really fantastic speedup. I think both vectorization and loop unrolling contribute to the performance.
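A sketch of that pattern (not the poster's actual code; the function name, unroll factor of 4, and the computation are illustrative) could look like this. Without an OpenMP SIMD flag (icc -qopenmp-simd, gcc/clang -fopenmp-simd) the pragma is simply ignored, so the code stays correct either way.

```c
/* Pattern described above: the short inner loop over 4 components is
   unrolled by hand, and the remaining outer loop is vectorized
   explicitly with #pragma omp simd. */
void scale_groups4(const double *x, double *y, int n) {
    #pragma omp simd
    for (int i = 0; i < n; ++i) {
        /* former inner loop of length 4, fully unrolled */
        y[4 * i + 0] = x[4 * i + 0] * 2.0;
        y[4 * i + 1] = x[4 * i + 1] * 2.0;
        y[4 * i + 2] = x[4 * i + 2] * 2.0;
        y[4 * i + 3] = x[4 * i + 3] * 2.0;
    }
}
```

With the inner loop unrolled, each outer iteration has straight-line work, which makes it easier for the compiler to vectorize across outer iterations.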
