Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

## Simple vectorization question

Beginner
343 Views

Hi,

I wrote a simple function and executed it on a KNL processor (68 cores, Quadrant cluster mode, Flat memory mode, allocating from MCDRAM) using only one thread and n = 10,000,000. I execute this function 100 times, take the average execution time, then calculate GFLOPS with the formula gflops = (1e-9 * 2.0 * n) / execution_time.

```c
double multiplyAccum(long n, double *A, double *B)
{
    long i;
    double result = 0;
    #pragma novector
    //#pragma simd
    for (i = 0; i < n; i++)
    {
        result += A[i] * B[i];
    }
    return result;
}
```

1) When I use #pragma novector, I get 0.839571 GFLOPS.

This is the compiler report for the loop:

remark #15319: loop was not vectorized: novector directive used
remark #25439: unrolled with remainder by 8
remark #25456: Number of Array Refs Scalar Replaced In Loop: 1
remark #25457: Number of partial sums replaced: 1

2) When I use #pragma simd, I get 1.495788 GFLOPS.

This is the compiler report for the loop:

remark #15388: vectorization support: reference A_34279 has aligned access   [ multiplyAccum.cpp(64,3) ]
remark #15388: vectorization support: reference B_34279 has aligned access   [ multiplyAccum.cpp(64,3) ]
remark #15305: vectorization support: vector length 8
remark #15399: vectorization support: unroll factor set to 8
remark #15309: vectorization support: normalized vectorization overhead 0.446
remark #15301: SIMD LOOP WAS VECTORIZED
remark #15475: --- begin vector loop cost summary ---
remark #15476: scalar loop cost: 9
remark #15477: vector loop cost: 0.870
remark #15478: estimated potential speedup: 10.280
remark #15488: --- end vector loop cost summary ---
remark #25015: Estimate of max trip count of loop=156250

The estimated potential speedup is 10x, while I only get 1.8x. What is the explanation for this?

Thanks,

5 Replies
Black Belt

Beginner

I only measure the execution time of the for loop and nothing else. How is Amdahl's law related? If I measured the execution time of the whole program, I would agree that the serial parts, data movement, etc. have the dominant impact on performance. But in my case, I am only interested in the vectorized part of my program.

[Start timing]

```c
//#pragma simd
for (i = 0; i < n; i++)
{
    result += A[i] * B[i];
}
```

[End timing]

Thanks,
Black Belt

#pragma simd requires the reduction clause to be explicit. Where is your MKL question?

Employee

If you'd like to try an MKL function, you could replace multiplyAccum() with cblas_ddot(), which computes a vector-vector dot product, compile it with

icc yourmainc.cpp -mkl

and let us know the result.

FYI: the MKL user guide, https://software.intel.com/en-us/node/528582, covers memory alignment etc. to improve performance,

and the MKL developer reference: https://software.intel.com/en-us/mkl-developer-reference-c

```c
double cblas_ddot (const MKL_INT n, const double *x, const MKL_INT incx,
                   const double *y, const MKL_INT incy);
```

Best Regards,

Ying

Black Belt

It is not generally helpful to post the same question on multiple forums. This question does not belong here in the Intel MKL forum, since it has nothing to do with MKL; a compiler forum would be more appropriate.

The performance on this kernel is limited by a single core's memory bandwidth, as I explained in response to your post in the C Compiler Forum at https://software.intel.com/en-us/forums/intel-c-compiler/topic/726771#comment-1902493