Mohammad_A_
Beginner
60 Views

Simple vectorization question

 

Hi,

I wrote a simple function and ran it on a KNL processor (68 cores, Flat memory mode, Quadrant cluster mode, using MCDRAM) with a single thread and n = 10,000,000. I execute the function 100 times, take the average execution time, and calculate GFLOPS with the following formula: gflops = (1e-9 * 2.0 * n) / execution_time

double multiplyAccum(long n, double *A, double *B)
{
    long i;
    double result = 0;
    #pragma novector
    //#pragma simd
    for ( i = 0; i < n; i++ )
    {
        result += A[i] * B[i];   /* one multiply and one add per element */
    }
    return result;
}
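The measurement described above can be sketched roughly as follows. This is not the original benchmark: the helper names `now_seconds`, `gflops_from_time`, and `time_kernel_avg` are hypothetical, and the standard C `clock()` timer is an assumption (it reports CPU time, which is close to wall time for a single thread). Only the formula and the 100-repetition average come from the post.

```c
#include <time.h>

/* Kernel from the post: 2 flops (multiply + add) per element. */
double multiplyAccum(long n, const double *A, const double *B)
{
    double result = 0.0;
    long i;
    for (i = 0; i < n; i++)
        result += A[i] * B[i];
    return result;
}

/* CPU time in seconds from the standard C clock() (assumed timer). */
double now_seconds(void)
{
    return (double)clock() / (double)CLOCKS_PER_SEC;
}

/* Formula from the post: gflops = (1e-9 * 2.0 * n) / execution time. */
double gflops_from_time(long n, double seconds)
{
    return (1e-9 * 2.0 * (double)n) / seconds;
}

/* Hypothetical helper: average time per call over `reps` calls.
 * Results are accumulated into *sink so the calls cannot be optimized away. */
double time_kernel_avg(long n, const double *A, const double *B,
                       int reps, double *sink)
{
    double t0 = now_seconds();
    int r;
    for (r = 0; r < reps; r++)
        *sink += multiplyAccum(n, A, B);
    return (now_seconds() - t0) / (double)reps;
}
```

With n = 10,000,000 the numerator of the formula is 0.02, so the reported 0.839571 and 1.495788 GFLOPS correspond to roughly 23.8 ms and 13.4 ms per call.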

1) When I use #pragma novector, I get 0.839571 GFLOPS.

This is the compiler report for the loop:

      remark #15319: loop was not vectorized: novector directive used
      remark #25439: unrolled with remainder by 8  
      remark #25456: Number of Array Refs Scalar Replaced In Loop: 1
      remark #25457: Number of partial sums replaced: 1

2) When I use #pragma simd, I get 1.495788 GFLOPS.

This is the compiler report for the loop:

      remark #15388: vectorization support: reference A_34279 has aligned access   [ multiplyAccum.cpp(64,3) ]
      remark #15388: vectorization support: reference B_34279 has aligned access   [ multiplyAccum.cpp(64,3) ]
      remark #15305: vectorization support: vector length 8
      remark #15399: vectorization support: unroll factor set to 8
      remark #15309: vectorization support: normalized vectorization overhead 0.446
      remark #15301: SIMD LOOP WAS VECTORIZED
      remark #15448: unmasked aligned unit stride loads: 2 
      remark #15475: --- begin vector loop cost summary ---
      remark #15476: scalar loop cost: 9 
      remark #15477: vector loop cost: 0.870 
      remark #15478: estimated potential speedup: 10.280 
      remark #15488: --- end vector loop cost summary ---
      remark #25015: Estimate of max trip count of loop=156250

The estimated potential speedup is about 10x, but I only measure about 1.8x. What explains the difference?

 

Thanks,

mecej4
Black Belt

You may find a non-specific answer by reading about Amdahl's Law and its later modifications.
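For reference, Amdahl's law in its basic form says that if a fraction p of the original run time is sped up by a factor s and the rest is unchanged, the overall speedup is 1 / ((1 - p) + p / s). A minimal sketch, with the hypothetical function name `amdahl_speedup`:

```c
/* Amdahl's law: overall speedup when a fraction p (0..1) of the original
 * run time is accelerated by a factor s and the remainder is unchanged. */
double amdahl_speedup(double p, double s)
{
    return 1.0 / ((1.0 - p) + p / s);
}
```

As an illustration only, `amdahl_speedup(0.5, 10.28)` is about 1.82: if just half of the loop's time were work that the vector unit accelerates by the estimated 10.28x, the overall gain would be in the range reported here. Whether that split describes this loop is an assumption, not something the thread establishes.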

Mohammad_A_
Beginner

Thank you for your advice.

I only measure the execution time of the 'for' loop and nothing else, so how is Amdahl's law related? If I measured the execution time of the whole program, I would agree that the serial part, data movement, etc. dominate performance. But in my case, I am only interested in the vectorized part of my program.

[Start timing]

//#pragma simd
for ( i = 0; i < n; i++ )
{
    result += A[i] * B[i];
}

[End timing]

Thanks,

TimP
Black Belt

#pragma simd requires the reduction clause to be explicit. Where is your MKL question?
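A sketch of what an explicit reduction looks like. It uses the OpenMP form, `#pragma omp simd reduction(+:result)`, rather than the Intel-specific `#pragma simd` from the post; that substitution, the function name `multiplyAccumSimd`, and the compiler flags named in the comment are assumptions for illustration.

```c
/* Dot product with the reduction variable declared explicitly.
 * Build with OpenMP SIMD support enabled (for example -qopenmp-simd with icc
 * or -fopenmp-simd with gcc). Without it the pragma is ignored and the loop
 * runs as ordinary C. */
double multiplyAccumSimd(long n, const double *A, const double *B)
{
    double result = 0.0;
    long i;
    #pragma omp simd reduction(+:result)
    for (i = 0; i < n; i++)
    {
        result += A[i] * B[i];
    }
    return result;
}
```

The reduction clause tells the compiler that `result` may be accumulated in per-lane partial sums and combined at the end. That reorders the floating-point additions, so the last digits of the result can differ from the scalar loop.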

Ying_H_Intel
Employee

Hi Mohammad ,

If you'd like to try an MKL function, you can replace multiplyAccum() with cblas_ddot(), which computes a vector-vector dot product. Compile it with

icc yourmainc.cpp -mkl

and let us know the result.

FYI: the MKL user guide, https://software.intel.com/en-us/node/528582, covers memory alignment and other ways to improve performance.

The MKL developer reference, https://software.intel.com/en-us/mkl-developer-reference-c, gives the prototype:

double cblas_ddot (const MKL_INT n, const double *x, const MKL_INT incx, const double *y, const MKL_INT incy);
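A sketch of the replacement suggested above, using the prototype as quoted. The `USE_MKL` macro, the wrapper name `multiplyAccumBlas`, and the stand-in implementation in the `#else` branch are all assumptions for illustration; the stand-in exists only so the sketch compiles on a machine without MKL and is not MKL's code.

```c
#ifdef USE_MKL
#include <mkl.h>   /* real cblas_ddot, e.g.: icc yourmainc.cpp -DUSE_MKL -mkl */
#else
/* Stand-in with the same argument order as the quoted prototype.
 * Reference loop only; handles non-negative strides. */
typedef int MKL_INT;
double cblas_ddot(const MKL_INT n, const double *x, const MKL_INT incx,
                  const double *y, const MKL_INT incy)
{
    double sum = 0.0;
    MKL_INT i;
    for (i = 0; i < n; i++)
        sum += x[i * incx] * y[i * incy];
    return sum;
}
#endif

/* Same interface as multiplyAccum(): both arrays are read with unit stride. */
double multiplyAccumBlas(long n, const double *A, const double *B)
{
    return cblas_ddot((MKL_INT)n, A, 1, B, 1);
}
```

The two `1` arguments are the strides incx and incy, which means every element of A and B is used.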

Best Regards,

Ying

McCalpinJohn
Black Belt

It is not generally helpful to post the same question on multiple forums.  This question does not belong here in the Intel MKL forum, since it has nothing to do with MKL.  Appropriate forums might include:

The performance on this kernel is limited by a single core's memory bandwidth, as I explained in response to your post in the C Compiler Forum at https://software.intel.com/en-us/forums/intel-c-compiler/topic/726771#comment-1902493
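A back-of-the-envelope check of the bandwidth argument, using only numbers from the original post. Each iteration performs 2 flops and loads two 8-byte doubles, so the memory traffic implied by a measured rate is 8 bytes per flop. The function name `implied_bandwidth_gbs` is hypothetical.

```c
/* Each loop iteration: 2 flops (multiply + add) and 2 loads of one double.
 * Implied memory traffic in GB/s = GFLOPS / flops per iteration
 *                                  * bytes per iteration. */
double implied_bandwidth_gbs(double gflops)
{
    const double flops_per_iter = 2.0;
    const double bytes_per_iter = 2.0 * (double)sizeof(double);
    return gflops / flops_per_iter * bytes_per_iter;
}
```

For the reported 0.839571 and 1.495788 GFLOPS this gives roughly 6.7 GB/s and 12.0 GB/s. The two arrays of 10,000,000 doubles total 160 MB, so every element is streamed from memory on each call. If about 12 GB/s is close to what one core can sustain on this system (a STREAM-style test would show that), wider vectors cannot raise the rate further, whatever the compiler's instruction-cost estimate says.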
