Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

Simple vectorization question

Mohammad_A_
Beginner
431 Views

 

Hi,

I wrote a simple function and executed it on a KNL processor (68 cores, Flat memory mode, Quadrant clustering, using MCDRAM) using only one thread and n = 10,000,000. I execute this function 100 times, take the average execution time, then calculate GFLOP/s with the following formula: gflops = (1e-9 * 2.0 * n) / execution_time

double multiplyAccum(long n,double *A, double *B)
{
    long i;
    double result = 0;
    #pragma novector
    //#pragma simd
    for ( i = 0; i < n; i++ )
    {
        result += A[i] * B[i];
    }
    return result;
}

1) When I use #pragma novector, I get 0.839571 GFLOP/s

This is the compiler report for the loop:

      remark #15319: loop was not vectorized: novector directive used
      remark #25439: unrolled with remainder by 8  
      remark #25456: Number of Array Refs Scalar Replaced In Loop: 1
      remark #25457: Number of partial sums replaced: 1

2) When I use #pragma simd, I get 1.495788 GFLOP/s

This is the compiler report for the loop:

      remark #15388: vectorization support: reference A_34279 has aligned access   [ multiplyAccum.cpp(64,3) ]
      remark #15388: vectorization support: reference B_34279 has aligned access   [ multiplyAccum.cpp(64,3) ]
      remark #15305: vectorization support: vector length 8
      remark #15399: vectorization support: unroll factor set to 8
      remark #15309: vectorization support: normalized vectorization overhead 0.446
      remark #15301: SIMD LOOP WAS VECTORIZED
      remark #15448: unmasked aligned unit stride loads: 2 
      remark #15475: --- begin vector loop cost summary ---
      remark #15476: scalar loop cost: 9 
      remark #15477: vector loop cost: 0.870 
      remark #15478: estimated potential speedup: 10.280 
      remark #15488: --- end vector loop cost summary ---
      remark #25015: Estimate of max trip count of loop=156250

The estimated potential speedup is 10x, but I only get about 1.8x. What is the explanation for this?

 

Thanks,

0 Kudos
5 Replies
mecej4
Honored Contributor III

You may find a non-specific answer by reading about Amdahl's Law and its later modifications.

Mohammad_A_
Beginner

Thank you for your advice.

I only measure the execution time of the for loop and nothing else, so how is Amdahl's law related? If I measured the execution time of the whole program, I would agree that the serial parts, data movement, etc. have the dominant impact on performance. But in my case, I am only interested in the vectorized part of my program.

[Start timing]

//#pragma simd
for ( i = 0; i < n; i++ )
{
    result += A[i] * B[i];
}

[End Timing]

Thanks,
TimP
Honored Contributor III

#pragma simd requires the reduction clause to be explicit.  Where is your MKL question?

Ying_H_Intel
Employee

Hi Mohammad,

If you'd like to try an MKL function, you could replace multiplyAccum() with cblas_ddot(), which computes a vector-vector dot product, compile with

icc yourmainc.cpp -mkl

and let us know the result.

FYI: the MKL user guide (https://software.intel.com/en-us/node/528582) covers memory alignment and other ways to improve performance, and the MKL developer reference is here: https://software.intel.com/en-us/mkl-developer-reference-c

double cblas_ddot (const MKL_INT n, const double *x, const MKL_INT incx, const double *y, const MKL_INT incy);

Best Regards,

Ying

McCalpinJohn
Honored Contributor III

It is not generally helpful to post the same question on multiple forums.  This question does not belong here in the Intel MKL forum, since it has nothing to do with MKL.  Appropriate forums might include:

The performance on this kernel is limited by a single core's memory bandwidth, as I explained in response to your post in the C Compiler Forum at https://software.intel.com/en-us/forums/intel-c-compiler/topic/726771#comment-1902493
