Hi,
I wrote a simple function and ran it on a KNL processor (68 cores, Flat Quadrant mode, using MCDRAM) with a single thread and n = 10,000,000. I execute the function 100 times, take the average execution time, and calculate GFLOPS with the following formula: gflops = (1e-9 * 2.0 * n) / execution_time
double multiplyAccum(long n, double *A, double *B)
{
    long i;
    double result = 0;
    #pragma novector
    //#pragma simd
    for ( i = 0; i < n; i++ )
    {
        result += A[i] * B[i];
    }
    return result;
}
1) When I use #pragma novector, I get 0.839571 GFLOPS.
This is the compiler report for the loop:
remark #15319: loop was not vectorized: novector directive used
remark #25439: unrolled with remainder by 8
remark #25456: Number of Array Refs Scalar Replaced In Loop: 1
remark #25457: Number of partial sums replaced: 1
2) When I use #pragma simd, I get 1.495788 GFLOPS.
This is the compiler report for the loop:
remark #15388: vectorization support: reference A_34279 has aligned access [ multiplyAccum.cpp(64,3) ]
remark #15388: vectorization support: reference B_34279 has aligned access [ multiplyAccum.cpp(64,3) ]
remark #15305: vectorization support: vector length 8
remark #15399: vectorization support: unroll factor set to 8
remark #15309: vectorization support: normalized vectorization overhead 0.446
remark #15301: SIMD LOOP WAS VECTORIZED
remark #15448: unmasked aligned unit stride loads: 2
remark #15475: --- begin vector loop cost summary ---
remark #15476: scalar loop cost: 9
remark #15477: vector loop cost: 0.870
remark #15478: estimated potential speedup: 10.280
remark #15488: --- end vector loop cost summary ---
remark #25015: Estimate of max trip count of loop=156250
The estimated potential speedup is about 10x, but I only get about 1.8x. What is the explanation for this?
Thanks,
You may find a non-specific answer by reading about Amdahl's Law and its later modifications.
Thank you for your advice.
I only measure the execution time of the 'for' loop and nothing else. How is Amdahl's law related? If I measured the execution time of the whole program, I would agree that the serial part, data movement, etc. have the dominant impact on performance. But in my case, I am only interested in the vectorized part of my program.
[Start timing]
//#pragma simd
for ( i = 0; i < n; i++ )
{
    result += A[i] * B[i];
}
[End timing]
Thanks,
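The measurement described above (time only the loop, repeat it, average, then apply the GFLOPS formula from the first post) could be sketched roughly as follows. This is only an illustration under assumptions: the helper names wall_seconds, gflops, and average_loop_seconds are hypothetical, and the POSIX clock_gettime timer is a stand-in for whatever timer the original program used.

```c
#define _POSIX_C_SOURCE 199309L
#include <time.h>

/* Hypothetical helper: monotonic wall-clock time in seconds. */
static double wall_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
}

/* Formula from the first post: 2 floating-point operations per element. */
double gflops(long n, double seconds)
{
    return (1e-9 * 2.0 * (double)n) / seconds;
}

/* Times only the multiply-accumulate loop, repeated `reps` times, and
 * returns the average time per repetition in seconds. The accumulated
 * result is written to *sink so the compiler cannot discard the loop. */
double average_loop_seconds(long n, const double *A, const double *B,
                            int reps, double *sink)
{
    double total = 0.0;
    double acc = 0.0;
    for (int r = 0; r < reps; r++) {
        double result = 0.0;
        double t0 = wall_seconds();            /* [Start timing] */
        for (long i = 0; i < n; i++) {
            result += A[i] * B[i];
        }
        total += wall_seconds() - t0;          /* [End timing] */
        acc += result;
    }
    *sink = acc;
    return total / (double)reps;
}
```

Calling gflops(n, average_loop_seconds(n, A, B, 100, &sink)) would then give a figure comparable to the ones quoted in the first post, assuming the same n and repetition count.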
#pragma simd requires the reduction clause to be explicit. Where is your MKL question?
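For illustration, an explicit reduction on the loop from the original post might look like the sketch below. It uses the OpenMP spelling, #pragma omp simd reduction(+:result); the Intel-specific form would be #pragma simd reduction(+:result). The const qualifiers are an addition for this sketch and are not in the original function.

```c
/* Dot product with the reduction variable declared explicitly.
 * A compiler without OpenMP SIMD support enabled ignores the pragma
 * and compiles an ordinary scalar loop with the same result. */
double multiplyAccum(long n, const double *A, const double *B)
{
    double result = 0.0;
    #pragma omp simd reduction(+:result)
    for (long i = 0; i < n; i++) {
        result += A[i] * B[i];
    }
    return result;
}
```

The clause tells the compiler that result may be accumulated in independent vector lanes and combined at the end, which changes the order of the floating-point additions compared with the scalar loop.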
Hi Mohammad ,
If you'd like to try an MKL function, you can replace multiplyAccum() with cblas_ddot(), which computes a vector-vector dot product. Compile with icc yourmainc.cpp -mkl and let us know the result.
FYI: the MKL user guide, https://software.intel.com/en-us/node/528582, covers memory alignment and other ways to improve performance. The MKL developer reference is at https://software.intel.com/en-us/mkl-developer-reference-c
double cblas_ddot (const MKL_INT n, const double *x, const MKL_INT incx, const double *y, const MKL_INT incy);
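As a rough sketch of that substitution, the wrapper below calls cblas_ddot with unit strides when built against MKL and falls back to a plain loop otherwise. The wrapper name multiplyAccumBlas and the USE_MKL macro are hypothetical and only serve to keep the example buildable without MKL; with MKL it would be compiled along the lines of icc -DUSE_MKL yourmainc.cpp -mkl.

```c
#ifdef USE_MKL
#include <mkl.h>   /* declares cblas_ddot and MKL_INT */
#endif

/* Hypothetical drop-in replacement for multiplyAccum(). */
double multiplyAccumBlas(long n, const double *A, const double *B)
{
#ifdef USE_MKL
    /* incx = incy = 1: both arrays are contiguous. */
    return cblas_ddot((MKL_INT)n, A, 1, B, 1);
#else
    /* Fallback when MKL is not available: the same dot product by hand. */
    double result = 0.0;
    for (long i = 0; i < n; i++) {
        result += A[i] * B[i];
    }
    return result;
#endif
}
```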
Best Regards,
Ying
It is not generally helpful to post the same question on multiple forums. This question does not belong here in the Intel MKL forum, since it has nothing to do with MKL. Appropriate forums might include:
- https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring
- https://software.intel.com/en-us/forums/intel-moderncode-for-parallel-architectures
- https://software.intel.com/en-us/forums/intel-many-integrated-core
- https://software.intel.com/en-us/forums/intel-c-compiler
The performance on this kernel is limited by a single core's memory bandwidth, as I explained in response to your post in the C Compiler Forum at https://software.intel.com/en-us/forums/intel-c-compiler/topic/726771#comment-1902493
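A back-of-the-envelope check of the bandwidth argument, using only the figures quoted in the first post: each loop iteration performs 2 floating-point operations and reads 16 bytes (one double from A and one from B), so the implied read bandwidth is 8 bytes per flop. The sketch below assumes both arrays stream from memory with no cache reuse, which seems reasonable for two 80 MB arrays; the function names are hypothetical.

```c
/* Bytes read per floating-point operation in the dot-product loop:
 * two 8-byte loads for every multiply-add pair (2 flops). */
#define BYTES_PER_FLOP (16.0 / 2.0)

/* Implied read bandwidth in GB/s for a measured GFLOPS figure. */
double implied_bandwidth_gbs(double gflops)
{
    return gflops * BYTES_PER_FLOP;
}

/* Observed speedup of the vectorized loop over the scalar one. */
double observed_speedup(double gflops_vector, double gflops_scalar)
{
    return gflops_vector / gflops_scalar;
}
```

With the reported numbers, implied_bandwidth_gbs(1.495788) is about 11.97 GB/s and implied_bandwidth_gbs(0.839571) is about 6.72 GB/s, for an observed speedup of about 1.78. If a single core cannot pull much more than the first figure from memory, faster arithmetic in the vector loop cannot raise throughput further, regardless of the compiler's estimated speedup.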