Hello All,
Hope I am asking in the right forum!!
I have a simple/naive question. I made a simple program to run on one thread of KNL (68 cores, Flat-Quadrant mode, MCDRAM used). I ran the code twice with the following configurations:
1) #pragma simd reduction(...) at the top of the loop and the compiler option -xMIC-AVX512.
2) #pragma novector, with -xMIC-AVX512 removed and -no-simd added. The loop is not vectorized and no AVX instructions are used (I checked the assembly file).
The first configuration reaches 1.5 GFLOPS and the second 0.8 GFLOPS, so the speedup is only about 2X. Can anyone please explain why I don't get a better speedup (closer to 8X)?
long count = 10000000;

/* Same loop run once beforehand as a cold start */
stime = dsecnd();
//1) #pragma simd reduction(+:result)
//2) #pragma novector
for (long i = 0; i < count; i++) {
    result += A[i] * B[i];
}
etime = dsecnd();

double bestExTime = (etime - stime);
double gflops = (1.e-9 * 2.0 * count) / bestExTime;
printf("%f,%f\n", result, gflops);
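For reference, the two configurations can be sketched as below; the VECTORIZED macro and the compile lines in the comments are only my paraphrase of the builds described above, not the exact command lines.

/* Variant 1 (vectorized):  icc -O3 -xMIC-AVX512 -DVECTORIZED ...   */
/* Variant 2 (scalar):      icc -O3 -no-simd ...                    */
#ifdef VECTORIZED
#pragma simd reduction(+:result)
#else
#pragma novector
#endif
for (long i = 0; i < count; i++) {
    result += A[i] * B[i];
}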
Thanks,
result must be initialized prior to the reduction loop, and you must take care that the arrays are set to reasonable values which don't raise exceptions.
You would also need to report how it scales with the number of cores, preferably by setting hw_subset (KMP_HW_SUBSET) and with the fast-memory (MCDRAM) setting.
Due to these platform-dependent issues, the MIC forum may be more useful.
Hi Tim,
Thanks for the reply. I'm sorry, I left out the initialization code for brevity. For the time being, my focus is to compare a simple vectorized loop against a scalar loop.
This is my complete code:
long count = 10000000;
double *A = (double*)_mm_malloc(count * sizeof(double), 64);
double *B = (double*)_mm_malloc(count * sizeof(double), 64);
A[0:count] = 0.001 * (rand() % 1000);   /* Cilk Plus array notation */
B[0:count] = 0.001 * (rand() % 1000);

double stime = 0.0, etime = 0.0;
double result = 0.0;

/* Cold start: same loop run once beforehand */

stime = dsecnd();
//#pragma novector
#pragma simd reduction(+:result)
for (long i = 0; i < count; i++) {
    result += A[i] * B[i];
}
etime = dsecnd();

double bestExTime = (etime - stime);
double gflops = (1.e-9 * 2.0 * count) / bestExTime;
printf("%f,%ld,%f\n", result, count, gflops);

/* free memory */
_mm_free(A);
_mm_free(B);
Thanks,
Your arrays are each 80 million Bytes, so they greatly exceed the size of the available cache. Performance will be limited by the sustainable memory bandwidth of a core, not by the computational rate. Your results are consistent with this, requiring 16 Bytes of memory reads for each 2 FP operations, so 1.5 GFLOPS = 12 GB/s. This is consistent with other measurements of the maximum sustainable memory bandwidth of a single core on this processor.
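Just to spell out that arithmetic, here is a back-of-the-envelope check using only the numbers above (nothing measured):

#include <stdio.h>

/* Each loop iteration reads one double from A and one from B (16 Bytes)
   and performs one multiply plus one add (2 FLOPs). */
int main(void)
{
    double flops_per_iter  = 2.0;
    double bytes_per_iter  = 16.0;
    double measured_gflops = 1.5;   /* vectorized result reported above */

    double implied_GBs = measured_gflops * bytes_per_iter / flops_per_iter;
    printf("implied memory traffic: %.1f GB/s\n", implied_GBs);   /* ~12 GB/s */
    return 0;
}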
On some Intel processors there is little difference between vectorized and non-vectorized code in this case, but on KNL there is a modest preference for 512-bit vectors. Although the maximum rate for scalar FMAs is 1/cycle (=3 GFLOPS at 1.5 GHz), the extra instructions required for loading each of the elements separately and performing the FMAs separately are enough to interfere with the overlap of computation and data transfers. (It is also likely that the "novector" pragma will reduce the aggressiveness of the compiler in using multiple independent accumulators, which will lead to additional stall cycles that also interfere with the overlap of computation and data transfer.)
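To illustrate the point about independent accumulators, a hand-unrolled scalar version might look like the sketch below (my own illustration of the technique, not code from this thread); each partial sum forms an independent dependence chain, so the latency of one FMA chain can be hidden behind the others.

/* Illustrative only: scalar dot product with four independent accumulators. */
double dot_unrolled4(const double *A, const double *B, long count)
{
    double sum0 = 0.0, sum1 = 0.0, sum2 = 0.0, sum3 = 0.0;
    long i;
    for (i = 0; i + 3 < count; i += 4) {
        sum0 += A[i]     * B[i];
        sum1 += A[i + 1] * B[i + 1];
        sum2 += A[i + 2] * B[i + 2];
        sum3 += A[i + 3] * B[i + 3];
    }
    for (; i < count; i++)          /* remainder iterations */
        sum0 += A[i] * B[i];
    return sum0 + sum1 + sum2 + sum3;
}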
Thanks John, your detailed comment really helped. If my data size fits within the L1 cache, the vectorized code gets up to a 9X speedup.
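For anyone repeating the test, this is roughly how an L1-resident version can be timed. It is only a sketch: the count and reps values are illustrative, and A and B are assumed to be allocated with this smaller count as in the code above.

/* 2 arrays x 1024 doubles = 16 KB, well inside KNL's 32 KB L1 data cache. */
long count = 1024;
long reps  = 100000;
double result = 0.0;

stime = dsecnd();
for (long r = 0; r < reps; r++) {
    #pragma simd reduction(+:result)
    for (long i = 0; i < count; i++)
        result += A[i] * B[i];
}
etime = dsecnd();

double gflops = (1.e-9 * 2.0 * count * reps) / (etime - stime);
printf("%f,%f\n", result, gflops);   /* printing result keeps the work from being optimized away */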
Hi Sergey,
Actually, the purpose of this simple experiment is to measure the speedup from using the VPU of the Knights Landing processor on a single core.
Thanks,
Mohammad
Single-core bandwidth is not limited by the bandwidth side of the memory subsystem on recent (server class) processors. Instead it is limited by the amount of concurrency that a single core can generate.
For the Xeon Phi x200, the average latency to DDR memory is about 130 ns (http://sites.utexas.edu/jdm4372/2016/12/06/memory-latency-on-the-intel-xeon-phi-x200-knights-landing-processor/), so the observed maximum sustainable bandwidth for a single core (~12 GB/s) corresponds to 130ns*12GB/s = 1560 Bytes = ~24 cache lines "in flight" at any one time. This is more cache misses than the processor core can directly control, and is pretty close to the limit of the number of outstanding requests that the L2 can handle.
Although I don't know of any reference where the number of concurrent misses that the L2 can handle is described in detail, the "Intel Xeon Phi Processor Performance Monitoring Reference Manual, Volume 2: Events" (document 334480) says that the occupancy counter for the "Table of Requests" in the CHA ("Caching and Home Agent") can increment by a maximum of 28 per cycle, suggesting a limit of 28 concurrent transactions.
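For what it's worth, the same Little's Law arithmetic can be restated as code; the numbers simply echo the ones above.

#include <stdio.h>

/* Little's Law: concurrency = latency x bandwidth. */
int main(void)
{
    double latency_s  = 130e-9;   /* ~130 ns average DDR4 latency on Xeon Phi x200 */
    double bw_Bps     = 12e9;     /* ~12 GB/s observed single-core bandwidth       */
    double line_bytes = 64.0;

    double lines_in_flight = latency_s * bw_Bps / line_bytes;        /* ~24 cache lines             */
    double ceiling_GBs     = 28.0 * line_bytes / latency_s * 1e-9;   /* ~13.8 GB/s if 28 is the cap */

    printf("cache lines in flight: %.1f\n", lines_in_flight);
    printf("bandwidth ceiling at 28 outstanding lines: %.1f GB/s\n", ceiling_GBs);
    return 0;
}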
