Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

Testing SIMD on KNL

Mohammad_A_
Beginner
842 Views

Hello All,

I hope I am asking in the right forum!

I have a simple/naive question. I made a simple program to run on one thread of a KNL (68 cores, Flat-Quadrant mode, MCDRAM used). I ran my code twice with the following configurations:

1) #pragma simd reduction(...) at the top of the loop and the compiler option -xMIC-AVX512.

2) #pragma novector at the top of the loop, with -xMIC-AVX512 removed and -no-simd added. The loop is not vectorized and no AVX instructions are used (I checked the assembly file).

The first configuration achieves 1.5 GFLOPS and the second 0.8 GFLOPS, so the speedup is only about 2x. Can anyone please explain why I don't get a speedup closer to 8x?

long count = 10000000;
// The same loop is run once beforehand as a cold start
stime = dsecnd();
// 1) #pragma simd reduction(+:result)
// 2) #pragma novector
for (long i = 0; i < count; i++)
{
    result += A[i] * B[i];
}

etime = dsecnd();

double bestExTime = (etime - stime);
double gflops = (1.e-9 * 2.0 * count) / bestExTime;
printf("%f,%f\n", result, gflops);

 Thanks,

8 Replies
TimP
Honored Contributor III

result must be initialized prior to the reduction loop.  You must also take care that the arrays are set to reasonable values which don't raise floating-point exceptions.

You would also need to report how it scales with the number of cores, preferably by setting KMP_HW_SUBSET and with the fast (MCDRAM) memory setting.
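As a minimal sketch of what such a scaling run could look like (assuming OpenMP; the thread pinning and MCDRAM binding shown in the comments are illustrative choices, not something prescribed in this thread):

    /* Hypothetical scaling test (a sketch, not from this thread).
       Build:  icpc -qopenmp -xMIC-AVX512 dot.cpp
       Run e.g.:
         KMP_HW_SUBSET=4c,1t KMP_AFFINITY=scatter numactl --membind=1 ./a.out
       (in flat mode the MCDRAM is typically exposed as a separate NUMA node) */
    #include <omp.h>
    #include <stdio.h>

    int main()
    {
        const long count = 10000000;
        double *A = new double[count];
        double *B = new double[count];
        for (long i = 0; i < count; i++) { A[i] = 0.001; B[i] = 0.002; }  /* safe values */

        double result = 0.0;                 /* reduction variable initialized */
        double t0 = omp_get_wtime();
        #pragma omp parallel for simd reduction(+:result)
        for (long i = 0; i < count; i++)
            result += A[i] * B[i];
        double t1 = omp_get_wtime();

        printf("threads=%d  result=%f  GFLOPS=%f\n",
               omp_get_max_threads(), result, 1.e-9 * 2.0 * count / (t1 - t0));
        delete[] A;
        delete[] B;
        return 0;
    }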

Due to these platform-dependent issues, the MIC forum may be more useful.

Mohammad_A_
Beginner

Hi Tim,

Thanks for the reply. I'm sorry, I omitted the initialization code for brevity. For the time being, my focus is to compare a simple vectorized loop against a scalar loop.

This is my complete code:

/* Needs <stdio.h>, <stdlib.h>, mkl.h (for dsecnd) and immintrin.h (for _mm_malloc/_mm_free) */

long count = 10000000;
double *A = (double*)_mm_malloc(count * sizeof(double), 64),
       *B = (double*)_mm_malloc(count * sizeof(double), 64);

A[0:count] = 0.001 * (rand() % 1000);   /* Cilk Plus array notation */
B[0:count] = 0.001 * (rand() % 1000);

double stime = 0.0, etime = 0.0;
double result = 0.0;

/* Cold start: the same loop is run once here before timing */

stime = dsecnd();

//#pragma novector
#pragma simd reduction(+:result)
for (long i = 0; i < count; i++)
{
    result += A[i] * B[i];
}

etime = dsecnd();

double bestExTime = (etime - stime);
double gflops = (1.e-9 * 2.0 * count) / bestExTime;
printf("%f,%ld,%f\n", result, count, gflops);

memory_free:
_mm_free(A);
_mm_free(B);
 

Thanks,

McCalpinJohn
Honored Contributor III

Your arrays are each 80 million bytes (10,000,000 doubles), so they greatly exceed the size of the available cache. Performance will be limited by the sustainable memory bandwidth of a core, not by the computational rate. Your results are consistent with this: the loop requires 16 bytes of memory reads for every 2 FP operations, so 1.5 GFLOPS corresponds to 12 GB/s. This matches other measurements of the maximum sustainable memory bandwidth of a single core on this processor.
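Spelling out that arithmetic (using only the numbers already quoted in this thread):

    #include <stdio.h>

    int main()
    {
        /* Each iteration reads one double from A and one from B (16 bytes)
           and performs one multiply plus one add (2 FP operations). */
        const double bytes_per_iter  = 16.0;
        const double flops_per_iter  = 2.0;
        const double measured_gflops = 1.5;   /* vectorized run reported above */

        /* implied memory traffic = GFLOPS * (bytes per flop) */
        double implied_gbs = measured_gflops * (bytes_per_iter / flops_per_iter);
        printf("implied bandwidth = %.1f GB/s\n", implied_gbs);   /* 12.0 GB/s */
        return 0;
    }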

On some Intel processors there is little difference between vectorized and non-vectorized code in this case, but on KNL there is a modest preference for 512-bit vectors. Although the maximum rate for scalar FMAs is 1/cycle (= 3 GFLOPS at 1.5 GHz), the extra instructions required for loading each of the elements separately and performing the FMAs separately are enough to interfere with the overlap of computation and data transfer. (It is also likely that the "novector" pragma reduces the aggressiveness of the compiler in using multiple independent accumulators, which leads to additional stall cycles that also interfere with that overlap.)
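To illustrate the point about independent accumulators, this is roughly the shape of the transformation the compiler applies when it is allowed to reorder the reduction (a hand-written sketch of the general technique, not the actual generated code):

    /* Dot-product reduction with four independent partial sums, so consecutive
       adds do not all serialize on one accumulator register. Illustrative only. */
    double dot_unrolled(const double *A, const double *B, long count)
    {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        long i = 0;
        for (; i + 3 < count; i += 4) {
            s0 += A[i]     * B[i];
            s1 += A[i + 1] * B[i + 1];
            s2 += A[i + 2] * B[i + 2];
            s3 += A[i + 3] * B[i + 3];
        }
        for (; i < count; i++)        /* remainder loop */
            s0 += A[i] * B[i];
        return (s0 + s1) + (s2 + s3);
    }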

SergeyKostrov
Valued Contributor II
Repeat your tests with:

    ...
    #pragma simd num_threads( 68 )
    for ( long i = 0; i < count; i++ )
    {
        result += A[i] * B[i];
    }
    ...

and set the KMP_AFFINITY environment variable to scatter.
Mohammad_A_
Beginner

 

Thanks John, your detailed comment really helped. If my data size fits within the L1 cache, the vectorized code can get up to a 9x speedup.

Mohammad_A_
Beginner

Hi Sergey,

Actually, the purpose of this simple experiment is to measure the speedup from using the VPU of the Knights Landing processor on a single core.

 

Thanks,

Mohammad

SergeyKostrov
Valued Contributor II
>>...Your results are consistent with this, requiring 16 Bytes of memory reads for each 2 FP operations, so 1.5 GFLOPS = 12 GB/s...

It is still well below the official bandwidth values for a 72-core KNL system:

Code name: Knights Landing (KNL)
Process technology: 14 nm
Number of cores: 72 Atom out-of-order cores (arranged in 36 tiles and connected in a 2D mesh architecture)
Hardware threads per core: 4
On-package memory: high-bandwidth MCDRAM (up to 16 GB / bandwidth > 400 GB/s)
Regular memory: DDR4 (up to 384 GB / bandwidth > 80 GB/s)
Memory channels: 6
Instruction set architecture: Intel AVX-512 (vector length 512-bit)

You could also verify your results with VTune.
McCalpinJohn
Honored Contributor III

Single-core bandwidth is not limited by the bandwidth side of the memory subsystem on recent (server class) processors.  Instead it is limited by the amount of concurrency that a single core can generate. 

For the Xeon Phi x200, the average latency to DDR memory is about 130 ns (http://sites.utexas.edu/jdm4372/2016/12/06/memory-latency-on-the-intel-xeon-phi-x200-knights-landing-processor/), so the observed maximum sustainable bandwidth for a single core (~12 GB/s) corresponds to 130ns*12GB/s = 1560 Bytes = ~24 cache lines "in flight" at any one time.  This is more cache misses than the processor core can directly control, and is pretty close to the limit of the number of outstanding requests that the L2 can handle. 
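That estimate is just Little's Law (outstanding data = latency x bandwidth); the arithmetic with the figures quoted above:

    #include <stdio.h>

    int main()
    {
        const double latency_s  = 130e-9;   /* ~130 ns average DDR latency on KNL */
        const double bandwidth  = 12e9;     /* ~12 GB/s sustained by one core */
        const double line_bytes = 64.0;     /* cache-line size */

        double bytes_in_flight = latency_s * bandwidth;          /* ~1560 bytes */
        double lines_in_flight = bytes_in_flight / line_bytes;   /* ~24 cache lines */
        printf("%.0f bytes in flight = %.1f cache lines\n",
               bytes_in_flight, lines_in_flight);
        return 0;
    }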

Although I don't know of any reference where the number of concurrent misses that the L2 can handle is described in detail, the "Intel Xeon Phi Processor Performance Monitoring Reference Manual, Volume 2: Events" (document 334480) says that the occupancy counter for the "Table of Requests" in the CHA ("Caching and Home Agent") can increment by a maximum of 28 per cycle, suggesting a limit of 28 concurrent transactions.
