
Optimizing Intel Performance

RKraw
Beginner
Hello,

I am writing because for some time I have been working with an Intel Xeon E3 processor as well as an Intel Xeon Phi (MIC) card. I have focused on the following books:

Structured Parallel Programming: Patterns for Efficient Computation
Intel Xeon Phi Coprocessor High Performance Programming

However, when comparing the efficiency of the Xeon Phi and the Xeon (using OpenMP, TBB and Intel Cilk Plus), I noticed either a slowdown or only a slight improvement of the Phi over the Xeon. Using examples from Structured Parallel Programming, I noticed, for instance, that the SAXPY operation was faster on the Xeon when using three float vectors of size 90 MB. I am quite aware that the largest speedup is achieved when data resides in cache and is reused as long as possible (just like in the stencil example given in the latter book, where the memory access pattern yielded a 60x speedup; the helloflops example, on the contrary, had comparable execution times on the Xeon and the Xeon Phi). However, I am looking for other clues or suggestions on what to focus on in order to achieve the highest performance boost.

It turns out that embarrassingly parallel algorithms such as SAXPY yield either no speedup or a slowdown, even with vectorization enabled. Is this a consequence of the architecture, or is there a method to achieve a high speedup? How can I make TBB and Cilk Plus parallel fors fit into the cache lines of the Xeon Phi, to avoid many cores using the same line and thereby limiting performance (see the sketch below)? I have seen the tutorials in the Xeon Phi developer zone, but I do not recall such clues being given there. I also attempted to use Intel Advisor, but although it calculates an approximate execution time, it does not give clues on how to improve an algorithm on the Xeon Phi (it focuses mainly on vectorization).

Also interesting: when scaling the problem (i.e. setting OMP_NUM_THREADS), the highest performance is achieved not at 56 threads or a multiple thereof, but somewhere between 16 and 32 threads, when not all cores are even used for computation (I used only native mode and timed only the computation, not the data transfer).

If you have any clues, I am happy to hear them, as so far I have achieved only insignificant speedups or slowdowns on the Xeon Phi.

Best regards
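For illustration, a minimal Cilk Plus version of the kind of loop being compared might look like the sketch below. The function name and the grainsize value are illustrative only, not tuned settings; the point is that an explicit grainsize keeps each worker on a contiguous chunk whose size is a multiple of the 64-byte cache line, so adjacent workers do not write into the same line of y:

#include <cilk/cilk.h>

/* SAXPY with an explicit grain size: each worker receives a contiguous
   chunk whose element count is a multiple of 16 floats (one 64-byte
   cache line), so no two cores share a line of y. */
void saxpy_cilk(int n, float a, const float *x, float *y)
{
    #pragma cilk grainsize = 4096   /* illustrative: 4096 floats = 256 cache lines per chunk */
    cilk_for (int ii = 0; ii < n; ++ii)
        y[ii] = a * x[ii] + y[ii];
}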
TimP
Honored Contributor III

In those references you should have noticed the emphasis on combining vectorization and threaded parallelism to achieve the expected performance. I can't guess what you mean by "enabling vectorization", as it is on by default when compiling for Intel(R) Xeon Phi(TM). Either MKL or a compiled-source SAXPY should demonstrate full single-thread vector performance (a minimal MKL call is sketched below). You can compare non-vector performance by compiling the source code with -no-vec.
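For instance, the MKL baseline could be as simple as the following sketch (cblas_saxpy is the standard CBLAS entry point that MKL exposes via mkl.h; the wrapper name is only illustrative):

#include <mkl.h>

/* Use MKL's tuned SAXPY as the vectorized single-thread baseline:
   computes y = a*x + y over n contiguous elements. */
void saxpy_mkl(MKL_INT n, float a, const float *x, float *y)
{
    cblas_saxpy(n, a, x, 1, y, 1);
}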

You should also have seen the discussion of options for setting thread affinity, including KMP_PLACE_THREADS, and how those enable scaling to a larger number of threads (at least with static scheduling) than cilk_for; example settings are sketched below.
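For example, a native-mode run pinning a fixed number of hardware threads per core might be configured along these lines (the counts and the binary name are placeholders to experiment with, not recommended settings):

export KMP_PLACE_THREADS=57c,2t   # 57 cores, 2 hardware threads per core
export KMP_AFFINITY=compact       # or scatter
export OMP_NUM_THREADS=114        # 57 cores x 2 threads
./saxpy.mic                       # hypothetical name of the native binary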

 

RKraw
Beginner

Hello,

Thank you for your quick response. By enabling vectorization I meant both using the -O3 compilation option in icc and/or using Intel's annotations for compiler heuristics. Regarding OpenMP and configuring its variables, I am now investigating the potential speedup on the Xeon Phi of a trivial, embarrassingly parallel SAXPY operation in OpenMP. To illustrate, here is a rather trivial example:

#define SIZE_ARRAY (1024*1024*512)

int n = SIZE_ARRAY;
float a = 2.0f;

/* 64-byte alignment so each array starts on a cache-line boundary;
   initialization is omitted here. */
float *restrict x = (float*)_mm_malloc(sizeof(float)*SIZE_ARRAY, 64);
float *restrict y = (float*)_mm_malloc(sizeof(float)*SIZE_ARRAY, 64);

#pragma omp parallel for
for (int ii = 0; ii < n; ++ii)
    {
        y[ii] = a * x[ii] + y[ii];
    }

Now, depending on the number of elements n, in comparison with the Intel Xeon processor I either see a slowdown on the Intel Xeon Phi (which occurs when processing with approximately n < 256,000,000) or a small speedup (e.g. when operating on n above 512,000,000). I am testing the performance in native mode to assess the Xeon Phi's performance. Is there a method to achieve a further speedup? As I understand it, the compute-to-data-transfer ratio is low; however, I have seen an article in which a bandwidth of 180 GB/s was achieved (and, as I understand it, the core function was similar to what I am doing):

https://software.intel.com/en-us/articles/optimizing-memory-bandwidth-on-stream-triad

Namely, is there a method to boost the performance of such functions, or will such algorithms perform poorly on the Phi architecture due to the lack of data reuse in cache?
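For reference, measurements of this kind are typically taken with a harness along the following lines (a sketch: omp_get_wtime brackets only the kernel, and the 3*n*sizeof(float) byte count assumes one read of x, one read of y and one write of y per element):

#include <omp.h>
#include <stdio.h>

/* Time only the SAXPY kernel and report the effective bandwidth.
   Assumes x and y have already been allocated and initialized. */
double time_saxpy(int n, float a, const float *restrict x, float *restrict y)
{
    double t0 = omp_get_wtime();

    #pragma omp parallel for
    for (int ii = 0; ii < n; ++ii)
        y[ii] = a * x[ii] + y[ii];

    double t1 = omp_get_wtime();

    /* 12 bytes per element: load x[ii], load y[ii], store y[ii]. */
    double gb = 3.0 * n * sizeof(float) / 1e9;
    printf("%.4f s, %.1f GB/s\n", t1 - t0, gb / (t1 - t0));
    return t1 - t0;
}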

I have checked several variants of OMP_NUM_THREADS and KMP_AFFINITY, but did not achieve a significant speedup over the Xeon.

 

jimdempseyatthecove
Honored Contributor III

Rafal,

Presumably the x and y arrays from the _mm_malloc calls have been initialized before the timed loop.

You might experiment with adding simd to your omp pragmas:

#pragma omp parallel for simd

The particular loop you have shown has virtually no computational overhead. As such, it will be dependent more on memory bandwidth than computation.
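A back-of-the-envelope estimate makes the point: each SAXPY element moves 12 bytes (load x[ii], load y[ii], store y[ii]) for just 2 flops. Even at the roughly 180 GB/s STREAM-class bandwidth cited earlier in the thread, the kernel tops out near (180 / 12) * 2 = 30 GFLOP/s, a small fraction of the card's peak, so adding cores beyond the point where the memory system saturates buys nothing.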

Jim Dempsey

TimP
Honored Contributor III

As well as considering Jim's suggestion to ensure that you have both vectorization and threaded parallelism, note that the MIC compilation has software prefetch on by default, while the host has it off by default. Since you are effectively performing a memory bandwidth test, you will want to limit the host (and maybe also the MIC) test to at most one thread per core, with affinity settings that spread the threads across cores.

Your test doesn't look typical of applications that could benefit from cache re-use, but you may be getting some re-use of the L3 on the host at your smaller array sizes.

RKraw
Beginner

Thank you for your responses,

I am timing only the calculations, excluding any memory allocation/deallocation and I/O operations, so that was not the issue. Although I introduced #pragma omp ... simd and compiled the code with various optimization flags (-O2, -O3), I still get the same results, i.e. slower performance below 64 MB of data and a slight speedup above 64 MB. I was using export KMP_AFFINITY=scatter; however, I achieved the best performance at OMP_NUM_THREADS=16, that is, with severe core underutilization, with only 16 out of 57 cores doing actual work.

Even with prefetching disabled for the loop (the noprefetch pragma, shown commented out below, was also tried), no speedup on the Xeon Phi was achieved. The code in that case was:

//#pragma noprefetch   /* uncomment to disable software prefetching for this loop */
#pragma omp parallel for simd
for (int ii = 0; ii < n; ++ii)
    {
        y[ii] = a * x[ii] + y[ii];
    }

Curiously enough, even though the MIC has faster GDDR5 memory compared to the host's DDR3 and a higher memory throughput, this embarrassingly parallel (although memory-bound) problem is either slower or only slightly faster on the Xeon Phi than on the Xeon.

Best regards

jimdempseyatthecove
Honored Contributor III

The code shown is not the actual test case: you do not show how the data was initialized prior to the timed section illustrated above. Was it

a) uninitialized after allocation, with the OpenMP thread pool not yet established?
b) initialized after allocation, with the OpenMP thread pool not yet established?
c) initialized after allocation, with the OpenMP thread pool established using a different thread pool size from the one under timed test?
d) initialized after allocation, with the OpenMP thread pool established using the same thread pool size as the one under timed test?
e) some other permutation...

What happens to memory and thread teaming, and more importantly how it happens, affects performance greatly. You, as the programmer, assume the responsibility of using your skill to schedule (coordinate) the threads with the work; scenario (d) is sketched below.
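For concreteness, scenario (d), generally the fairest setup for a bandwidth test, might look like the following sketch (the fill values and the schedule(static) choice are illustrative; the point is that the same team that runs the timed loop also touches the pages first):

#include <omp.h>

/* Scenario (d): initialize with the same thread team and schedule that
   the timed loop will use, so the OpenMP pool is already established
   and all pages of x and y are faulted in before timing begins. */
void init_first_touch(int n, float *restrict x, float *restrict y)
{
    #pragma omp parallel for schedule(static)
    for (int ii = 0; ii < n; ++ii)
    {
        x[ii] = 1.0f;   /* illustrative fill values */
        y[ii] = 2.0f;
    }
}

/* The timed SAXPY should then run with the same OMP_NUM_THREADS and
   the same schedule(static), so the chunk-to-thread mapping matches. */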

Jim Dempsey
