In those references you should have noticed the emphasis on combining vectorization and threaded parallelism to reach the expected performance. I can't guess what you mean by "enabling vectorization," as it is on by default when compiling for Intel(R) Xeon Phi(TM). Either MKL or a compiled-from-source saxpy should demonstrate full single-thread vector performance; you can compare against non-vector performance by compiling the source with -no-vec.
You should also have seen the discussion of options for setting thread affinity, including KMP_PLACE_THREADS, and how those enable scaling to a larger number of threads (at least with static scheduling) than cilk_for.
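For instance, a minimal single-thread kernel along the lines below (the file name, array sizes, and timing scheme are only placeholders, not your code) can be built once normally and once with -no-vec to see the single-core vector speedup directly:
/* saxpy1.c - single-thread saxpy for comparing vectorized vs. -no-vec builds.
   Build natively for the coprocessor with something like:
     icc -std=c99 -O3 -openmp -mmic saxpy1.c -o saxpy_vec
     icc -std=c99 -O3 -openmp -mmic -no-vec saxpy1.c -o saxpy_novec
   (a vectorization report, e.g. -vec-report2, shows which loops were vectorized) */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>                    /* used only for omp_get_wtime() */

int main(void)
{
    int n = 1024 * 1024 * 64;
    float a = 2.0f;
    float *x = malloc(sizeof(float) * n);
    float *y = malloc(sizeof(float) * n);
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    double t0 = omp_get_wtime();
    for (int i = 0; i < n; ++i)     /* the loop the compiler should vectorize */
        y[i] = a * x[i] + y[i];
    double t1 = omp_get_wtime();

    printf("single-thread saxpy: %.3f s (y[0] = %g)\n", t1 - t0, y[0]);
    free(x);
    free(y);
    return 0;
}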
Hello,
Thank you for your quick response. By enabling vectorization I meant both using the -O3 compilation option in icc and/or using Intel's annotations for compiler heuristics. Regarding OpenMP and the configuration of its variables, I am now investigating the potential speedup on Xeon Phi of a trivial, embarrassingly parallel saxpy operation in OpenMP. To illustrate, here is the example:
#define SIZE_ARRAY 1024*1024*512
int n = SIZE_ARRAY;
float *restrict a_1 = (float*)_mm_malloc(sizeof(float)*SIZE_ARRAY, 64);
float *restrict a_2 = (float*)_mm_malloc(sizeof(float)*SIZE_ARRAY, 64);
#pragma omp parallel for
for (int ii = 0; ii < n; ++ii)
{
    y[ii] = a * x[ii] + y[ii];
}
Now, depending on the number of elements n, in comparison with an Intel Xeon processor I see either a slowdown on Intel Xeon Phi (which occurs with approximately n < 256000000) or a small speedup (e.g. when operating on n above 512000000). I am testing in native mode to assess Xeon Phi performance. Is there a method to achieve a further speedup? As I understand it, the compute-to-data-transfer ratio is low; however, I have seen an article in which a bandwidth of 180 GB/s was achieved (and as I understand it, the core function was similar to what I am doing):
https://software.intel.com/en-us/articles/optimizing-memory-bandwidth-on-stream-triad
Namely, is there a method to boost the performance of such functions, or will such algorithms perform poorly on the Phi architecture due to the lack of data reuse in cache?
I have checked several variants of OMP_NUM_THREADS and KMP_AFFINITY, but did not achieve a significant speedup over Xeon.
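For reference, what I time is essentially of this form (a simplified sketch, not my exact harness; the bandwidth figure assumes 12 bytes moved per element, i.e. read x, read y, write y):
#include <stdio.h>
#include <omp.h>

/* Times one pass over the arrays and reports the effective bandwidth.
   Assumes x and y have already been allocated and initialized. */
double saxpy_bw(int n, float a, const float *x, float *y)
{
    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (int ii = 0; ii < n; ++ii)
        y[ii] = a * x[ii] + y[ii];
    double t1 = omp_get_wtime();

    double gb = 3.0 * (double)n * sizeof(float) / 1e9;   /* GB moved per pass */
    printf("n = %d: %.3f s, %.1f GB/s\n", n, t1 - t0, gb / (t1 - t0));
    return gb / (t1 - t0);
}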
Rafal,
Presumably you meant to _mm_malloc the x and y arrays, and that the arrays have been initialized.
You might experiment with adding simd to your omp pragma:
#pragma omp parallel for simd
The particular loop you have shown does virtually no computational work per element (two flops for 12 bytes moved), so it will be limited more by memory bandwidth than by computation.
Jim Dempsey
In addition to Jim's suggestion to make sure that you have both vectorization and threaded parallelism, consider that the MIC compilation has software prefetch on by default, while the host has it off by default. As you are performing a memory bandwidth test, you will want to limit the host (and maybe also the MIC) test to at most one thread per core, with affinity settings that spread the threads across cores.
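For the one-thread-per-core placement, something along these lines is a reasonable starting point (assuming, say, a 57-core coprocessor; the exact variable names and syntax depend on the OpenMP runtime version):
export OMP_NUM_THREADS=57
export KMP_AFFINITY=scatter
Alternatively, KMP_PLACE_THREADS=57c,1t (where supported) restricts the run to one thread per core. On the host, use one thread per physical core with KMP_AFFINITY=scatter as well.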
Your test doesn't look typical of applications which could benefit from cache re-use, but you may be getting some re-use of L3 on the host at your smaller array sizes.
Thank you for your responses,
I am timing only the computation, excluding any memory allocation/deallocation and I/O, so that was not the issue. Although I introduced both #pragma omp ... simd and compiled the code with various optimization flags (-O2, -O3), I still get the same results, i.e. slower performance below 64 MB of data and a slight speedup above 64 MB. I was using export KMP_AFFINITY=scatter; however, I achieved the best performance at OMP_NUM_THREADS=16, that is, with severe core underutilization, with only 16 out of 57 cores doing actual work.
Although I also tried #pragma noprefetch on the inner loop, still no speedup on Xeon Phi was achieved. The code in that case was:
#pragma omp parallel for simd
//#pragma noprefetch
for (ii = 0; ii < n; ++ii)
{
    y[ii] = a * x[ii] + y[ii];
}
Curiously enough, despite the faster GDDR5 memory (compared to DDR3) and higher memory throughput of the MIC, an embarrassingly parallel (although memory-bound) problem is either slower or only slightly faster on Xeon Phi compared to Xeon.
Best regards
The code shown is not the actual test case. You do not show how the data was initialized prior to the timed section of code illustrated above. Was it
a) uninitialized after allocation, with the OpenMP thread pool not yet established?
b) initialized after allocation, with the OpenMP thread pool not yet established?
c) initialized after allocation, with the OpenMP thread pool initialized using a different thread pool size than that used in the timed test?
d) initialized after allocation, with the OpenMP thread pool initialized using the same thread pool size as that used in the timed test?
e) other permutations...
What happens to memory and thread teaming, and more importantly how it happens, affects performance greatly. You, as the programmer, assume the responsibility of using your skill to schedule (coordinate) the threads with the work.
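For illustration, a minimal sketch of option (d): establish the thread pool and touch the pages with the same team before the timed region (array names follow your snippet; sizes, timing, and the aligned-allocation header are my assumptions):
#include <stdio.h>
#include <immintrin.h>          /* _mm_malloc / _mm_free with the Intel compiler */
#include <omp.h>

int main(void)
{
    int n = 1024 * 1024 * 512;
    float a = 2.0f;
    float *x = (float *)_mm_malloc(sizeof(float) * n, 64);
    float *y = (float *)_mm_malloc(sizeof(float) * n, 64);

    /* Initialize in parallel with the same team size and static schedule as the
       timed loop: this establishes the OpenMP thread pool and faults in the pages
       before timing, so neither cost lands inside the measured region. */
    #pragma omp parallel for schedule(static)
    for (int ii = 0; ii < n; ++ii) { x[ii] = 1.0f; y[ii] = 2.0f; }

    double t0 = omp_get_wtime();
    #pragma omp parallel for simd schedule(static)
    for (int ii = 0; ii < n; ++ii)
        y[ii] = a * x[ii] + y[ii];
    double t1 = omp_get_wtime();

    printf("saxpy: %.3f s, %.1f GB/s\n", t1 - t0,
           3.0 * (double)n * sizeof(float) / (t1 - t0) / 1e9);

    _mm_free(x);
    _mm_free(y);
    return 0;
}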
Jim Dempsey
