FAQs: Compilers, Libraries, Performance, Profiling, and Optimization

Sumedh_N_Intel · ‎01-21-2013

In the period prior to the launch of the Intel® Xeon Phi™ coprocessor, Intel collected questions from developers who had been involved in pilot testing. This document contains some of the most common questions asked. Additional information and Best-Known-Methods for the Intel Xeon Phi coprocessor can be found here.

The Intel® Compiler reference guides can be found at:

C/C++: http://software.intel.com/sites/products/documentation/doclib/stdxe/2013/composerxe/compiler/cpp-lin/index.htm

Fortran: http://software.intel.com/sites/products/documentation/doclib/stdxe/2013/composerxe/compiler/fortran-lin/index.htm

Addendum:

http://software.intel.com/sites/default/files/article/327178/intelmpi4.1-releasenotes-linux-addendum-for-mic.pdf

The Intel® Math Kernel Library (Intel® MKL) reference guide can be found at:

http://software.intel.com/sites/products/documentation/doclib/mkl_sa/11/mkl_userguide_lnx/index.htm

______________________________________________________________________________________________________

Q) What do I need to run offload code on the Intel Xeon Phi coprocessor?

With Intel® Manycore Platform Software Stack (Intel® MPSS) 2.0, offload code is built as a single “fat” binary that contains everything needed for executing on both the host and the coprocessor.

You can check the shared library dependencies for this binary using the following command:

~# /usr/linux-k1om-4.7/bin/x86_64-k1om-linux-readelf -d ./a.out | grep NEEDED

0x0000000000000001 (NEEDED) Shared library: [libm.so.6]
0x0000000000000001 (NEEDED) Shared library: [libiomp5.so]
0x0000000000000001 (NEEDED) Shared library: [liboffload.so.5]
0x0000000000000001 (NEEDED) Shared library: [libstdc++.so.6]
0x0000000000000001 (NEEDED) Shared library: [libgcc_s.so.1]
0x0000000000000001 (NEEDED) Shared library: [libpthread.so.0]
0x0000000000000001 (NEEDED) Shared library: [libc.so.6]
0x0000000000000001 (NEEDED) Shared library: [libdl.so.2]

Note that the offload compiler automatically loads all these shared library dependencies to the coprocessor. If the compiler is unable to find any of the required libraries, an appropriate warning or suggestion is displayed.
______________________________________________________________________________________________________

Q) Why does the into modifier not work correctly with an offload, or result in an error?

The into modifier enables you to transfer data from a variable on the host to another variable located on the coprocessor, and vice versa. When you use into with the in clause, data is copied from the CPU object to the coprocessor object. The alloc_if, free_if, and alloc modifiers apply to the into expression.

Similarly, when you use into with the out clause, data is copied from the coprocessor object to the CPU object. The alloc_if, free_if, and alloc modifiers apply to the out expression. However, the following conditions must be fulfilled for the into modifier to work correctly with an offload:

The into modifier is not allowed with inout and nocopy clauses.
An overlap between the source and destination memory ranges leads to undefined behavior.
Shape change is not allowed, e.g. transferring from a 1D array to a 2D array.

More information can be found in the compiler reference at:
Key Features > Intel® Many Integrated Core Architecture (Intel® MIC Architecture) > Programming for Intel® MIC Architecture > Offload using a pragma > Moving Data from One Variable to Another.
______________________________________________________________________________________________________

Q) Why does my nocopy modifier not work correctly? Why does it generate a compiler or runtime error?

The operation of the nocopy clause is dependent on a number of factors. The following conditions must be met to ensure the correct operation of the nocopy modifier:

The coprocessor number must be set when using nocopy. By default, the offloads to the coprocessors happen in a round robin fashion, and hence, it is essential to let the compiler know which coprocessor to use for the offloads. For example,

#pragma offload target(mic:0) nocopy(a:length(10) alloc_if(0) free_if(0))

All dynamically allocated variables that need to be moved to the coprocessor should be global and declared with the attribute __attribute__((target(mic))).
Ensure that the memory used in the nocopy has already been allocated and is kept persistent by using the alloc_if and free_if modifiers.
An alternative to nocopy is an in or out clause with length set to 0:

#pragma offload target(mic:0) in(a:length(0) alloc_if(0) free_if(0))

For more information on nocopy, please refer to the following web page:

http://software.intel.com/en-us/articles/effective-use-of-the-intel-compilers-offload-features
______________________________________________________________________________________________________

Q) Between offloads, how do statically allocated, stack-allocated, and heap-allocated data persist?

Refer to the following link for more details on persistence of data across offloads:

http://software.intel.com/en-us/articles/effective-use-of-the-intel-compilers-offload-features
______________________________________________________________________________________________________

Q) What is the default directory on the coprocessor to which files are written?

For offload code, or native code executed using the micnativeloadex utility:
If a directory for the file I/O is not specified, then the file is written to /tmp/coi_procs/<card #>/<PID>.
So if I am offloading to card #1, and the offload is handled by process PID#2929, then the default directory on the Intel Xeon Phi coprocessor is /tmp/coi_procs/1/2929.
Note that the numbering of the Intel Xeon Phi coprocessors on the system starts at 1. For example, the first coprocessor, located at 192.168.1.100, is coprocessor #1. This is different than in the pragma offload target specification (target(mic:0)).

For native code executed after being copied to the coprocessor using scp:
If the directory is not specified, the file is created in the user's home directory, which is generally “/home/userid” for a non-root user. If the user is logged in as root (via sudo or otherwise), the file is created in "/root", which is the root home directory.
______________________________________________________________________________________________________

Q) What happens if both host and coprocessor write to / read from the same file?

The behavior is the same as for any such NFS conflict.
_____________________________________________________________________________________________________

Q) When I see an array of length(0) in an offload pragma, what does it mean?

Pointers used within offload regions are by default inout, that is, data associated with them is transferred in and out. Sometimes data may be used strictly locally; it is assigned and used on the coprocessor only. The nocopy clause is useful in this case to leave the data unmodified by the offload clauses, and allow the programmer to explicitly manage its contents. In other cases, data is transferred into the location from the CPU, and a subsequent offload may want to either

a) use the same memory allocated and transfer fresh data into it, or

b) keep the same memory and reuse the same data.

For case a), an in clause with length equal to the number of elements is useful. For case b), an in clause with a length of 0 can be used to “refresh” the pointer but avoid any data transfer.

The following table gives a complete description of how to use in/out/nocopy with the length clause:

_____________________________________________________________________________________________________

Q) What is processor affinity and how do we set it?

Here is a small excerpt from a white paper that will help answer your questions. More information about setting the thread affinity through the OpenMP runtime can be found in section 7 (below).

1 Why Worry ‘Bout a Thing?
On a single-die Intel® Xeon® processor based system, pinning threads to cores is often only a minor optimization, since the shared L3 cache provides fast inter-thread communication between all the threads. However, on the Intel Xeon Phi coprocessor there is no shared L3 cache, and it is therefore more important to ensure that threads stay near the caches that contain the data they have touched and aren’t moved around by the OS. The way to achieve that is to force thread affinity.

2 Linux* Affinity Calls
Inside the kernel Linux maintains a cpu_set_t for each thread. This is a set of integers (implemented as a bitset) that contains the logical CPUs on which the thread can be run. When a thread is created (as a result of a fork() or pthread_create() call) it inherits its affinity from its parent (the thread that made the creation call). The logical CPU numbers used here align with those used by the kernel elsewhere, for instance in /proc/cpuinfo.

Threads can change their affinity by using the sched_setaffinity() call, and discover their existing affinity using sched_getaffinity() (documented in the same place). sched_setaffinity() lets you force the affinity to any value you choose; by using it you can escape the affinity that you started with, allowing your thread to run on parts of the machine that its parent could not run on, though doing that is rather bad manners unless you really know everything that is running on the machine. (Hint, you probably don’t know that even if you think you do!)

3 Mapping Hardware to Logical CPUs
Since the affinity calls all deal with logical CPUs, if we’re to get the correct affinity for our threads we need to understand how the kernel’s logical CPU enumeration maps onto the physical cores and hardware threads in the Intel Xeon Phi coprocessor. That mapping looks like this

4 Granularity of Affinity
Remember that the affinity is a set of logical CPUs on which a thread can run. We can therefore restrict a thread either to any of the logical CPUs that map to the same physical core (core affinity), or, more finely, to a specific hardware thread on that core (thread affinity). Mapping to a core allows the kernel more freedom to move the thread, which is potentially useful if one of the hardware threads is taking interrupts. On the other hand, the OS may abuse that freedom, and moving a thread between logical CPUs isn’t a free operation even when they share all levels of cache.

We have generally observed that binding to thread granularity provides more consistent results, though binding to core level can sometimes give better average performance over many runs. So, “your mileage may vary”, and this may be worth experimenting with.

To set a core affinity you should use CPU_SET() to create a cpu_set_t that contains each of the four logical CPUs that map to the same physical core, and then use sched_setaffinity() to force the appropriate affinity. (Or, if you are creating pthreads yourself, you could use the same cpu_set_t at pthread_create() time.)

To set a thread-level affinity you should create a cpu_set_t with a single logical CPU enabled in it.

5 Pre-existing Affinities
In most circumstances the affinity that is inherited will allow the thread to run on any logical CPU in the machine. However, there are a number of exceptions to that:

When executed by the offload mechanism the affinity is set so that the last physical core in the machine will not be used. (Since Intel parallel runtimes use the number of available logical CPUs in the incoming affinity to determine the correct number of threads to run, this is why an offloaded OpenMP code will use four fewer threads by default than the number of available HW threads in the machine).
In native mode if the user uses the taskset command they can set the initial affinity to a subset of the machine (taskset will be supported in an upcoming release of Intel MPSS).
In MPI, the MPI system can be used to set the affinity of MPI processes so that each can run on only a subset of the machine.

In each of these cases the affinity mask has been changed to reflect a sensible use of the machine that the process itself cannot easily determine. This is why when setting affinity by hand it is polite only to reduce the set of available logical CPUs on which a thread can run, not simply force it.

6 Sensible Affinities
Under Intel MPSS many of the kernel services and daemons are affinitized to the “Bootstrap Processor” (BSP), which is the last physical core. This is also where the offload daemon runs the services required to support data transfer for offload. It is therefore generally sensible to avoid using this core for user code. (Indeed, as already discussed, the offload system does that automatically by removing the logical CPUs on the last core from the default affinity of offloaded processes).

7 OpenMP
So far we’ve been talking about affinities at the level of cpu_set_t and system calls. If you’re using OpenMP you can ask the OpenMP runtime to set affinities for you using the KMP_AFFINITY environment variable. If you use the “explicit” form of affinity, you can give the precise set of logical CPUs to which to bind each thread (so you still need to understand the hardware to logical CPU mapping above).

You can find more information at the following compiler reference:
Key Features > OpenMP* Support > OpenMP* Library Support > Thread Affinity Interface (Linux* and Windows*)
______________________________________________________________________________________________________

Q) What is the consistency of floating-point results using the Intel® Compiler?
OR
Q) Why doesn’t my application always give the same answer?

To find more about the consistency of floating-point results and the related compiler options, please visit the following link:

http://software.intel.com/en-us/articles/consistency-of-floating-point-results-using-the-intel-compiler/

To find more about the differences in floating-point arithmetic between Intel Xeon processors and the Intel Xeon Phi coprocessor, please refer to the following white paper:

https://software.intel.com/en-us/articles/differences-in-floating-point-arithmetic-between-intel-xeon-processors-and-the-intel-xeon
______________________________________________________________________________________________________

Q) Should I use explicit prefetching?

Generally, explicit software prefetching helps the most when the data access pattern is convoluted and it is impossible for the compiler to optimize effectively. So if most of your loads are gather instructions, then a good option is to consider explicit prefetching.

On the other hand, the compiler heuristics can do a decent job of prefetching the data. The compiler uses all available information to compute the prefetch distance for each loop. The information that is most useful to the compiler, and that it lacks in many cases, is an estimate of the "average" trip count for a loop. In such cases, the best way to provide this information to the compiler is to use the loop_count pragma just before the loop. Please refer to the documentation of this pragma in the compiler reference guide at:

Compiler Reference>Pragmas>Intel-Specific Pragma Reference>loop_count
______________________________________________________________________________________________________

Q) How do I perform an asynchronous / non-blocking transfer to Intel Xeon Phi coprocessor?

Asynchronous data transfers, also known as non-blocking transfers, can be made to the coprocessor using the offload_transfer pragma. More information can be found about this pragma at the following compiler reference:

Compiler Reference > Intel-specific Pragma Reference > offload_transfer

Some important things to keep in mind while using offload_transfer:

Always explicitly state the coprocessor that each offload or offload_transfer is going to use. For example, #pragma offload target(mic:0) and #pragma offload_transfer target(mic:1) use coprocessor 0 for the offload and coprocessor 1 for the offload_transfer.
Always remember to use the alloc_if and free_if modifiers to control memory persistence. In most cases, it is more convenient to have a single offload pragma that allocates memory at the start of the program and another offload pragma at the end that frees all the allocated memory. In that scenario, remember that all intermediate offloads or offload_transfers should neither allocate nor free any memory.
Double buffering can provide improvement when using asynchronous data transfers.

For more examples as well as Best Known Methods (BKMs) on asynchronous transfers, please refer to the following link:

http://software.intel.com/sites/default/files/article/326700/6.2.1-asynchronous-offload.pdf
______________________________________________________________________________________________________

Q) Is peer-to-peer communication between coprocessors possible in the offload mode without MPI?

We do not support communication between cards in offload mode.
______________________________________________________________________________________________________

Q) If I compile my code for Intel MIC natively, how do I reverse offload some of the computations back to the CPU?

The compiler does not support reverse offload.
______________________________________________________________________________________________________

Q) What environment variables can be used to control and monitor the behavior of code offloaded to the Intel Xeon Phi coprocessor?

Several environment variables exist to control and monitor offload codes on the coprocessor. A small list of useful environment variables can be found at the following compiler reference:

Compilation> Setting Environment Variables.

Some other useful environment variables can be found at:

http://software.intel.com/en-us/articles/effective-use-of-the-intel-compilers-offload-features
______________________________________________________________________________________________________

Q) How do I mount NFS volumes onto the coprocessor?

The readme-en.txt for the Intel MPSS release provides most details needed to automount NFS filesystems on the card at boot time.

1. The readme mentions an entry for the card's /etc/fstab. To automount at boot time, add this entry to the card's file image on the host: /opt/intel/mic/filesystem/mic0/etc/fstab
For the example cited in the readme, add the line below into /opt/intel/mic/filesystem/mic0/etc/fstab:
172.31.1.254:/mic0fs /mic0fs nfs rsize=8192,wsize=8192,nolock,intr 0 0

2. To create the necessary mount point, add an entry into the mic0.filelist file on the host: /opt/intel/mic/filesystem/mic0.filelist
For the example cited in the readme, add the line shown below into the base.filelist:
dir /mic0fs 755 0 0
If you are running the Gold MPSS release (2.1.4346-16), reboot the card (micctrl -R) and the filesystem should mount at boot time.

For a multi-card configuration, here is one method to set up the same NFS-mounted filesystem on all cards.
1. Under /opt/intel/mic/filesystem/common, create a sub-directory "etc" and place a copy of the fstab file from one of the "mic#" card filesystem images there. Edit the fstab and add the entry for the NFS filesystem. Next add the mount point and fstab file entries into common.filelist.
For example, create /opt/intel/mic/filesystem/common/etc/fstab with:
devpts          /dev/pts        devpts defaults                0 0
tmpfs           /dev/shm        tmpfs   defaults                0 0
sysfs           /sys            sysfs   defaults                0 0
proc            /proc           proc    defaults                0 0
host:/micfs /micfs nfs rsize=8192,wsize=8192,nolock,intr 0 0
Create /opt/intel/mic/filesystem/common.filelist containing:
dir /micfs 644 0 0
file /etc/fstab etc/fstab 664 0 0

2. Under /opt/intel/mic/filesystem, remove the fstab file entry from each card's mic#.filesystem file. (Optionally, under each card's filesystem image /opt/intel/mic/filesystem/mic# (e.g. /opt/intel/mic/filesystem/mic0), rename the file "etc/fstab".)

3. On the host, add the appropriate entry for /etc/exports.
For example,
/micfs 172.31.0.0/255.255.0.0(rw,no_root_squash)
______________________________________________________________________________________________________

Q) Is OpenCL supported on Intel Xeon Phi coprocessor?

For more information please take a look at:

http://software.intel.com/en-us/blogs/2012/11/12/introducing-opencl-12-for-intel-xeon-phi-coprocessor
______________________________________________________________________________________________________

Q) Can I explicitly allocate memory in an offload?

Yes, you can explicitly allocate memory within an offload. For details regarding persistence of heap allocated memory, please refer to the following:

http://software.intel.com/en-us/articles/effective-use-of-the-intel-compilers-offload-features
______________________________________________________________________________________________________
Q) How do I conditionally compile code only for the Intel Xeon Phi coprocessor?

You can compile Intel MIC architecture-only code by protecting the code using #ifdef __MIC__

e.g.

#ifdef __MIC__

//Code for Intel MIC architecture goes in here

#endif

Please remember that #includes for certain header files related to Intel MIC architecture should be protected in this manner.
______________________________________________________________________________________________________

Q) How do I compile code only for the Intel Xeon Phi coprocessor?

Compiling code for only the Intel Xeon Phi coprocessor, also known as native compilation, can be done by using the -mmic compiler switch.
______________________________________________________________________________________________________

Q) Does Intel Xeon Phi coprocessor support third-party tools and libraries?

For the most up to date information on the third-party tools and library support for Intel Xeon Phi coprocessor please check the following page:

http://software.intel.com/en-us/articles/intel-and-third-party-tools-and-libraries-available-with-support-for-intelr-xeon-phitm
______________________________________________________________________________________________________

Q) How do I instantiate and manipulate shared versions of C++ STL vectors using _Cilk_shared and _Cilk_offload?

Shared versions of C++ STL vectors can be instantiated through the use of shared allocators defined in offload.h. Here is an example using shared allocators.

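[cpp]

#include <vector>
#include <offload.h>
#include <stdio.h>

using namespace std;

// Vector type whose element storage comes from the shared allocator
typedef vector<int, __offload::shared_allocator<int> > shared_vec_int;

// Shared pointer to a shared vector, visible on host and coprocessor
_Cilk_shared shared_vec_int * _Cilk_shared v;

// Checks that element i of the shared vector holds the value i
_Cilk_shared int test_result() {
   int result = 1;
   for (int i = 0; i < 5; i++) {
      if ((*v)[i] != i) {
         result = 0;
      }
   }
   return result;
}

int main() {
   int result;

   // Construct the vector object itself in shared memory (placement new)
   v = new (_Offload_shared_malloc(sizeof(vector<int>))) _Cilk_shared vector<int, __offload::shared_allocator<int> >(5);

   // Fill the vector on the host
   for (int i = 0; i < 5; i++) {
      (*v)[i] = i;
   }

   // Verify the contents on the coprocessor
   result = _Cilk_offload test_result();

   if (result != 1)
      printf("Failed\n");
   else
      printf("Passed\n");

   return 0;
}

[/cpp]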
______________________________________________________________________________________________________

Q) How do I profile native applications?

To profile a native application, please follow the steps provided here. You can also view this information by using the Intel VTune Amplifier XE 2013 Help for Linux* OS Reference manual and browsing through the table of contents as follows: Intel VTune Amplifier 2013 > User's Guide > Choosing Targets > Choosing a Target on the Intel Xeon Phi Coprocessor.

______________________________________________________________________________________________________

Q) Where can I find the description of the hardware performance counters for Intel Xeon Phi coprocessor?

Intel® VTune™ analyzer provides a short description of the hardware performance counters when adding events to custom analysis. The description of some common performance counters as well as metrics can be found here.

______________________________________________________________________________________________________

Q) Where can I find more about the key features, peak performance and available SKUs of Intel Xeon Phi coprocessors?

Important information about Intel® Xeon Phi™ coprocessor can be found at:

http://download.intel.com/newsroom/kits/xeon/phi/pdfs/Intel-Xeon-Phi_Factsheet.pdf

______________________________________________________________________________________________________

Q) Can hand-written code optimized for SSE or Intel® Advanced Vector Extensions (Intel AVX) work for Intel Xeon Phi coprocessor?

No. The Intel Xeon Phi coprocessor does not support SSE or Intel AVX. Furthermore, the techniques used to produce optimal SSE or Intel AVX code need to be changed when adapting the implementation for the Intel Xeon Phi coprocessor. Code using SSE or Intel AVX assumes a vector length of 128 bits or 256 bits, respectively, while the Intel Xeon Phi coprocessor has a vector width of 512 bits. Thus the algorithm will need to be rewritten to use the wider vectors effectively, whether it was written by hand in intrinsics or in a higher-level language that had been structured to enable the compiler to produce the best SSE or Intel AVX code.

______________________________________________________________________________________________________

Q) How can I reduce the memory allocation overhead in an offload?

Tips on minimizing coprocessor memory allocation overhead can be found on the following page under the section “Minimize Coprocessor Memory Allocation Overhead”.

http://software.intel.com/en-us/articles/effective-use-of-the-intel-compilers-offload-features

______________________________________________________________________________________________________

Q) Is there a way that I can automatically time each individual offload?

Yes, you can automatically time each individual offload by using the environment variable OFFLOAD_REPORT. You can find out more about the offload report on the following webpage under the section “Environment Variables for Controlling Offload”:

http://software.intel.com/en-us/articles/effective-use-of-the-intel-compilers-offload-features

Alternately, the compiler reference for OFFLOAD_REPORT can be found at:

Compiler Reference: Setting Environment Variables

Compiler Reference: __Offload_report

______________________________________________________________________________________________________

Q) How can I improve data transfer rate from the host to the coprocessor in an offload?

The data transfer rate from the host to the coprocessor can be improved by the following:

On the host, align data to 4KB boundaries for optimal DMA performance over the PCIe bus. To align data use _mm_malloc() instead of malloc() when allocating data.
Depending on the use of the alloc_if and free_if modifiers, an offload timing measurement can include the overhead of allocating and freeing memory on the coprocessor. Using persistent memory helps eliminate the allocation and free overheads and also provides more consistent timing. You can find more about persistent memory on the following webpage:

http://software.intel.com/en-us/articles/effective-use-of-the-intel-compilers-offload-features

______________________________________________________________________________________________________

Q) Where can I find more about the Intel Xeon Phi coprocessor Instruction Set Architecture (ISA)?

Intel Xeon Phi coprocessor ISA can be found at

http://software.intel.com/sites/default/files/forum/278102/327364001en.pdf

______________________________________________________________________________________________________

Q) Where can I find more documentation about the Performance Monitoring Units (PMUs) in the Intel Xeon Phi coprocessor?

You can learn more about the PMUs in the Intel Xeon Phi coprocessor in the Software Developer’s Guide that can be found at:

http://software.intel.com/sites/default/files/forum/278102/327364001en.pdf

______________________________________________________________________________________________________

Q) How do I implement memory fences on the Intel Xeon Phi coprocessor?

Since the Intel Xeon Phi coprocessor is an in-order machine, it does not normally require any instructions to enforce the ordering of memory instructions, as they naturally become globally visible in program order. Therefore it is normally sufficient to implement memory barriers in compiled code as a simple compiler barrier ('__asm__ __volatile__("":::"memory")'), which ensures that the compiler does not reorder loads and stores over the barrier while generating no code.

The exceptions to this are the Non-Globally-Ordered (NGO) stores. If you explicitly code these using assembly instructions or intrinsics, then you do need to insert a memory fence.

The best known memory fence implementations are:

If a store instruction is present at the point where the barrier is required, then replace it with an xchg; since xchg is a locked operation (even though it has no lock prefix) it is automatically a full memory fence.
If there is not a convenient store, then use lock; addl $0,(%rsp). This is also a locked instruction (so a full memory fence) that has no other effect. Provided that the stack is still in the cache, it seems to complete on the Intel Xeon Phi coprocessor in four cycles, which is much faster than using cpuid (which was another option that has been suggested).

______________________________________________________________________________________________________

Q) How can I improve software prefetching to get better performance?

Software prefetching basics, guidelines, and best-known methods can be found at:

http://software.intel.com/sites/default/files/article/326703/5.3-prefetching-on-mic-4.pdf

______________________________________________________________________________________________________

Q) Does Intel Xeon Phi coprocessor support the PAPI hardware counter library?

Intel currently does not support the PAPI hardware counter library. You can find some third-party work at:

http://www.eece.maine.edu/~vweaver/projects/mic/

______________________________________________________________________________________________________

Q) What do the performance hotspots in kmp_wait_sleep and kmp_static_yield imply?

kmp_wait_sleep is where a thread waits inside the OpenMP runtime if it has nothing to do. There are a number of scenarios when a thread has nothing to do.

The most significant cases are:

There could be an explicit OpenMP barrier in the code, and some threads are waiting for the others to reach the barrier
Some threads are waiting for the other threads at the implicit "join-barrier" found at the end of every parallel section (unless the nowait clause is used)
The OpenMP thread pool could be waiting for a serial section of code to finish

The first two cases imply a load imbalance. The last case results in a hot spot when an algorithm jumps between parallel and serial execution a lot in a performance-critical area, which usually also amplifies any load imbalances in the parallel portions of the algorithm.

kmp_static_yield is effectively the same place: this is where the runtime is delaying, and it is called from kmp_wait_sleep.

So a large amount of time in these routines can mean that you have a load-imbalance, and/or that you aren’t exploiting all of the threads you have available effectively.

Remember that on the Intel Xeon Phi coprocessor you have a large number of threads. If you’re using static loop scheduling (which is the default), then even if there’s no variance in the time for each iteration, you may get significant imbalance in cases that would have been fine on a machine with eight or 16 threads.

For instance, a loop with 256 iterations run on 240 hardware threads will be processed in two batches: after the first 240 iterations are processed, the remaining 16 iterations will be processed. Since there are a total of 480 work units available during the processing of this loop, you’ll waste 224, for a maximum efficiency of 53%.
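The arithmetic above can be checked with a short sketch. It assumes every iteration costs one equal time unit and that each thread handles at most one iteration per batch; the function name is illustrative and not part of any OpenMP or Intel API.

```python
import math

def static_schedule_efficiency(iterations, threads):
    """Return (batches, wasted_slots, efficiency) for equal-cost iterations.

    Assumption: every iteration costs one time unit, so the loop needs
    ceil(iterations / threads) time units in total.
    """
    batches = math.ceil(iterations / threads)
    total_slots = batches * threads      # thread-time slots available
    wasted = total_slots - iterations    # slots in which a thread sits idle
    return batches, wasted, iterations / total_slots

batches, wasted, eff = static_schedule_efficiency(256, 240)
print(f"{batches} batches, {wasted} wasted slots, {eff:.0%} efficiency")
```

Under the same assumptions, a loop with exactly 240 iterations (or any multiple of 240) wastes no slots, which is why matching the iteration count to the thread count matters more here than on a machine with few threads.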

______________________________________________________________________________________________________

Q) How do I disable automatic offloads for a specific Intel® MKL call, or on a specific coprocessor?

You can disable automatic offloads by calling mkl_mic_disable().

Alternatively, you can use mkl_mic_set_workdivision to assign the entire computation to the host, which effectively disables offloads:

mkl_mic_set_workdivision(MKL_TARGET_HOST,0,1.0);

You can also set the work division to zero on every coprocessor to disable offloads completely, or on just one to disable offloads to that specific coprocessor. For example, for coprocessor 0:

mkl_mic_set_workdivision(MKL_TARGET_MIC,0,0.0);

______________________________________________________________________________________________________

Q) Which Intel MKL functional domains are supported on the Intel Many Integrated Core (Intel MIC) Architecture?

The Intel MKL 11.0 Update 2 supports the following functional domains on Intel MIC Architecture:

Optimized

BLAS level 3, and much of level 1 and 2
Sparse BLAS: CSRMV, CSRMM (Native only)
Some important LAPACK routines (LU, QR, Cholesky)
Fast Fourier Transforms
Vector Math Library (Native only)
Random number generators in the Vector Statistical Libraries

Not Supported

Poisson Solver
Iterative Sparse Solvers
Trust Region Solvers

Supported

Everything else.

To find more information, please visit this page.

______________________________________________________________________________________________________

Q) How can I use Intel MKL and third-party applications together?

Articles describing how to use Intel MKL with third-party libraries and applications such as NumPy*, SciPy*, MATLAB*, C#, Python*, and NAG* can be found here.

______________________________________________________________________________________________________

Q) How can I control the threading in Intel MKL routines?

You can control the parallelization within the Intel MKL routines by using the MKL threading support functions and some environment variables. You can read more at:

http://software.intel.com/en-us/articles/intel-math-kernel-library-intel-mkl-using-intel-mkl-with-threaded-applications/

______________________________________________________________________________________________________

Q) How do I run the Intel® MPI Library on the Intel Xeon Phi coprocessor?

You can find more about running Intel MPI on the Intel Xeon Phi coprocessor at:

http://software.intel.com/en-us/articles/how-to-run-intel-mpi-on-xeon-phi

______________________________________________________________________________________________________

Q) Which Intel MPI files do I need to transfer to the coprocessor for my MPI application?

Several binaries and libraries need to be transferred to the Intel Xeon Phi coprocessor to execute MPI applications on the coprocessor. You can find more information at:

http://software.intel.com/en-us/articles/mpi-specific-files-for-intel-xeon-phi-what-is-needed

______________________________________________________________________________________________________

Q) How can I pin Intel MPI processes on the Intel Xeon Phi coprocessor?

Information on pinning Intel MPI processes on the Intel Xeon Phi coprocessor can be found at:

http://software.intel.com/en-us/articles/mpi-and-process-pinning-on-xeon-phi

______________________________________________________________________________________________________

Q) How do I pin processes and associated threads under Intel MPI in a hybrid MPI-OpenMP model in native mode?

You can pin processes and associated threads using I_MPI_PIN_DOMAIN and KMP_AFFINITY environment variables.

For 4 or fewer OpenMP threads per MPI process, set the variable as follows:

I_MPI_PIN_DOMAIN=core;

For example, to pin 4 threads per process use:

I_MPI_PIN_DOMAIN=core; OMP_NUM_THREADS=4;

For more than 4 OpenMP threads per MPI process, set the variables as follows:

I_MPI_PIN_DOMAIN=omp; KMP_AFFINITY=compact;

For example, to pin 8 threads per process use:

I_MPI_PIN_DOMAIN=omp; KMP_AFFINITY=compact; OMP_NUM_THREADS=8

In this case, remember to set OMP_NUM_THREADS to a multiple of four. Each core has four hardware threads, so any other value splits a core between MPI processes.

Also, setting I_MPI_DEBUG=5 reveals the MPI process affinity map.
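The selection rule described in this answer can be sketched as a small helper. This is only an illustration: the function name, the threads_per_core parameter, and the choice to reject thread counts that are not a multiple of four are assumptions, while the variable names and values are the ones given above.

```python
def hybrid_pinning_env(omp_threads, threads_per_core=4):
    """Return the environment settings for one MPI rank in native mode.

    Hypothetical helper: it only encodes the rule from this FAQ entry.
    """
    if omp_threads < 1:
        raise ValueError("omp_threads must be positive")
    env = {"OMP_NUM_THREADS": str(omp_threads)}
    if omp_threads <= threads_per_core:
        # 4 or fewer threads: pin each rank to a single core.
        env["I_MPI_PIN_DOMAIN"] = "core"
    else:
        # More than 4 threads: size the domain from OMP_NUM_THREADS.
        if omp_threads % threads_per_core != 0:
            raise ValueError("use a multiple of four to avoid splitting cores")
        env["I_MPI_PIN_DOMAIN"] = "omp"
        env["KMP_AFFINITY"] = "compact"
    return env

# Print the settings as shell export lines (the shell syntax is an assumption).
for name, value in sorted(hybrid_pinning_env(8).items()):
    print(f"export {name}={value}")
```

Raising an error for counts such as 6 is a design choice of this sketch; the guidance above only recommends multiples of four.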

You can read more about OpenMP-MPI interoperability at this page.

______________________________________________________________________________________________________

Q) How do I pin threads within an offload created by an Intel MPI process?

This is the case where you have multiple MPI processes on the host, and each process offloads some work to the Intel Xeon Phi coprocessor. To keep the MPI processes from running their offloaded code on the same threads, the OpenMP KMP_AFFINITY environment variable must be set carefully. To pin threads and prevent interference, use a long command line with KMP_AFFINITY proclist settings as shown below:

mpiexec.hydra \
-env MIC_OMP_NUM_THREADS 10 -env MIC_KMP_AFFINITY granularity=fine,proclist=[1-8],explicit -n 1 ./myApp \
: -env MIC_OMP_NUM_THREADS 10 -env MIC_KMP_AFFINITY granularity=fine,proclist=[9-16],explicit -n 1 ./myApp \
: -env MIC_OMP_NUM_THREADS 10 -env MIC_KMP_AFFINITY granularity=fine,proclist=[17-24],explicit -n 1 ./myApp \
: -env MIC_OMP_NUM_THREADS 10 -env MIC_KMP_AFFINITY granularity=fine,proclist=[25-32],explicit -n 1 ./myApp

In the above example, there is one argument set per MPI process, and the argument sets are separated by “:”.
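Writing such a command line by hand makes it easy to let two proclist ranges overlap. The sketch below generates one argument set per MPI process, each with its own non-overlapping range of logical CPUs. The function and its parameters are hypothetical, and it assumes one OpenMP thread per listed logical CPU, so MIC_OMP_NUM_THREADS is set to the size of each range; the flags themselves are the ones used in the example.

```python
def offload_pinning_command(ranks, cpus_per_rank, app="./myApp", first_cpu=1):
    """Build an mpiexec.hydra command with one disjoint proclist per rank."""
    arg_sets = []
    for rank in range(ranks):
        lo = first_cpu + rank * cpus_per_rank   # first logical CPU of this rank
        hi = lo + cpus_per_rank - 1             # last logical CPU of this rank
        arg_sets.append(
            f"-env MIC_OMP_NUM_THREADS {cpus_per_rank} "
            f"-env MIC_KMP_AFFINITY granularity=fine,proclist=[{lo}-{hi}],explicit "
            f"-n 1 {app}"
        )
    # Argument sets for different MPI processes are separated by ":".
    return "mpiexec.hydra " + " : ".join(arg_sets)

print(offload_pinning_command(ranks=4, cpus_per_rank=8))
```

With four ranks and eight CPUs per rank, the generated ranges are [1-8], [9-16], [17-24], and [25-32].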

For more information, please visit this page.

______________________________________________________________________________________________________

Q) Does the Intel Xeon Phi coprocessor have support for Berkeley Lab Checkpoint/Restart (BLCR)?

Currently the Intel Xeon Phi coprocessor has no support for BLCR.

______________________________________________________________________________________________________

joeli · ‎06-05-2013

The link to documentation regarding profiling a native application seems to be broken.

Sumedh_N_Intel · ‎06-05-2013

Thank you for bringing this to our attention. I have updated the URL. If you are looking for documentation on profiling applications on the coprocessor, you can also view the video tutorials on Intel VTune Amplifier at software.intel.com/mic-developer (under Trainings > Tutorials).

Aswini_S · ‎12-30-2013

Thank you... Very helpful guide!

Jess · ‎03-19-2014

Any idea when Intel are going to properly support the Phis in Intel MKL? At the moment they perform really, really poorly as the only routines that appear to have been optimised are those used for the LINPACK benchmark. What we really need are things like FFTW, which has never been that well supported. The vast majority of HPC software being used at our site relies heavily on FFTW in particular, and having recompiled some of it to try to get it working on the Phis, we find the performance is absolutely dire, making the cards an expensive ornament.

Sumedh_N_Intel · ‎03-19-2014

Jess wrote:

Any idea when Intel are going to properly support the Phis in Intel MKL? At the moment they perform really, really poorly as the only routines that appear to have been optimised are those used for the LINPACK benchmark. What we really need are things like FFTW, which has never been that well supported. The vast majority of HPC software being used at our site relies heavily on FFTW in particular, and having recompiled some of it to try to get it working on the Phis, we find the performance is absolutely dire, making the cards an expensive ornament.

The Fast Fourier Transform routines in the Intel MKL library have been optimized for use on the Intel MIC Architecture. Perhaps this page will give you an idea of the peak performance that you can expect from the Intel MIC Architecture. Are there any particular routines that you are interested in?

ashish_s_ · ‎08-20-2014

Can you provide the link to download the Gromacs RF workload used in the Intel article "Gromacs with Intel PHi"?

Thanks,

AKS

ashish_s_ · ‎08-20-2014

Hi,

I am running miniFE with Intel MIC. I successfully run tests with Intel MIC.

Can someone help me to understand the benchmark parameters for miniFE?

Thanks,

AKS

Chronus_Taizen · ‎12-04-2016

Part 3 of the answer to "Q) What is processor affinity and how do we set it?" is missing the picture. A "Page not found." when I went to the image URL. Can you please include a valid link so I can see that image? That is a critical piece of information.

Thank you.

Chronus_Taizen · ‎12-04-2016

All the picture links are broken. Could you please fix this? Especially the second picture about "Mapping Hardware to Logical CPUs".

Thanks.

Yuri_S_ · ‎12-12-2016

Hi dear,

We are interested in learning more about Intel Xeon Phi, but the material available does not answer some questions we have.

1) I have a datacenter with Intel Xeon processors and Hadoop as a data lake. What do I need to start using the Intel Xeon Phi libraries and their benefits? Do I just need to buy Intel Xeon Phi cards and install one in a PCI Express slot of each machine? Do I need one end node?

2) I have Spark, Hive, Hadoop and Python. I am looking for some example code but I cannot find it. How do I import the libraries, and how do I code in PySpark or Python to get the benefits of Intel Xeon Phi?

I am looking forward to get an answer.

Best Regards