I have found that __kmpc_atomic_float4_add is used in the assembly code generated for the following two lines:
#pragma omp atomic
array += 1.0;
Performance of this code is not good on Intel Xeon Phi when many threads are used. Is there any information about how __kmpc_atomic_float4_add is implemented? Are there any better solutions for efficient and scalable atomic updates? Is it possible to use GCC intrinsics such as __sync_add_and_fetch() in offload regions?
You cannot use a "LOCK; add to memory" instruction as performed by __sync_fetch_and_add, though you can do something like:
float temp, result;
do {
    temp = array;
    result = temp + 1.0f;
} while (!CAS(&array, temp, result));
The best way is to partition the code such that no two threads will simultaneously update the same location within the array.
Reference: http://en.wikipedia.org/wiki/Compare-and-swap
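For illustration, here is a sketch of how such a loop might be written in C using GCC's __sync builtins (the helper name atomic_float_add is illustrative, not a standard function). Since there is no native atomic float add, the compare is done on the float's 32-bit bit pattern:

```c
#include <stdint.h>
#include <string.h>

/* Sketch: atomically add val to *addr with a CAS retry loop.
   The bit pattern is compared, not the float value, because
   __sync_bool_compare_and_swap operates on integer words. */
static void atomic_float_add(volatile float *addr, float val)
{
    uint32_t old_bits, new_bits;
    float old_val, new_val;
    do {
        old_val = *addr;                  /* snapshot current value */
        new_val = old_val + val;
        memcpy(&old_bits, &old_val, sizeof old_bits);
        memcpy(&new_bits, &new_val, sizeof new_bits);
        /* retry if another thread changed *addr since the snapshot */
    } while (!__sync_bool_compare_and_swap((volatile uint32_t *)addr,
                                           old_bits, new_bits));
}
```

Note the memcpy type punning: casting a float to int directly would convert the value rather than reinterpret the bits.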
Jim Dempsey
Is there any information about how __kmpc_atomic_float4_add is implemented?
Sure, the whole of the OpenMP* runtime sources are available (from either http://openmprtl.org or http://openmp.llvm.org), so you can see exactly how they are implemented (which is effectively as Jim D describes).
Kadir,
You might want to look at using reduction variables and the reduction clause provided by OpenMP:
double sum = 0.0;
// sum is private within the parallel region.
// Upon exit of the parallel region, operator + is applied to the
// outer-scope sum; this operation is performed in a thread-safe manner.
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < N; ++i) {
    sum += a[i];
}
Jim Dempsey
Dear Jim,
I have to perform a reduction on an array. Sorry, the example I gave does not reflect my real need. What is the most efficient way to reduce multiple arrays into one array in parallel on the MIC architecture? I am using C/C++, not Fortran.
Divide the output array into sections (often called tiles) and have only one thread write to any one section. This way you will not require atomics or locks.
If the work per cell of the output array is roughly uniform, then for N threads make N tiles (i.e., static partitioning and scheduling).
If the work per cell varies, consider more partitions and dynamic scheduling.
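The tiling idea can be sketched as follows (array dimensions and names are illustrative assumptions). Each output element is owned by exactly one loop iteration, so with OpenMP's static scheduling each thread writes only its own contiguous section of the output, and no atomics or locks are needed:

```c
#define M 4      /* number of input arrays (assumption)  */
#define N 1000   /* elements per array (assumption)      */

/* Reduce M input arrays into one output array. No two threads
   ever write the same out[i], so the loop is lock-free by
   construction. */
void reduce_arrays(float out[N], float in[M][N])
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; ++i) {
        float s = 0.0f;
        for (int m = 0; m < M; ++m)
            s += in[m][i];
        out[i] = s;   /* written only by the owning thread */
    }
}
```

Without OpenMP enabled the pragma is ignored and the code still produces the same result serially.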
For some situations, consider a plesiochronous phasing barrier, or some variation thereon. Here is an article I wrote: https://software.intel.com/en-us/blogs/2014/02/22/the-chronicles-of-phi-part-5-plesiochronous-phasing-barrier-tiled-ht3. You might want to read the first four parts of that blog series for background insight into the problem/solution iterations that led to the final solution.
As with most optimization work, you will find that some of the steps that look promising at the start yield less than the expected results. Working out why this happened (or did not happen) leads you to an improved path toward a solution. I am of the philosophy that it is better to teach someone how to figure a problem out than to hand them the (a) solution. If the teaching is done right, the student may outperform the teacher.
Jim Dempsey
I forgot to mention: be mindful that the strength of the Xeon Phi is not so much in the number of cores and hardware threads. Its real strength lies in the wide vector units (64 bytes, i.e. 16 floats or 8 doubles).
Keep this in mind such that your partitioning scheme favors vectorization. This may also affect how you collect the input data and/or layout.
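One way the partitioning can favor vectorization (a sketch; the function name and bounds are illustrative): keep each thread's tile contiguous so the inner loop is unit-stride, which the compiler can turn into 512-bit vector operations (16 floats at a time) on the Phi:

```c
/* Accumulate one contiguous tile [lo, hi) of the output.
   restrict tells the compiler out and in do not alias, which
   helps it vectorize the unit-stride loop. */
void accumulate_tile(float *restrict out, const float *restrict in,
                     int lo, int hi)
{
    for (int i = lo; i < hi; ++i)   /* unit stride: vectorizable */
        out[i] += in[i];
}
```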
Jim Dempsey
I was not able to compile the following code using `icc`:
float temp;
float result;
do {
    temp = array;
    result = temp + 1.0f;
} while (!CAS(&array, temp, result));
I have found a solution in C++; however, I am using C. Is there a solution in C?
You will have to write your own CAS (Compare And Swap). When your compiler reports a missing function named CAS, your first course of action should be a web search for "CAS" as it relates to computer programming. You will find that CAS stands for Compare And Swap, an abstract function name used in computer-programming papers. Different compilers provide differently named functions, with different argument orders and return values. There are usually flavors of CAS for byte, word, dword, qword, and dqword operands, though not all processors support all the operand widths. There is a similar abstract function, DCAS (Double Compare And Swap), among various others.
Here are some of the functions you might use
http://stackoverflow.com/questions/2975485/atomic-swap-with-cas-using-gcc-sync-builtins
http://msdn.microsoft.com/en-us/library/windows/desktop/ms683560(v=vs.85).aspx
http://msdn.microsoft.com/en-us/library/windows/desktop/ms683562(v=vs.85).aspx
Using "float" you would select the function that uses a dword (4 bytes).
It is your responsibility to ensure that the (destination) variable being swapped resides in RAM and is assured to hold the most recently written value. This may require a memory barrier and/or the volatile attribute; you do not want the compiler to optimize away your intended operation.
Note, CAS does not operate on values held in SSE/AVX/AVX2/AVX-512 registers. Computational results from those units must first be stored into a local float, preferably with volatile, so that the value can be fetched into a general-purpose register.
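For a plain-C route that avoids compiler-specific builtins, here is a hedged sketch using C11's stdatomic.h (the helper name is illustrative). The value is kept as a 32-bit atomic bit pattern so the compare is done on bits rather than on float values, which sidesteps -0.0 and NaN comparison pitfalls:

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Sketch: add val to the float whose bit pattern is stored in *bits.
   atomic_compare_exchange_weak updates old_bits with the current
   value on failure, so the loop simply retries with fresh data. */
static void atomic_float_add_c11(_Atomic uint32_t *bits, float val)
{
    uint32_t old_bits = atomic_load(bits);
    for (;;) {
        float old_val, new_val;
        uint32_t new_bits;
        memcpy(&old_val, &old_bits, sizeof old_val);
        new_val = old_val + val;
        memcpy(&new_bits, &new_val, sizeof new_bits);
        if (atomic_compare_exchange_weak(bits, &old_bits, new_bits))
            break;   /* success: our add was applied atomically */
    }
}
```

This requires a C11-capable compiler; the shared variable must then be declared as _Atomic uint32_t rather than float, with memcpy used to move between the two views.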
Jim Dempsey
The link you found is fine. You will need to include the TBB header file.
Note, the example shown in the TBB link uses int (4 bytes) and, as written, is not suitable for float: the "o = x" will perform a float-to-int conversion. You would have to modify the code to store the float and then reinterpret-cast to fetch its bit pattern as an int.
Jim Dempsey
