Software Archive
Read-only legacy content
Announcements
FPGA community forums and blogs on community.intel.com are migrating to the new Altera Community and are read-only. For urgent support needs during this transition, please visit the FPGA Design Resources page or contact an Altera Authorized Distributor.
17060 Discussions

Most efficient way for atomic updates on Xeon Phi

kadir
Beginner
988 Views

I have found out that __kmpc_atomic_float4_add was used in the assembly code of the following two lines:

#pragma omp atomic
array += 1.0;

Performance of this code is not good on Intel Xeon Phi when many threads are used. Is there any information about how __kmpc_atomic_float4_add is implemented? Are there any better solutions for efficient and scalable atomic updates? Is it possible to use GCC intrinsics such as __sync_add_and_fetch() in offload regions?

0 Kudos
9 Replies
jimdempseyatthecove
Honored Contributor III
988 Views

You cannot use a "LOCK, add to memory" as performed by __synch_fetch_and_add though you can perform something like:

do {
float temp = array;
float result = temp + 1.0f;
} while(!CAS(&array, temp, result));

The best way is to partition the code such that no two threads will simultaneously update the same location within the array.

Reference: http://en.wikipedia.org/wiki/Compare-and-swap

Jim Dempsey

0 Kudos
James_C_Intel2
Employee
988 Views

Is there any information about how __kmpc_atomic_float4_add is implemented? 

Sure, the whole of the OpenMP* runtime sources are available (either from http://openmprtl.org or http://openmp.llvm.org ). So you can see exactly how they are implemented. (Which is effectively as Jim D describes).

0 Kudos
jimdempseyatthecove
Honored Contributor III
988 Views

Kadir,

You might want to look at using reduction variables and syntax as used by OpenMP

double sum = 0.0;
// sum is private within parallel region
// ** However, upon exit of parallel region operator(+) performed on outer scope sum
// this operation is performed in a thread-safe manner
#pragma omp parallel for reduction(+:sum)
for(int i=0; i < N; ++i) {
  sum += a; }

Jim Dempsey

0 Kudos
kadir
Beginner
988 Views

Dear Jim,

I have to perform reduction on an array. Sorry for the example that I give since it does not consider my real need. What is the most efficient way to reduce multiple arrays into one array in parallel on MIC architecture? I am using C/C++, not Fortran.

0 Kudos
jimdempseyatthecove
Honored Contributor III
988 Views

Divide the output array into sections (often called tiles) and have only one thread write to any one section. This way you will not require atomics or locks.

If work per cell in output array is relatively the same then for N threads make N tiles. (e.g. static partitioning and scheduling)

If work per cell varies, then consider more partitions and dynamic scheduling.

For some situations consider a plesiochronous phasing barrier. Here is an article I wrote (https://software.intel.com/en-us/blogs/2014/02/22/the-chronicles-of-phi-part-5-plesiochronous-phasing-barrier-tiled-ht3) or some variation there upon. You might want to read the first 4 parts of that series of blog to give you some background insight as to the problem, solution, problem, solution, ... iterations that lead to the final solution.

As with most optimizing situations you will find that some of the promising steps you take at the beginning of the process yield less than expected results. In trying to understand why this happened (or did not happen), this leads you to an improved path for solution. I am of the philosophy that it is better to teach someone to learn how to figure it out as opposed to telling them the (a) solution. If you did your teaching right, then the student may out perform the teacher.

Jim Dempsey

0 Kudos
jimdempseyatthecove
Honored Contributor III
988 Views

I forgot to mention. Be mindful that the strength of the Xeon Phi is not necessarily with the number of cores and hardware threads. Its real strength lies in the wide vector units (64 bytes, 16 floats, 8 doubles).

Keep this in mind such that your partitioning scheme favors vectorization. This may also affect how you collect the input data and/or layout.

Jim Dempsey

0 Kudos
kadir
Beginner
988 Views
I was not able to compile following code using `icc`:

float temp;
float result;
do {
  temp = array;
  result = temp + 1.0f;
} while(!CAS(&array, temp, result));

 

I have found  a solution in C++. However, I am using C language. Are there any solutions in C language?

 

0 Kudos
jimdempseyatthecove
Honored Contributor III
988 Views

You will have to write your own CAS (Compare And Swap). When your compiler error indicated missing function named CAS your first course of action is to perform a web search for "CAS" as it relates to computer programming. You will find that CAS is Compare And Swap. This is an abstract function name used in computer programming papers. Various compilers have different named functions an with different argument orders and return values. There are usually flavors of CAS for byte, word, dword, qword, dqword. Not all processors support all the different word lengths. There is a similar abstract function DCAS (Double Compare And Swap), and various other function.

Here are some of the functions you might use

http://stackoverflow.com/questions/2975485/atomic-swap-with-cas-using-gcc-sync-builtins

http://msdn.microsoft.com/en-us/library/windows/desktop/ms683560(v=vs.85).aspx

http://msdn.microsoft.com/en-us/library/windows/desktop/ms683562(v=vs.85).aspx

Using "float" you would select the function that uses a dword (4 bytes).

It is your responsibility to assure that the (destination) variable being swapped is located in RAM and is assured to have the most recent written value. This may require memory barrier and/or volatile attributes. You do not want the compiler to optimize away your intended function.

Note, CAS is not functional for values stored in SSE/AVX/AVX2/AVX512 registers. Computational results using SSE/AVX/AVX2/AVX512 will have to be stored into a local float, preferably with volatile, such that it can be fetched into a GP register.

Jim Dempsey

0 Kudos
jimdempseyatthecove
Honored Contributor III
988 Views

The link you found is fine. You will need to include the TBB header file.

Note, the example shown in the link to TBB was using int (4 bytes) and as written would not be suitable for float. The "o = x" will perform a float to int conversion. You would have to modify the code to store the float then reinterpret cast to fetch the bit pattern as int.

Jim Dempsey

0 Kudos
Reply