I am using a Xeon Phi configured with Intel Parallel Studio, which also provides the OpenMP libraries. I am wondering what the best way is to understand how "Balanced" affinity is implemented in the OpenMP *code*, and which sources I should refer to for that.
If anyone can answer the following, it will help:
1) Does Intel Parallel Studio use the Intel OpenMP source code listed here: https://www.openmprtl.org/ ?
2) What is the best way to understand how balanced affinity was implemented by Intel in the OpenMP source code?
3) What approach should one take to implement one's own thread affinity scheme in OpenMP and get it working with Intel Caffe?
Much of the optimized code in MKL comes in both OpenMP and TBB parallel versions. This matters here only insofar as you should parallelize your own code according to the same threading model. Much of the optimized MKL code is probably written with extensive use of SSE intrinsics and is not intended for us to read. Many MKL functions are functionally compatible with open-source equivalents, which, however, are difficult to parallelize.
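For concreteness, the threading layer of MKL is selected at link time. These link lines are illustrative only (taken from the usual Linux LP64 layout; use the MKL Link Line Advisor for your exact compiler and version):

```shell
# OpenMP threading layer (pairs with the Intel OpenMP runtime, libiomp5):
#   icc app.c -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm
# TBB threading layer (pairs with libtbb instead):
#   icc app.c -lmkl_intel_lp64 -lmkl_tbb_thread -lmkl_core -ltbb -lstdc++ -lpthread -lm
```

Whichever layer the application links against is the model it should coordinate its own parallelism with.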
You could look at the LLVM OpenMP runtime source to see whether it has an implementation of balanced affinity. When balanced affinity works, it spreads threads evenly across cores while keeping the assignment consecutive: on a dual-CPU platform, for example, threads 0 through n/2-1 go on one CPU and the remainder on the other. This can be a great advantage over scatter for a properly written application that benefits from adjacent threads sharing cache.
It would be a waste of your time to try to implement an affinity scheme yourself; that work is a significant part of the value of the Intel performance libraries and of your OpenMP implementation. It took the OpenMP developers years to arrive at a useful scheme, and you would need separate versions for Windows (Microsoft threading) and for Linux or Mac (pthreads). Of course, you would first need to define your goals, which you haven't mentioned here.
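If the goal is merely a particular placement rather than a new runtime, the standard OpenMP 4.0 affinity controls already cover most cases without writing any affinity code. A sketch, portable across the Intel, GNU, and LLVM runtimes:

```shell
# Spread one thread per physical core instead of hand-coding pthread affinity.
export OMP_PLACES=cores       # one place per physical core
export OMP_PROC_BIND=spread   # spread threads evenly across the places
# ...then run your OpenMP binary as usual.
```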
I'm surprised that the Caffe documentation isn't clearer on whether it uses any threading of its own or leaves it entirely to MKL. When the OpenMP version of MKL is used without threading in the calling application, it has an effective internal affinity implementation. If you look up the use of nested OpenMP parallelism with MKL, you will find no such recommendation from Intel, along with some relevant comments from us. If the calling application uses TBB threading and calls MKL's OpenMP from inside a parallel region, you run into several problems: you must coordinate the number of threads at both levels, MKL's affinity scheme is destroyed, and there is a BLOCKTIME delay before MKL releases its threads after an MKL function completes.
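If you do end up threading around the OpenMP version of MKL, the usual mitigations are environment settings rather than code changes. A sketch (the settings are examples, not recommendations for your workload):

```shell
# Release OpenMP worker threads immediately after each MKL call, instead of
# letting them spin for the default KMP_BLOCKTIME (200 ms) and compete with
# the caller's TBB threads for cores.
export KMP_BLOCKTIME=0
# Run MKL serially inside an already-parallel caller to avoid oversubscription.
export MKL_NUM_THREADS=1
```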