Solved: Implementation of OpenMP Scatter, Compact and Balanced Mode

CPati2 · ‎10-23-2017

Hi All,

If I am correct, OpenMP thread distribution modes like scatter, compact and balanced are implemented specifically for Xeon Phi and aren't supported in general by the OpenMP library.

Is there any documentation I can refer to in order to understand how this is implemented? In other words, which part of the code, and in which library or software, is called when we set environment variables like KMP_AFFINITY=<compact/scatter/balanced>?

Thanks.

McCalpinJohn · ‎10-23-2017

Only the "balanced" affinity is specific to the Xeon Phi processors.

The "compact" and "scatter" affinity are supported for all processors, and are described in detail in the C and Fortran compiler reference manuals.

CPati2 · ‎10-23-2017

Hi John,

I will have to refer to the ICC compiler reference manual and not the GNU one, correct? Also, will it give details on the code that implements these two modes?

Thanks.

jimdempseyatthecove · ‎10-24-2017

GNU will likely not support the Intel KMP_... environment variables. You will have to use equivalent or similar OMP_... and GOMP_... environment variables.

See: https://gcc.gnu.org/onlinedocs/libgomp/Environment-Variables.html

Jim Dempsey

TimP · ‎10-24-2017

In my experience with Intel MPI and OpenMP, KMP_AFFINITY=balanced is useful primarily when running Intel OpenMP under MPI in MPI_THREAD_FUNNELED mode, where you have multiple cores assigned to each MPI rank, each running multiple threads but fewer than 4 threads per core.

If you run plain OpenMP (just 1 rank), the KMP_HW_SUBSET facility is more convenient. It implies a balanced setting.

As Jim hinted, there are portable OpenMP environment variables which should be implemented in GNU OpenMP on Linux (although not on all target OSes). GOMP_CPU_AFFINITY should be implemented in Intel OpenMP as well as in Gomp. GNU and standard OpenMP settings will be overridden in Intel OpenMP by any conflicting Intel-specific environment variable.

For KMP_AFFINITY=compact and scatter, the OpenMP standard settings OMP_PROC_BIND=close and spread are equivalent. Apparently, OMP_PROC_BIND=true defaults to spread; see https://software.intel.com/en-us/node/695719

OMP_PLACES=cores also will cover those cases where you set 1 thread per core.

The GNU OpenMP documentation doesn't say which targets implement affinity (Windows, for example, does not). But I suppose you won't be trying GNU OpenMP for Windows on MIC. GNU OpenMP should warn at run time about any OpenMP standard environment settings which aren't implemented.
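As a rough illustration of the mapping described above, here is a minimal Python sketch that builds an environment using the standard OMP_PROC_BIND and OMP_PLACES variables in place of KMP_AFFINITY. The function name `portable_affinity_env` and the binary name in the comment are hypothetical, and the compact/close and scatter/spread pairing is only the approximate correspondence given in the post, not an exact equivalence.

```python
import os

# Approximate correspondence taken from the post above (an assumption, not an
# exact equivalence). "balanced" is deliberately left out.
KMP_TO_PROC_BIND = {"compact": "close", "scatter": "spread"}


def portable_affinity_env(kmp_mode, places="cores", base_env=None):
    """Return a copy of the environment with standard OpenMP affinity settings."""
    if kmp_mode not in KMP_TO_PROC_BIND:
        raise ValueError(f"no portable equivalent recorded for {kmp_mode!r}")
    env = dict(os.environ if base_env is None else base_env)
    # The post notes that Intel-specific variables override the standard ones
    # in the Intel runtime, so remove any conflicting KMP_AFFINITY value.
    env.pop("KMP_AFFINITY", None)
    env["OMP_PROC_BIND"] = KMP_TO_PROC_BIND[kmp_mode]
    env["OMP_PLACES"] = places
    return env


if __name__ == "__main__":
    env = portable_affinity_env("scatter")
    print(env["OMP_PROC_BIND"], env["OMP_PLACES"])
    # To launch a (hypothetical) OpenMP binary with these settings:
    # subprocess.run(["./my_openmp_app"], env=env, check=True)
```

Passing the returned dictionary as `env=` to `subprocess.run` keeps the settings local to the launched program instead of changing the calling shell.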

CPati2 · ‎10-24-2017

Hi Tim,

Thank you for the detailed reply.

My question's main goal is to understand how scatter and compact are implemented, i.e. how can someone implement a new scheme on their own? Exactly what will I need to read and understand to do so? Scatter is able to distribute threads evenly, but how does this happen on the code or library side?

I hope this is clear.

Thanks.

TimP · ‎10-25-2017

Affinity in OpenMP based on pthreads is implemented by setting the pthreads affinity mask. The Intel implementation presumably resembles the one in LLVM OpenMP.
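To make the idea of a per-thread affinity mask concrete, here is a small Linux-only Python sketch. It is not the OpenMP runtime's code; it only shows the same operating-system mechanism (a per-thread CPU mask) through Python's `os.sched_setaffinity`. The helper names `pin_current_thread` and `run_pinned_workers` are hypothetical.

```python
import os
import threading


def pin_current_thread(cpu):
    """Restrict the calling thread to one logical CPU (Linux only)."""
    tid = threading.get_native_id()
    # On Linux the underlying call accepts a thread id, so only this thread
    # is affected, not the whole process.
    os.sched_setaffinity(tid, {cpu})
    return os.sched_getaffinity(tid)


def run_pinned_workers(cpus):
    """Start one worker thread per entry in `cpus` and pin it to that CPU."""
    results = {}

    def worker(index, cpu):
        results[index] = pin_current_thread(cpu)

    threads = [threading.Thread(target=worker, args=(i, c))
               for i, c in enumerate(cpus)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    allowed = sorted(os.sched_getaffinity(0))
    # Pin up to two workers to the first CPUs this process may use.
    print(run_pinned_workers(allowed[:2]))
```

A scheme such as compact or scatter then amounts to choosing which CPU each thread index is pinned to; the pinning call itself stays the same.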

CPati2 · ‎10-25-2017

Hi Tim,

Thank you.

I made one observation. I have 64 physical cores (256 threads with 4 threads per core) on a Xeon Phi 7210. I ran the same benchmark with 8 threads in compact mode under the following conditions:

Condition 1: All 64 physical cores (256 threads) are online.
Condition 2: Only 2 cores (8 threads) are online.

I see a drastic performance degradation in condition 2. Would you know why? In both cases I give the benchmark 8 threads, which in compact mode should occupy 2 cores. Does the way compact mode places threads depend on how many physical cores are online?

Hi John and Jim: Please share your views too.

Thanks.
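When comparing the two conditions above, it can help to record which logical CPUs the kernel reports as online and which ones the process is allowed to run on. The sketch below is a minimal Linux-only example; it assumes the standard sysfs file /sys/devices/system/cpu/online, and the helper names are hypothetical.

```python
import os


def parse_cpu_list(text):
    """Expand a kernel CPU list such as '0-3,8,10-11' into sorted integers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def online_cpus(path="/sys/devices/system/cpu/online"):
    """Read the logical CPUs the kernel currently reports as online."""
    with open(path) as handle:
        return parse_cpu_list(handle.read())


if __name__ == "__main__":
    if os.path.exists("/sys/devices/system/cpu/online"):
        print("online:", online_cpus())
    if hasattr(os, "sched_getaffinity"):
        print("allowed for this process:", sorted(os.sched_getaffinity(0)))
```

Printing both lists before each benchmark run gives a record of what the runtime could see in condition 1 and condition 2.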

McCalpinJohn · ‎10-25-2017

It is very easy to get a large performance degradation if you attempt to use all the logical processors in the system for a single parallel user job.

For a description of the KMP_AFFINITY modes, go to https://software.intel.com/en-us/cpp-compiler-18.0-developer-guide-and-reference and download the C++ Compiler Reference Manual. Read the section entitled "Thread Affinity Interface", pages 2188 to 2198.

CPati2 · ‎10-25-2017

Hi John,

I think my question is not about this.

I give compact mode enough resources to run 8 threads. In the first case I give it 64 cores for the 8 threads (I am not turning any cores off via sysfs before running the benchmark in compact mode), so compact will fill cores 0 and 1 with all 8 threads.

In the second case I give compact mode 2 cores (4 threads per core, so 8 threads; I turn 62 cores off via sysfs so that only 2 cores are online before running the benchmark in compact mode), so it should fill both of these cores with the benchmark's threads.

The difference is the number of cores that are online (via sysfs). So, does compact mode calculate how to map threads based on how many cores are online via sysfs? Or are OS tasks other than the benchmark affecting this, because fewer physical cores are available for scheduling them?

I don't see performance degradation when I do similar analysis for scatter mode.

Thanks.
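The two cases described above can be written down as a toy placement model. This is a simplified sketch under stated assumptions (cores numbered 0..N-1, a fixed number of hardware threads per core, no offset or permute options); it is not the algorithm used by the Intel or LLVM runtime, and the function names are hypothetical.

```python
def compact_placement(n_threads, n_cores, threads_per_core):
    """Fill every hardware thread of a core before moving to the next core."""
    slots = n_cores * threads_per_core
    return [((t % slots) // threads_per_core, (t % slots) % threads_per_core)
            for t in range(n_threads)]


def scatter_placement(n_threads, n_cores, threads_per_core):
    """Deal threads round-robin across cores, then wrap to the next HW thread."""
    return [(t % n_cores, (t // n_cores) % threads_per_core)
            for t in range(n_threads)]


def cores_used(placement):
    """Return the sorted core numbers that appear in a placement."""
    return sorted({core for core, _ in placement})


if __name__ == "__main__":
    # 8 threads with 4 hardware threads per core, as in the post.
    print("compact, 64 cores:", cores_used(compact_placement(8, 64, 4)))
    print("scatter, 64 cores:", cores_used(scatter_placement(8, 64, 4)))
    print("compact,  2 cores:", cores_used(compact_placement(8, 2, 4)))
    print("scatter,  2 cores:", cores_used(scatter_placement(8, 2, 4)))
```

In this model the compact placement of 8 threads is the same whether 64 or 2 cores are available, while scatter changes from 8 cores to 2. The model says nothing about where other OS tasks run, so it cannot by itself explain the measured difference.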

CPati2 · ‎10-30-2017

Hi John, Tim and Jim,

I found the source code of Intel OpenMP hosted here.

Basically, if anyone modifies this code, then compiles and points to use this library, then the system will use new implementation rather than the default one. This is what I wanted to understand.

Thanks.

James_C_Intel2 · ‎10-31-2017

CPati2 wrote:

I found the source code of Intel OpenMP hosted here.

Please use the LLVM OpenMP runtime from http://openmp.llvm.org; it is effectively identical to the one that we ship with the compilers, but is up to date, whereas the site you point to is no longer maintained and will not have up-to-date versions of the runtime. (We took the decision not to keep updating that site since the LLVM code serves the same purpose, and it saves us a pile of work if we don't need to produce copies and validate them :-)).

CPati2 wrote:

Basically, if anyone modifies this code, then compiles and points to use this library, then the system will use new implementation rather than the default one. This is what I wanted to understand.

That seems a very different issue from the one you were previously asking about, which was to do with affinity, performance and so on. But I guess if what you really meant to ask was "How can I look at the OpenMP runtime code?" you now have the answer. (But asking that directly would have been simpler!)

CPati2 · ‎10-31-2017

Hi James,

I understand. Intel documentation on software projects is so detailed and to the point that my questions tend to follow that pattern!

Thanks.

jimdempseyatthecove · ‎10-31-2017

John>>It is very easy to get a large performance degradation if you attempt to use all the logical processors in the system for a single parallel user job.

This depends on the application. For the vast majority of applications I would concur that fewer than 4 threads per core would be optimal. I suggest that you test with 4 threads per core as well, as you may have one of those minority applications that benefit from using all 4 threads per core. As it so happens, I am working on a simulation program that is a mix of OpenMP and MPI which does perform better using all 4 threads per core. This application manipulates many instances of smaller collections of small arrays (~4x4, 6x6, where on KNL I convert 6x6 to 6x8; the small arrays are used in matrix multiply). This application also uses OpenMP tasks as opposed to OpenMP DO (for) loops. So maybe OpenMP task-based programs with relatively short loops of half-way vectorizable data might benefit from all HTs.

A typical simulation run on 1 KNL (7210) takes 26 hours, 39 minutes to run 8 years of simulation time. This application scaled well on up to 8 KNL nodes on the Colfax Cluster. I couldn't get enough nodes on their cluster for the 16- and 32-node tests, so I do not know how well it scales beyond 8 nodes. The initial production system may have 12 or 16 nodes of KNLs.

Jim Dempsey

jimdempseyatthecove · ‎10-31-2017

FWIW: For a production system, we are interested in total throughput of the system and not necessarily in obtaining maximum flops per logical processor. IOW we are happy to trade off 10% of flops, in order to gain 33%-(10% of 33%) of throughput.

Jim Dempsey

James_C_Intel2 · ‎10-31-2017

jimdempseyatthecove wrote:

This application manipulates many instances of smaller collections of small arrays (~4x4, 6x6, where on KNL I convert 6x6 to 6x8, small arrays use in matrix multiply)

Jim, I hope you have evaluated libXSMM:

LIBXSMM is a library for small dense and small sparse matrix-matrix multiplications as well as for deep learning primitives such as small convolutions targeting Intel Architecture. Small matrix multiplication kernels are generated for the following instruction set extensions: Intel SSE, Intel AVX, Intel AVX2, IMCI (KNCni) for Intel Xeon Phi coprocessors ("KNC"), and Intel AVX‑512 as found in the Intel Xeon Phi processor family (Knights Landing "KNL", Knights Mill "KNM") and Intel Xeon processors (Skylake-SP "SKX"). Historically small matrix multiplications were only optimized for the Intel Many Integrated Core Architecture ("MIC") using intrinsic functions, meanwhile optimized assembly code is targeting all aforementioned instruction set extensions (static code generation), and Just‑In‑Time (JIT) code generation is targeting Intel AVX and beyond. Optimized code for small convolutions is JIT-generated for Intel AVX2 and Intel AVX‑512.

jimdempseyatthecove · ‎10-31-2017

James,

Thanks, I will give that a look. libXSMM may or may not be suitable for the application I have. It is written in Fortran, and the particular tasks have approximately 20 layers of 6x6 matrices by which an incoming 6x6 is chain multiplied going one way across the chain, then reflected back in reverse order after conditioning the first result. I've rearranged the initial 6x6 to be an 8x6 (8 columns of doubles x 6 rows) and then produce a row of result columns, cache line by cache line. This code optimized very well, producing interleaved and aligned 512-bit vector operations unrolled 6 times (for 6 rows) and looped for the number of layers. I cannot conceptually see how it can be done any faster.

I am examining the code to see if the 20 layers of 6x6 arrays are re-used more than once. If so, I am considering testing a precomputed 6x6 transformation array, such that the complete transformation can be performed with a single 6x6 matrix multiply with the 6x6 (or 8x6). I am not sure at this time how often these input arrays are re-used; I will have to take statistical samples and see what shows up.

Jim Dempsey
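The precomputation idea in the post above rests on associativity of matrix multiplication: if the same layers are applied to many inputs, their product can be formed once. Here is a minimal pure-Python sketch of that idea, assuming a plain one-way chain only; the reflection pass and the conditioning step from the post are not modelled, and all function names are hypothetical.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def apply_chain(layers, x):
    """Multiply x by each layer in turn: x <- layer * x."""
    for layer in layers:
        x = matmul(layer, x)
    return x


def collapse_chain(layers):
    """Precompute T = L_n ... L_2 L_1 so that apply_chain(layers, x) == T * x."""
    n = len(layers[0])
    total = [[int(i == j) for j in range(n)] for i in range(n)]  # identity
    for layer in layers:
        total = matmul(layer, total)
    return total


if __name__ == "__main__":
    # Small integer test data so the comparison is exact.
    layers = [[[(7 * i + 3 * j + l) % 5 - 2 for j in range(6)]
               for i in range(6)] for l in range(20)]
    x = [[(i + 2 * j) % 3 for j in range(6)] for i in range(6)]
    t = collapse_chain(layers)
    print(apply_chain(layers, x) == matmul(t, x))
```

With floating-point data the two results would agree only up to rounding, and the precomputation pays off only if the same set of layers is re-used for several inputs, which is the open question in the post.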

James_C_Intel2 · ‎11-01-2017

James,

It sounds as if you have it all well under control, but at least now you have another tool in your tool-chest which may be useful elsewhere!

-- Jim

Reddy__Nani · ‎11-16-2017

Tim P. wrote:

In my experience with Intel MPI and OpenMP, KMP_AFFINITY=balanced is useful primarily when running Intel OpenMP under MPI in MPI_FUNNELED mode, where you have multiple cores assigned to each MPI rank, each running multiple threads but less than 4 threads per core.

If you run plain OpenMP (just 1 rank), the KMP_HW_SUBSET facility is more convenient. It implies a balanced setting.

As Jim hinted, there are portable OpenMP environment variables which should be implemented in gnu OpenMP on linux (although not on all target OS). The GOMP_CPU_AFFINITY should be implemented on Intel OpenMP as well as Gomp. gnu and standard OpenMP settings will be over-ridden in Intel OpenMP by any conflicting Intel specific environment variable.

For KMP_AFFINITY=compact and scatter, the OpenMP standard settings OMP_PROC_BIND=close and spread are equivalent. Apparently, OMP_PROC_BIND=true defaults to spread https://software.intel.com/en-us/node/695719

OMP_PLACES=cores also will cover those cases where you set 1 thread per core.

Gnu OpenMP documentation doesn't tell which targets have affinity implementation (Windows, for example, does not). But I suppose you won't be trying gnu openmp for windows on MIC. Gnu OpenMP should warn at run time about any OpenMP standard environment settings which aren't implemented.

I got it now, Tim. Thanks for the detailed explanation.

Ha__Tran · ‎11-25-2017

I don't see performance degradation when I do similar analysis for scatter mode.

Eric__John · ‎10-06-2019

jimdempseyatthecove (Blackbelt) wrote:
John>>It is very easy to get a large performance degradation if you attempt to use all the logical processors in the system for a single parallel user job.
This depends on the application. For the vast majority of applications I would concur that less than 4 threads per core would be optimal. I suggest that you test with 4 threads per core as well as you may have one of those minority applications that benefit using all 4 threads per core. As it so happens I am working a simulation program that is a mix of OpenMP and MPI which does perform better using all 4 threads per core. This application manipulates many instances of smaller collections of small arrays (~4x4, 6x6, where on KNL I convert 6x6 to 6x8, small arrays use in matrix multiply). This application also uses OpenMP Tasks as opposed to OpenMP DO (for) loops. So maybe OpenMP task-based programs with relatively short loops of half-way vectorizable data might benefit from all HT's.
A typical simulation run on 1 KNL (7210) takes 26 hours, 39 minutes) to run 8 years of simulation time. This application scaled well on up to 8 KNL nodes on the Colfax Cluster. The 16 and 32 nodes tests couldn't get availability of nodes on their cluster so I do not know how well it scales beyond 8 nodes. The initial production system may have 12 or 16 nodes of KNLs.
Jim Dempsey

It is very easy to get a large performance degradation if you attempt to use all the logical processors in the system for a single parallel user job.

Thank you. This will be very helpful for me!