Some numbers about parallelization

LRaim · ‎08-02-2016

For information.

Today I changed a core DLL of a dynamic simulation application (which at one point calls PARDISO) and ran some tests on vectorization and auto-parallelization.
This core DLL computes physical properties (density, enthalpy, etc.) of fluids and is probably where the computer's floating-point resources are used most heavily. I changed the compiler options related to optimization and ran some tests on the same problem.

The results obtained are as follows.
Compiler XE 2015 update 6.
Vectorization option AVX always active.

1) parallelization none, minimize size (sic!); run time: 167 s
2) parallelization none, optimization option O1; run time: 161 s
3) parallelization none, optimization option O3; run time: 154 s
4) parallelization enabled, optimization option O3; run time: 206 s

The computer, an Intel Core i7, has 8 logical CPUs. With no parallelization the average CPU usage is 60%. With parallelization the CPU usage ramps up to 99%.
One should conclude that parallelization burns the extra 40% of the CPU only to take more time.
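
To put the figures above side by side, here is a minimal Python sketch of the arithmetic. It assumes the quoted CPU usage (60% and 99%) is an average over the whole run and across all 8 logical CPUs, which the post does not state explicitly; the helper name `cpu_seconds` is only illustrative.

```python
# Run times (seconds) copied from the four test cases above.
run_times = {
    "1) no parallelization, minimize size": 167,
    "2) no parallelization, O1": 161,
    "3) no parallelization, O3": 154,
    "4) parallelization enabled, O3": 206,
}
LOGICAL_CPUS = 8

# Elapsed-time change of every case relative to the fastest one (case 3).
baseline = run_times["3) no parallelization, O3"]
for label, seconds in run_times.items():
    change = (seconds - baseline) / baseline * 100.0
    print(f"{label}: {seconds} s ({change:+.1f}% vs. case 3)")


def cpu_seconds(elapsed_s, avg_usage):
    """Approximate CPU time consumed: elapsed time x average usage x CPUs.

    Assumption: avg_usage is the mean load over all logical CPUs.
    """
    return elapsed_s * avg_usage * LOGICAL_CPUS


serial_cost = cpu_seconds(154, 0.60)    # case 3
parallel_cost = cpu_seconds(206, 0.99)  # case 4
ratio = parallel_cost / serial_cost
print(f"CPU time ratio, case 4 / case 3: {ratio:.2f}")
```

Under those assumptions, case 4 takes about a third longer in elapsed time than case 3 while consuming roughly 2.2 times the CPU time.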

jimdempseyatthecove · ‎08-02-2016

Are you using the Intel MKL PARDISO?

If so, then are you attempting/wanting to parallelize the code between the calls to the Intel MKL PARDISO?

And/or, are you attempting/wanting to parallelize the code that makes the calls to the Intel MKL PARDISO?

If yes to either of the last two questions, then consult the MKL forum. Without a more complete understanding of your application it is difficult to offer advice.

Jim Dempsey

LRaim · ‎08-02-2016

Jim,
the PARDISO subroutine being called is the Intel MKL one, and it works fine.
I only ran some tests on the auto-parallelization of the code that forms a core DLL and presented the results, which are quite interesting in my opinion and may be of interest to other people.
Optimization options of the Intel Fortran compiler are a subject of this forum.
Finally, I am not looking for advice about parallelization.

Best regards

Steven_L_Intel1 · ‎08-02-2016

My standard reply: have you run the program under VTune Amplifier XE and examined its parallelization analysis? It could be that your program spends too much time doing "setup" for parallel regions or is insufficiently parallelizable. Not all applications are suitable for parallel execution, at least not without some restructuring.

jimdempseyatthecove · ‎08-02-2016

If you are using MKL, then for parallel programs, depending on the version you have, one of the following applies:

a) MKL will instantiate an (OpenMP) thread pool for each thread of a parallel application (older versions). This leads to oversubscription, so for a parallel application you would typically link in the serial version of MKL, and for a serial application you would link in the parallel version (only one domain is parallelized except under explicit setup conditions).

b) MKL will (attempt to) share the application's OpenMP thread pool, typically with nested parallelism or task parallelism.

There are some links on the MKL forum as to how this is done.

Jim Dempsey

LRaim · ‎08-04-2016

Sorry for not being clearer in the explanation and conclusion.
At each integration time step the application calls PARDISO (MKL) once and the core DLL many times.
Since I had to rebuild the DLL, I spent some time changing the optimization options and running the same test case.
What the timing results suggest is that the Intel Fortran compiler did a very bad job when asked to auto-parallelize with a 100% probability threshold.
The results say that the compiler FORCED auto-parallelization, sucking up all the CPU (99%) while increasing the overall elapsed time.
When auto-parallelization is not activated, the maximum CPU usage (probably set by PARDISO) is only 60% and the test case completes in fewer seconds.

Regards

jimdempseyatthecove · ‎08-04-2016

Luigi,

The problem is likely not the Fortran compiler; rather, it is a lack of understanding of how auto-parallelization works, how MKL parallelization works, .AND. the controls necessary to get them to cooperate.

From the description of your application, my guess at what is going on is

Your application, via auto-parallelization, has its own thread pool.
MKL (parallel version), will create and maintain its own thread pool.

Each thread pool, presumably sized to the number of logical processors, will be competing for processor resources (oversubscription).

The default behavior of each thread pool is that, upon exit from a parallel region, each (non-master) thread enters a compute-intensive spin-wait for up to a default number of milliseconds (200) in anticipation of entering the next (or, in a loop, the same) parallel region before the spin-wait time expires. This reduces the latency of suspending and then resuming the non-master threads (of each thread pool). The effect, as configured by default, is to have 2x the number of logical processors in spin-wait, or 1x the number of logical processors in spin-wait while the other 1x are performing useful work.
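
The thread counts described above can be tallied in a few lines of Python. This is only a sketch of the arithmetic: it assumes, as guessed above, that each of the two pools is sized to the 8 logical processors reported earlier in the thread.

```python
LOGICAL_CPUS = 8  # from the original post

# Assumed pool sizes: each runtime defaults to one thread per logical CPU.
pools = {
    "auto-parallelized DLL": LOGICAL_CPUS,
    "MKL (parallel version)": LOGICAL_CPUS,
}

total_threads = sum(pools.values())
oversubscription = total_threads / LOGICAL_CPUS

# While one pool does useful work, the other pool's threads may still be
# spinning until their block time expires.
spinning_while_other_works = total_threads - LOGICAL_CPUS

print(f"threads: {total_threads}, logical CPUs: {LOGICAL_CPUS}")
print(f"oversubscription factor: {oversubscription:.1f}x")
print(f"threads possibly spin-waiting during useful work: {spinning_while_other_works}")
```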

To control this behavior, under this circumstance (two thread pools), consider setting the environment variable KMP_BLOCKTIME=0. This sets the spin-wait time to 0 (no spin-wait). This will cause the non-master threads exiting the parallel region to immediately suspend, thus making the logical processors available for the other thread pool to use (unencumbered).

Note, when running with one thread pool, you would not want to use KMP_BLOCKTIME=0 as this would introduce the additional overhead of thread suspension/resumption.
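
One way to try this without changing the system-wide environment is to set the variable only for the process being launched. The Python sketch below is an illustration under assumptions: the helper `env_with_blocktime` is hypothetical, and a small child Python process stands in for the real simulation executable, whose name is not given in this thread.

```python
import os
import subprocess
import sys


def env_with_blocktime(blocktime_ms=0, base_env=None):
    """Return a copy of the environment with KMP_BLOCKTIME set.

    The parent's own environment is left untouched.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["KMP_BLOCKTIME"] = str(blocktime_ms)
    return env


child_env = env_with_blocktime(0)

# Stand-in for the real application: a child process that simply reports
# the value it inherited. Replace the command with the actual executable.
result = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['KMP_BLOCKTIME'])"],
    env=child_env,
    capture_output=True,
    text=True,
    check=True,
)
seen_by_child = result.stdout.strip()
print("KMP_BLOCKTIME seen by child:", seen_by_child)
```

Scoping the setting to one launch makes it easy to compare runs with and without it, which matters given the caveat about single-pool runs.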

Jim Dempsey

TimP · ‎08-04-2016

Further to Jim's suggestions, you would want to ensure that MKL is permitted to spread its threads out one per physical core, as it does when parallel MKL is called from a single thread. Setting -Qparallel doesn't ask the compiler to guess whether your application can benefit from 2 threads per core.

Running threads on all hyperthreads may give you some satisfaction in terms of pinning your performance meter, but it's well known (read the MKL docs) that MKL can run faster with one thread per core.
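
As a rough illustration of the one-thread-per-core idea, the Python sketch below derives a thread count from the logical CPU count and places it in `MKL_NUM_THREADS`, the environment variable MKL reads for its thread limit. The figure of 2 hyperthreads per core is an assumption about the machine in this thread (8 logical CPUs taken as 4 physical cores), and the helper names are hypothetical.

```python
import os


def physical_core_count(logical_cpus, threads_per_core=2):
    """Estimate physical cores; threads_per_core=2 assumes Hyper-Threading."""
    return max(1, logical_cpus // threads_per_core)


def env_for_one_thread_per_core(logical_cpus, base_env=None):
    """Return a copy of the environment that caps MKL at one thread per core."""
    env = dict(os.environ if base_env is None else base_env)
    env["MKL_NUM_THREADS"] = str(physical_core_count(logical_cpus))
    return env


# 8 logical CPUs, as reported for the Core i7 in the original post.
env = env_for_one_thread_per_core(8, base_env={})
print(env)
```

The resulting dictionary can be passed as the environment of the launched application; whether this actually helps should be checked by timing the same test case.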