Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.
7953 Discussions

Compiler 12.1 OpenMP performance on multicore systems - libiomp5md.dll

Miket
Beginner
2,053 Views
I have a problem with the new version of the Intel C++ Compiler, 12.1.0.43 (Parallel Studio 2011 SP1). My x86 application shows a performance degradation of about 30-40% on multicore systems (two X5690 processors, Windows 7 x64) compared to the same code compiled with an earlier version of the C++ Compiler.

With a smaller number of cores (a single i7-M640) performance is almost the same.

After some experiments I discovered that simply replacing the OpenMP DLL libiomp5md.dll with the earlier version restores the previous performance. In particular, I replaced libiomp5md.dll version 5.0.2011.606 with libiomp5md.dll version 5.0.2011.325.

Therefore the question is: what changed in libiomp5md.dll that could cause such degradation? How can I restore the previous performance?

As a note: I compared performance on relatively simple problems where a small number of cores was sufficient; the growing number of cores is most likely what causes the degradation. But I want to minimize this negative effect!

Regards,
Michael
0 Kudos
1 Solution
jimdempseyatthecove
Honored Contributor III
2,053 Views
Andrey's guess may or may not be correct. Your application, with presumably a single OpenMP context, should not oversubscribe threads. You may need to reduce the OpenMP thread pool size by the number of your non-OpenMP threads. KMP_BLOCKTIME controls inter-process interactions (due to both processes being fully subscribed) but should not affect intra-process behavior (since a single thread pool does not exceed the total processing resources).

Shorter KMP_BLOCKTIME values are a convenience to release time to other processes (either single- or multi-threaded). If a reduced KMP_BLOCKTIME increases the performance within a single process, then this is indicative that the process is oversubscribed.

To confirm or reject multiple pools, see if you can get the .DLL name(s) of the duplicated entries in VTune. This information might (should) be visible in one of the views.

The total number of threads, via a report by thread, will also be indicative of multiple thread pools.

If you observe multiple thread pools (one per DLL library), then use one of the tuning options to generate a call tree analysis. The report will point to which thread root is calling which library. Also note, it might be possible to combine a different versioned static library with the DLL library (although I have not tested this possibility).



Jim Dempsey

View solution in original post

0 Kudos
37 Replies
Andrey_C_Intel1
Employee
511 Views
Michael,

The block time can still be the reason for slow execution. Because it is not a static parameter, it may be changed by the library dynamically, e.g. when the number of threads exceeds the number of available compute units. So please try the KMP_BLOCKTIME setting suggested by Vladimir; this setting will prevent the OpenMP library from changing its behavior dynamically.

The library is not perfect, so when it sees that the number of threads becomes big it decides that this may cause oversubscription, and drops the block time to 0. The old library didn't do this, and that caused problems in real oversubscription cases. In your case there is no real oversubscription, so it is better to keep the block time at some reasonable value, not 0. It is not simple for the library to distinguish "real" oversubscription from "virtual" oversubscription where most threads are sleeping, so the change in the library hurt your particular case, I think.
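The KMP_BLOCKTIME setting mentioned above can be pinned from the environment before the application starts (a minimal sketch; the value is in milliseconds, and 200 matches the runtime's traditional default):

```shell
# Pin the block time so the Intel OpenMP runtime cannot drop it to 0 dynamically.
# On Windows cmd the equivalent is:  set KMP_BLOCKTIME=200
export KMP_BLOCKTIME=200
echo "$KMP_BLOCKTIME"
```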

Regards,
Andrey
0 Kudos
jimdempseyatthecove
Honored Contributor III
511 Views
Michael,

Do not use omp_get_num_threads() inside a parallel region, as this will return the number of threads available to the current OpenMP parallel region. What I am interested in is the total number of threads, and in particular the total number of OpenMP threads in potentially (multiple) concurrent parallel regions. Postulation a) in the prior message assumes multiple OpenMP thread pools (one per version of the DLL component, or per caller to the DLL component (assuming multiple versions was the reason for duplicate function names in the VTune report)).

Sorry about (())'s

Jim
0 Kudos
Andrey_C_Intel1
Employee
511 Views
Jim,

It is unlikely that multiple OpenMP runtimes are executing in the application. By default the application should abort in this case. The user should explicitly set the KMP_DUPLICATE_LIB_OK environment variable in order to be able to work with multiple OpenMP runtimes concurrently.

Duplicated names in Amplifier's report may be a problem in Amplifier itself.

- Andrey
0 Kudos
Miket
Beginner
511 Views

Andrey,

Super! It was exactly what I was asking for. This change affected my application. I ended up with

kmp_set_defaults( "KMP_BLOCKTIME=200" );

in the DLL_PROCESS_ATTACH case of DllMain, since kmp_set_blocktime affects only the calling thread's settings.

Performance is restored back, thank you.

Regards,
Michael

0 Kudos
Miket
Beginner
511 Views
Jim,

No problems with ((()))'s, I also often overuse them.

It looks like Andrey's guess was correct, please, see my answer.

Thank you for the efforts,
Michael
0 Kudos
jimdempseyatthecove
Honored Contributor III
2,054 Views
Andrey's guess may or may not be correct. Your application, with presumably a single OpenMP context, should not oversubscribe threads. You may need to reduce the OpenMP thread pool size by the number of your non-OpenMP threads. KMP_BLOCKTIME controls inter-process interactions (due to both processes being fully subscribed) but should not affect intra-process behavior (since a single thread pool does not exceed the total processing resources).

Shorter KMP_BLOCKTIME values are a convenience to release time to other processes (either single- or multi-threaded). If a reduced KMP_BLOCKTIME increases the performance within a single process, then this is indicative that the process is oversubscribed.

To confirm or reject multiple pools, see if you can get the .DLL name(s) of the duplicated entries in VTune. This information might (should) be visible in one of the views.

The total number of threads, via a report by thread, will also be indicative of multiple thread pools.

If you observe multiple thread pools (one per DLL library), then use one of the tuning options to generate a call tree analysis. The report will point to which thread root is calling which library. Also note, it might be possible to combine a different versioned static library with the DLL library (although I have not tested this possibility).



Jim Dempsey
0 Kudos
Miket
Beginner
511 Views
Jim,

Thanks a lot for the detailed explanations. I checked the situation again and can confirm that my application really creates an additional thread pool at some point. Thus the total number of OMP worker threads grows from 24 to 48.

The typical situation when it happens looks as follows. One non-OpenMP thread calling a DLL function with an OpenMP parallel construct finishes, and immediately another non-OpenMP thread starts. This thread also calls a DLL function with an OpenMP parallel construct. It is very important that the time difference is very small; this is why I started to notice this only on the 3.5GHz X5690 system.

If I insert Sleep(200) before the second call to the DLL, an additional thread pool is not created and the application runs with 24 OMP worker threads.

So I can conclude at this point that the current implementation of OpenMP doesn't like calls from different non-OpenMP threads, even if these calls are properly serialized.

I guess that a major revision of the threading model in the GUI part of my application may be necessary. For example, only one computational non-OpenMP thread obtaining computational tasks of different kinds and managing their proper execution. With this approach, all DLL functions with OpenMP parallelization will always be called in the context of the same external thread.

Regards,
Michael

0 Kudos
jimdempseyatthecove
Honored Contributor III
511 Views

Michael,

This may be a workaround hack.

With a single OpenMP thread pool you would normally want a KMP_BLOCKTIME of some reasonable amount (~200ms). With your app creating multiple pools, setting KMP_BLOCKTIME to 0 would "mitigate" the situation somewhat (as you observed) but is not really what you want. If you do not mind experimenting, try using KMP_BLOCKTIME=200, then

int saveBlockTime = kmp_get_blocktime(); // Intel extension
kmp_set_blocktime(0);
#pragma omp parallel
{
    if(kmp_get_blocktime() == 9999) printf("Not going to happen\n");
}
kmp_set_blocktime(saveBlockTime);
callYourDllHere(args);

See if this gives you the performance back (when multiple threads can call OpenMP).

Jim Dempsey

0 Kudos
jimdempseyatthecove
Honored Contributor III
511 Views
I forgot to mention (you probably figured it out anyway):
The intention is to have a reasonable block time for parallel regions in your application
... except for the last region before your call into the DLL.
... this parallel region has 0 block time.

*** Note ***

This assumes you mutex-ized the calls to the DLL
*** and only call from the "main" of your app and any non-OpenMP threads launched ***
Should you call from within a parallel region, then all bets are off.

Jim Dempsey
0 Kudos
SergeyKostrov
Valued Contributor II
511 Views
Hi Michael,

I've been using a very simple solution to track down an "unexplained" problem or performance degradation with a Logging API, instead of Code Profilers or Performance Analyzers.

Usually, it is a very time consuming process, but you need to UNDERSTAND the problem, right? It means that after some time you won't have a choice.

So, this is what I recommend and what I've done many times in the past:

1. Create a txt-log file with as simple as possible a Logging API (you don't need a new complicated software subsystem which could bring other problems);

2. Integrate Logging calls into the software subsystem of your application which experiences the performance degradation;

3. Start with just two Logging API calls, that is, when processing "Starts" and when processing "Ends";

4. Test as well as possible with the "Right-DLL";

5. Replace the "Right-DLL" with the "Wrong-DLL";

6. Test as well as possible with the "Wrong-DLL";

7. Compare execution times (as I understand it you already have some statistics, but it looks like they didn't help);

8. Narrow down your search, that is, add a couple more Logging API calls;

9. Repeat steps 4, 5, 6 and 7;

10. Compare results and try to identify all parts with different execution times;

11. Repeat steps 4 through 8, and so on...

Overall, it could take many, many hours, or even days and weeks, of careful testing and analyzing. But I truly believe that you'll finally find the couple of code lines "responsible" for the performance degradation.

Remember that Internet activity, gaming, online chatting, paging to a virtual file, etc., could affect execution times! Your tests must be done in comparable environments.

Here is an example:

...

uiTicksStart = SysGetTickCount();
...
// Some part of code to be tested
...
uiTicksEnd = SysGetTickCount();

LogToFile( RTU("Completed in: %ld ticks\n"), ( RTint )( uiTicksEnd - uiTicksStart ) );
...

Also, I think it is a real problem for your project that you don't have isolated test cases.

Best regards,
Sergey

Senior C++ Software Developer

PS: Sometimes even a simple output to a console window could help to identify the problem.

0 Kudos
jimdempseyatthecove
Honored Contributor III
511 Views
Michael,

The following conveyance of my thoughts is with respect to getting you running with your current code base, as opposed to waiting for a libiomp5 fix.

After thinking about your problem and symptoms, I think I may now offer better advice (difficult since I am doing this by proxy).

The release of kmp_blocktime should come _after_ the DLL call.
However... consider the situation:

(arbitrary thread outside parallel region)
for(...)
{
    dll_fn1();
    dll_fn2();
    ...
    dll_fnn();
}

Where each of the above functions use parallel regions.

When these functions have short-lived parallel regions you would likely not want a short block time; therefore it may be advantageous to place the release (kmp_set_blocktime(0); followed by an all-thread dummy parallel region) after the for loop.

When these functions are long-lived you may or may not want long block times (experimentation warranted).

Added to this foray, you apparently have other threads doing the same thing, with your current "fix" being a mutex before each call to the DLL.

In the case above where you have the series of short-lived functions, the mutex and per-call release of the thread pool will adversely affect your performance.

Considering the above train of thought possibly something like this may be worth investigating:

1) Remove the mutex.
2) Code all app created threads as if they were cooperative separate OpenMP processes (actually there is no code change but an awareness to your situation).
3) Add a global variable:
volatile long countOfActiveSessions = 0;
4) Add a shared function:
void EnteringSession()
{
    _InterlockedIncrement(&countOfActiveSessions);
    int nThreads = (omp_get_num_procs() // .or. omp_get_num_threads()
                    + (omp_get_num_procs() / 2))
                   / countOfActiveSessions;
    if(nThreads == 0) ++nThreads;
    omp_set_num_threads(nThreads);
}
5) Add a shared function:
void ExitingSession()
{
    _InterlockedDecrement(&countOfActiveSessions);
    int old_blocktime = kmp_get_blocktime();
    kmp_set_blocktime(0);
    #pragma omp parallel
    {
        if(countOfActiveSessions < 0)
            printf("Not going to happen\n");
    }
    kmp_set_blocktime(old_blocktime);
}

6) Then prior to the for(... loop containing the series of DLL calls, insert a call to EnteringSession(), and following the for(... loop insert a call to ExitingSession().
.OR.
prior to a single call (or short run of calls) to the DLL insert a call to EnteringSession(), and following the call (or run of calls) insert a call to ExitingSession().

Not seeing your application it is hard to ascertain if the above is the best technique to apply, but this may be a good starting point for experimentation.

Remaining unknowns:

Does (would) each "Session" use all the threads?
e.g. parallel sections with small number of sections.

Are any of the calls nested in the DLL, or are any of the calls to the DLL made from a nested region?
Note creating sessions for each nest level may be appropriate or may not. Some examination and experimentation may be warranted.

You may need to expand EnteringSession/ExitingSession to take an argument containing a load or weight value for the session. The calling weight is added to the number of sessions, and the ratio of that weight to the new session count is used to prorate the reapportionment of the number of threads to use.

This should get you on track to optimizing your application.

Jim Dempsey
0 Kudos
Miket
Beginner
511 Views
Jim,

Thank you for the proposal, it may be quite useful!

At the moment adding

kmp_set_defaults( "KMP_BLOCKTIME=200" );

in DllMain restored the previous performance...

...But it also restored the stability issue that I mentioned in my first posts. I believe it is connected with the additional thread pool created by OpenMP when the next parallel construct is called too quickly from a different thread.

I will try the proposed approach and will also try to redesign the GUI part in order to have just one "Computing" thread receiving different work Tasks. It will require some time to implement, of course, since the project is quite complicated.

Thank you again,
Michael

0 Kudos
Miket
Beginner
511 Views
Hi Sergey,

Thank you for the proposal. In fact any logging of this kind is problematic for my application since it affects timings a bit. The problem often disappears when I introduce even a tiny delay before starting another non-OpenMP parent thread.

I performed similar tests using not log files but a kind of messaging system based on shared memory. These tests plus Amplifier XE helped to reveal the situations when an additional (unnecessary) thread pool is created by the OpenMP system; now I am trying to find a way to avoid this.

And of course I have a set of test cases, but they are also bound to the main GUI application, since any test requires rather complicated preconfigured data structures that have to be loaded from custom databases. There is an obvious disadvantage: I cannot submit such test cases for independent analysis, I can only use them internally.

Best regards,
Michael
0 Kudos
Andrey_C_Intel1
Employee
511 Views
Hi Michael,

I just want to clarify a little bit the way the OpenMP runtime works with threads. When one master thread creates worker threads for a parallel region, these threads cannot be re-used by another master thread while the first master thread is alive. This happens because the OpenMP runtime keeps these threads for re-use by the first master thread, as the OpenMP specification (indirectly) requires; the runtime cannot know that you are not going to launch the next parallel region from the same master thread. When the first master thread dies, its worker threads may be re-used in a parallel region of another master thread, and you possibly observed this situation in the Sleep(200) case.

So in order to re-use the same thread pool for many parallel regions you should either launch all regions from the same master thread, or ensure that the previous master is dead before starting a parallel region in another master thread.

Regards,
Andrey
0 Kudos
jimdempseyatthecove
Honored Contributor III
511 Views
Michael,

I think I have some additional improvements to suggest.

Use the session technique, with or without weights, as outlined before and with the following changes.

volatile long nSessions = 0;
long nProcs = 0; // initialized at start of program
__declspec (thread) long priorThreadCount = 0;

void releaseThreadPool()
{
    int oldBlockTime = kmp_get_blocktime();
    kmp_set_blocktime(0);
    #pragma omp parallel
    {
        if(nProcs < 0)
            printf("not going to happen\n");
    }
    kmp_set_blocktime(oldBlockTime);
}

void setThreadCount()
{
    long currentThreadCount =
        nProcs / (nSessions + (nSessions / 2)); // or your weight function
    if(currentThreadCount < priorThreadCount)
        releaseThreadPool();
    priorThreadCount = currentThreadCount;
    omp_set_num_threads(currentThreadCount);
} // void setThreadCount()

void EnterSession()
{
    _InterlockedIncrement(&nSessions);
    setThreadCount();
}

void ExitSession()
{
    _InterlockedDecrement(&nSessions);
    releaseThreadPool();
}

...
(some thread outside parallel region)
EnterSession();
for(...)
{
    setThreadCount();
    DLLfunc1();
    setThreadCount();
    DLLfunc2();
    ...
    setThreadCount();
    DLL_lastFuncInLoop();
}
ExitSession();
----------- .AND./.OR. -------------
(some thread outside parallel region)
for(...)
{
    EnterSession();
    setThreadCount();
    DLLfunc1();
    setThreadCount();
    DLLfunc2();
    ...
    setThreadCount();
    DLL_funcn();
    ExitSession();
    otherLongNonOpenMPfunction();
}

Jim Dempsey

0 Kudos
Miket
Beginner
511 Views
Jim, Andrey,

I am very grateful for this fruitful discussion and for all the advice. Finally I came to the conclusion that I need to redesign the threading approach in my GUI part (non-OpenMP). I will use the thread pool/task paradigm; all DLL calls to functions with OpenMP parallel constructs will be performed from the same worker thread. I believe it will make the current OpenMP implementation happier.

Indeed, I can see the creation of another 24 worker threads if an OpenMP DLL function is called too quickly in the context of a different external thread. When all calls are serialized and bound to a single external thread, everything works just fine.

Regards,
Michael

0 Kudos
jimdempseyatthecove
Honored Contributor III
511 Views
Michael, and others reading these forum messages....

In my last post outlining a programming strategy for use in a single parallel application (process) calling a DLL (independently parallelized per calling thread), it should be intuitively obvious that with a little more work the "sessions" technique can be extended to multiple independent parallel processes. IOW, the session count is stored/maintained in a system-wide accessible object (e.g. registry, memory-mapped file, etc...). The extension of this technique would yield cooperative multi-threading amongst participating multi-threading processes.

Although this does not address Michael's current situation, it quite easily will address future situations. An example of this is that Michael might at some time run two copies of his current application at the same time (on different data sets).

Jim Dempsey
0 Kudos
Reply