Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

Compiler 12.1 OpenMP performance at multicore systems - libiomp5md.dll

Miket
Beginner
I have a problem with the new version of the Intel C++ Compiler, 12.1.0.43 (Parallel Studio 2011 SP1). My x86 application runs about 30-40% slower on multicore systems (two X5690 processors, Windows 7 x64) compared to the same code compiled with an earlier version of the C++ Compiler.

With a smaller number of cores (a single i7-M640), performance is almost the same.

After some experiments I discovered that simply replacing the OpenMP DLL libiomp5md.dll with the earlier version restores the previous performance. In particular, I replaced libiomp5md.dll version 5.0.2011.606 with libiomp5md.dll version 5.0.2011.325.

Therefore the question is: what changed in libiomp5md.dll that could cause such a degradation? How can I restore the previous performance?

As a note: I compared performance on relatively simple problems for which a small number of cores would be sufficient, so the growing number of cores most likely caused the degradation. But I want to minimize this negative effect!

Regards,
Michael

1 Solution
jimdempseyatthecove
Honored Contributor III
Andrey's guess may or may not be correct. Your application, with presumably a single OpenMP context, should not oversubscribe threads. You may need to reduce the OpenMP thread pool size by some number of your non-OpenMP threads. KMP_BLOCKTIME controls inter-process interactions (due to both processes being fully subscribed) but should not affect intra-process interactions (due to a single thread pool not exceeding total processing resources).

Shorter KMP_BLOCKTIME values are a convenience to release time to other processes (either single- or multi-threaded). If a reduced KMP_BLOCKTIME increases performance within a single process, this indicates that the process is oversubscribed.

To confirm or reject multiple pools see if you can get the .DLL name(s) of the duplicated entries in VTune. This information might (should) be visible in one of the views.

The total number of threads, as shown in a per-thread report, will also be indicative of multiple thread pools.

If you observe multiple thread pools (DLL libraries), then use one of the tuning options to generate a call tree analysis. The report will point to which thread root is calling which library. Also note, it might be possible to combine a differently versioned static library with a DLL library (although I have not tested this possibility).



Jim Dempsey

37 Replies
jimdempseyatthecove
Honored Contributor III
Michael,

Can you profile the app using each library on the 2P system?
You should be able to identify the routine in libiomp5md.dll that is introducing the delay.
A 2P system may have to reach out through RAM for some synchronization whereas a 1P may be able to synchronize within the last level cache. The difference you see between library versions may be due to thread juxtapositions in your application (causing competing threads to reside on each processor as opposed to one processor) .OR. due to a bug fix in the library .OR. an inefficiency introduced into the newer library. Posting your identification of the routine in libiomp5md.dll may lead someone at Intel (reading this post) to explain/fix the problem.

Jim Dempsey

Miket
Beginner
Jim,

Unfortunately the result was obtained with a very large and complicated project. It is difficult to extract a test case based on this project.

Nevertheless I performed one more test on another dual-processor computer: 2x Intel 5160 (4 cores in total). In this case I also see a significant difference when the libiomp5md.dll versions mentioned in my first post are swapped. With the latest version of libiomp5md.dll I can clearly see a degradation of 10-15%. It is not as noticeable as on the 2x X5690 system, but still quite significant.

Should I create and submit a report based on these observations?

Michael

jimdempseyatthecove
Honored Contributor III
Michael,

I think without a simple reproducer submitting to Premier Support would be futile. They almost always require a reproducer. Your better route is to raise this issue here on this forum (as you have) with the purpose of canvassing other users about their experiences. I seem to recall similar issues with libiomp5md(mt).dll on 1P versus nP systems, but I cannot recall which version was reported to have this issue. Perhaps Steve L. or someone else can add to this observation. Running VTune or another profiler on the two systems/libiomp5md.dll combinations may help to identify the root cause: critical section, event, scheduler, adverse cache interaction, memory alignment, ...

There was also reported an issue where aligned malloc did not honor the alignment request. Although this is a C runtime library and/or O/S issue, the different libraries may have accidentally caused one to take a heavier hit in performance. The profiler may yield some insight as to what is happening. I know that this is not your job... your job is to worry about potential problems of retrograding the .dll version.

Jim Dempsey

Vladimir_P_1234567890
Hi Michael,
As Jim mentioned, without a reproducer it would be hard to understand where the problem is.
Could you list the OpenMP constructs that are mostly used in the program (function names, etc.)?
thanks,
--Vladimir

jimdempseyatthecove
Honored Contributor III
Michael,

Perhaps you can do the following to shed some light on the problem _without_ sending in code.

Using a profiler, make test runs using each library, and have each test run sufficiently long to produce reasonably accurate data (say a 20 second run). Usually the first report screen of a profiler will be routine names and percent of run time, sorted high to low by run time. Capture this report, preferably as text as opposed to a screenshot (text is easier to read). If you can capture the entire list it would be preferable (the sum of the lesser routines may eat up the 10%-15% difference).
What we would be looking for is one of the libiomp5md routines incurring additional overhead.

Jim Dempsey

Miket
Beginner

Hi Jim, Vladimir,

Sorry for some delay, it looks like I was probably affected by

DPD200134977 C++, Fortran issue with libiomp5md.lib, reduction -:

In rare cases my application had (rare!) random hangs inside one of OpenMP constructs. It occurred if libiomp5md.dll 5.0.2011.325 was loaded, while exactly the same application worked just fine with libiomp5md.dll 5.0.2011.606. In both cases the rest of application remained untouched.

I performed the required test with Intel VTune Amplifier XE 2011. I believe the results show some basic difference between these two versions of libiomp5md.dll, in particular, how aggressively OpenMP threads are utilizing available CPU cores.

For reference: tests were performed on a computer with 2 X5690 CPUs, HT = On, SpeedStep and TurboBoost = OFF. Therefore 24 logical cores are available in this system. In both cases only libiomp5md.dll was replaced. OpenMP is used in a DLL, and this DLL is loaded by a GUI application written in Delphi. GUI part is also multithreaded, typically DLL functions are called from different secondary threads of GUI in order not to block UI during computations. I tested a rather complicated algorithm that calls many different DLL functions (with OpenMP parallel constructs) in an external loop. The test run takes about 20-30 sec.

A) libiomp5md.dll 5.0.2011.325, summary of Lightweight Hotspots analysis

Elapsed Time: 22.565s
CPU Time: 336.113s
Instructions Retired: 599,580,000,000
CPI Rate: 1.938
Paused Time: 0s

B) libiomp5md.dll 5.0.2011.606, summary of Lightweight Hotspots analysis

Elapsed Time: 29.971s
CPU Time: 88.789s
Instructions Retired: 143,308,000,000
CPI Rate: 1.874
Paused Time: 0s

We can immediately see a significantly lower CPU Time in the second case!
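
One way to read the two summaries is to divide CPU Time by Elapsed Time, which gives the average number of logical cores kept busy over the run. The sketch below does only that arithmetic; the helper names `average_busy_cores` and `print_core_usage` are made up for illustration, and only the four input numbers come from the reports.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical helper: average number of logical cores kept busy,
// computed as total CPU time divided by wall-clock (elapsed) time.
double average_busy_cores(double cpu_time_s, double elapsed_s) {
    assert(elapsed_s > 0.0);
    return cpu_time_s / elapsed_s;
}

// Prints the ratio for both library versions using the reported figures.
void print_core_usage() {
    const double old_lib = average_busy_cores(336.113, 22.565);  // 5.0.2011.325
    const double new_lib = average_busy_cores(88.789, 29.971);   // 5.0.2011.606
    std::printf("old: %.1f busy cores, new: %.1f busy cores\n", old_lib, new_lib);
}
```

By this measure the old library keeps roughly 15 of the 24 logical cores busy on average, while the new one keeps about 3 busy. CPU Time also counts time that threads spend spin-waiting inside the runtime, so a higher figure does not by itself mean more useful work was done.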

Do you still need more detailed information on particular functions inside libiomp5md.dll?

At the moment I have a feeling that version 606 doesn't like it when an OpenMP parallel region is called by different threads of the application, as happens in my case.


Regards,
Michael

TimP
Honored Contributor III
Are you trying to have multiple independent instances of OpenMP running under various parent threads? OpenMP isn't necessarily well adapted to such a situation, but the newer library may be noticing what is happening and limiting the number of threads.
Did you try using affinity options (e.g. setting non-overlapping affinity masks via KMP_AFFINITY for the various OpenMP instances)?

Miket
Beginner

In my case instances of OpenMP are never running concurrently. They are launched sequentially one by one. Parent threads of GUI applications are often different, but they are synchronized in sequential order. Thus I do not see any need to use KMP_AFFINITY.

I checked the values of omp_get_num_threads, omp_get_max_threads, omp_get_thread_limit, omp_get_dynamic, kmp_get_blocktime, omp_get_schedule.

All these functions return the same values for both versions of libiomp5md.dll.

-Michael

jimdempseyatthecove
Honored Contributor III
Michael,

You provided a total application summary report.
With VTune you can also get a function by function summary report.
Example screen shot below.
In your case you will want to find the function names in each report with high relative instructions retired differences.

Jim




jimdempseyatthecove
Honored Contributor III
>>GUI part is also multithreaded, typically DLL functions are called from different secondary threads of GUI in order not to block UI during computations.

Are you saying the GUI part is coded using say pthreads, or beginthread, or ... (non-OpenMP thread),
and which each of these threads may concurrently call the DLL,
and where each DLL called function will use OpenMP under the assumption that the OpenMP thread pool is entirely owned (by its call context)?

If so, you should understand that OpenMP is not designed to operate this way.

What you have here is a situation where multiple app GUI and non-GUI threads have a "main" context (i.e. running outside a parallel region). Then any number of these threads may concurrently enter its first parallel region under the assumption (or effect) that it is the first parallel region of the application. Internally this may cause adverse effects for OpenMP. Externally, observed by your application, potentially you have one thread in each of these concurrent DLL calls at its independent main level with omp_get_thread_num() == 0. Would this cause any programming errors?

Jim Dempsey

Miket
Beginner
Jim,

I created 2 reports. Unfortunately it is not easy to copy them as text in a readable form (why does Amplifier have no export of the current report to HTML, for example?). I hope you will find the required information (I am new to Amplifier XE; I used only the bundled Composer version before). So the screenshots are below. I am afraid they are not very helpful, since the problem may be connected with the basic design of my software.

Regards,
Michael
==================

A) "Old" libiomp5md.dll 5.0.2011.325 [screenshot not preserved]

B) "New" libiomp5md.dll 5.0.2011.606 [screenshot not preserved]

Miket
Beginner

>> Are you saying the GUI part is coded using say pthreads, or beginthread, or ... (non-OpenMP thread),

Yes, some wrapper around beginthread/endthread (non-OpenMP).

>> and which each of these threads may concurrently call the DLL,

Not at all! By design I am avoiding concurrent calls of DLL functions having OpenMP parallel constructs. These calls are serialized. Other functions not having OpenMP inside are called, of course.

>> and where each DLL called function will use OpenMP under the assumption that the OpenMP thread pool is entirely owned (by its call context)?

Yes, it is designed in this way. I was assuming that serialization (see answer #2) makes this possible.
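
The serialization described in these answers could look roughly like the sketch below. It assumes a single process-wide mutex guarding every exported function that contains an OpenMP construct; `g_omp_entry_mutex` and `dot_product` are hypothetical names, not taken from the actual DLL. If the code is built without OpenMP support, the pragma is ignored and the loop simply runs serially.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical process-wide lock: only one caller at a time may be inside
// a DLL function that opens an OpenMP parallel region.
static std::mutex g_omp_entry_mutex;

// Hypothetical exported function containing an OpenMP construct.
double dot_product(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    std::lock_guard<std::mutex> lock(g_omp_entry_mutex);  // serialize entry
    double total = 0.0;
    const long n = static_cast<long>(a.size());
    #pragma omp parallel for reduction(+ : total)
    for (long i = 0; i < n; ++i) {
        const std::size_t k = static_cast<std::size_t>(i);
        total += a[k] * b[k];
    }
    return total;
}
```

A lock like this keeps two parallel regions from running at the same time, but the calling OS thread can still differ from one call to the next, which is the situation being discussed here.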

>> If so, you should understand that OpenMP is not designed to operate this way.

Some time ago I really did have a problem when, by accident (a programming error), I had concurrent calls to two DLL functions with OpenMP. It caused stability problems; after I serialized such calls the problem was gone, and I have no problems with correct results, stability, etc.

Are there any recommendations on how to use OpenMP in a DLL for my scenario? If yes, where can I find a description?

>> Externally, observed by your application, potentially you have one thread in each of these concurrent DLL calls at it's independent main level with omp_get_thread_num() == 0. Would this cause any programming errors?

Even with a single thread for every parallel region the program should work correctly with an obvious impact on performance.

Regards,
Michael

TimP
Honored Contributor III
The guide.gvs from OpenMP profile is already a text file, a more concise summary by parallel region of the time spent by each thread in major OpenMP functions.
.csv export (text file readable as spreadsheet) is a design feature of Amplifier; I don't know why restoring it is such a low priority.

Vladimir_P_1234567890
Great, thanks! I hope it will help.
--Vladimir

jimdempseyatthecove
Honored Contributor III
Michael,

The total run times for the two programs differ ~ 10:1 old:new
__kmp_fork_barrier, __kmp_x86_pause, __kmp_yield on old are ~100x on new.

In the "old" run I notice many of the functions have duplicate names!!!!
Looks like you have two "omp" libraries loaded.
e.g. two different versions of libiomp5md.dll or a combination of libiomp5md.dll and some other lib(i)omp(5)(md).dll
letters in () are subject to change or elimination.

Jim Dempsey

Vladimir_P_1234567890
Could you define the environment variable KMP_BLOCKTIME=200 and try again?
Or you can play with values e.g. from range 100-500.
--Vladimir

jimdempseyatthecove
Honored Contributor III
Vladimir,

KMP_BLOCKTIME is intended to reduce adverse interaction between applications as opposed to within a single application. While reducing KMP_BLOCKTIME may improve this particular application, assuming this is a symptom of oversubscription, it will not correct the underlying problem. The problem may be one of these two hypotheses:

a) One or more components of this application were built with different versions of the OpenMP library and Windows Side-by-Side is dutifully loading the components to use their respective libraries. The end result is potentially thread unavailability to either or each library, in a similar manner to interaction with a different application. In this case, reducing KMP_BLOCKTIME would/may improve performance, but it will not fix the underlying problem.

b) The two libraries are partially co-mingled between themselves (as opposed to running independently). The result of this is a near-deadlock as observed by the excessive times indicated in earlier post.

A potential way to determine a) or b) would be to count the number of threads used by the application.

a) ~= number of user application created threads + 2x number of threads expected for OpenMP thread pool
b) ~= number of user application created threads + 1x number of threads expected for OpenMP thread pool

Note, ~= due to OpenMP allocating the number of threads (specified/default) for its thread pool plus 0 or more helper threads. In either case Michael can compare total number of threads each version uses.
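
The two estimates above can be written as a small calculation. This is only a sketch of the arithmetic: the struct and function names are hypothetical, and the runtime's helper threads are deliberately not modeled, so the results are approximate in the same "~=" sense.

```cpp
#include <cassert>

// Expected total thread counts under the two hypotheses.
struct ThreadEstimate {
    int two_independent_pools;  // hypothesis a): each library runs its own pool
    int one_shared_pool;        // hypothesis b): a single pool is in use
};

// app_threads: threads created by the application itself (GUI, workers).
// omp_pool_threads: threads expected in one OpenMP thread pool.
ThreadEstimate estimate_threads(int app_threads, int omp_pool_threads) {
    assert(app_threads >= 0 && omp_pool_threads >= 0);
    return ThreadEstimate{app_threads + 2 * omp_pool_threads,
                          app_threads + omp_pool_threads};
}
```

For example, with a 24-thread pool and an assumed 6 application threads (a made-up number), `estimate_threads(6, 24)` gives about 54 threads for a) and about 30 for b), so the two cases should be easy to tell apart in a per-thread report.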

Jim Dempsey

Miket
Beginner
Jim,

>> Looks like you have two "omp" libraries loaded

Is it possible?? In the column "Module path" I see the same path/file name for all entries including duplicates. Also I checked again the list of loaded modules, the loaded file is correct.

It seems to me that duplicate entries appeared due to a bug in Amplifier. Before preparing the screenshot I played with different presentation of the result and tried to perform Compare operation.

Below is another run with the "old" version of libiomp5md.dll - fresh report.

Any recommendations for my scenario of using an OpenMP DLL from a non-OpenMP multithreaded GUI? Calls to OpenMP DLL functions are serialized, but are typically performed from different secondary threads of the GUI application.

Regards,
Michael





Miket
Beginner
Vladimir,

I already checked that in both versions ("old" and "new" libiomp5md.dll) the function kmp_get_blocktime() returns 200.

Thus there is no difference in this parameter, it cannot be a reason for slower execution.

Regards,
Michael

Miket
Beginner
Jim,

I checked that both versions return the same values for functions:

omp_get_num_threads() -> 1 outside a parallel region and
omp_get_num_threads() -> 24 inside a parallel region

Other functions return the same result in all cases (parallel and plain regions, "old" and "new" versions):

omp_get_max_threads() -> 24
omp_get_thread_limit() -> 32768
omp_get_dynamic() -> 0
kmp_get_blocktime() -> 200

omp_get_schedule( kind, modifier ) -> kind = 1, modifier = 0
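
A check like the one listed above could be collected in one place roughly as follows. This is a sketch, not the code actually used in the application: `OmpSettings`, `query_omp_settings` and `print_omp_settings` are hypothetical names, and the Intel-specific `kmp_get_blocktime()` is left out so that the snippet builds with any OpenMP 3.0 compiler. Without OpenMP support it falls back to single-threaded placeholder values.

```cpp
#include <cassert>
#include <cstdio>
#ifdef _OPENMP
#include <omp.h>
#endif

// Snapshot of the standard runtime settings compared between library versions.
struct OmpSettings {
    int max_threads;
    int thread_limit;
    int dynamic;
    int schedule_kind;
    int schedule_modifier;
};

// Reads the settings from the OpenMP runtime. When built without OpenMP 3.0
// support, the single-threaded placeholder values below are returned instead.
OmpSettings query_omp_settings() {
    OmpSettings s = {1, 1, 0, 0, 0};
#if defined(_OPENMP) && _OPENMP >= 200805
    omp_sched_t kind;
    int modifier = 0;
    omp_get_schedule(&kind, &modifier);
    s.max_threads = omp_get_max_threads();
    s.thread_limit = omp_get_thread_limit();
    s.dynamic = omp_get_dynamic();
    s.schedule_kind = static_cast<int>(kind);
    s.schedule_modifier = modifier;
#endif
    assert(s.max_threads >= 1);  // at least the calling thread
    return s;
}

// Prints one line that can be compared between the two DLL versions.
void print_omp_settings(const OmpSettings& s) {
    std::printf("max_threads=%d thread_limit=%d dynamic=%d schedule=(%d, %d)\n",
                s.max_threads, s.thread_limit, s.dynamic,
                s.schedule_kind, s.schedule_modifier);
}
```

Calling `print_omp_settings(query_omp_settings())` once with each DLL version and comparing the two lines shows whether these settings differ; identical output, as reported here, suggests the difference lies elsewhere in the runtime.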

Anything else we could check?

What about submitting my application in binary form for testing libiomp5md.dll ? I have a DEMO version that is able to run this test case without limitations. If you are interested, please, send me instructions how to do this privately.

Regards,
Michael