From time to time we get
OMP: Warning #215: Cannot determine machine load balance - Using KMP_DYNAMIC_MODE=thread limit
What is this message and is it harmful?
Hello,
KMP_DYNAMIC_MODE selects the method used to determine the number of threads to use for a parallel region when OMP_DYNAMIC=true. So this is a potential performance warning.
But I'm not sure how this OpenMP question relates to the TBB forum.
--Vladimir
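To see what the runtime actually decides, a small probe can report whether dynamic adjustment is enabled and how many threads a parallel region really received. This is only a sketch: `TeamInfo` and `probe_team` are made-up names for illustration, and the code falls back to a single thread when it is compiled without OpenMP support.

```cpp
#include <cassert>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical record of what the OpenMP runtime reports for one parallel region.
struct TeamInfo {
    bool dynamic_enabled;  // true if dynamic adjustment of team size is on (OMP_DYNAMIC=true)
    int max_threads;       // upper bound the runtime would use for a new region
    int actual_threads;    // team size the region actually got
};

// Hypothetical helper: open one parallel region and record its team size.
TeamInfo probe_team() {
    TeamInfo info{false, 1, 1};
#ifdef _OPENMP
    info.dynamic_enabled = omp_get_dynamic() != 0;
    info.max_threads = omp_get_max_threads();
    #pragma omp parallel
    {
        // Only one thread writes the shared result.
        #pragma omp single
        info.actual_threads = omp_get_num_threads();
    }
#endif
    assert(info.actual_threads >= 1);
    return info;
}
```

Calling `probe_team()` repeatedly and logging the result would show whether the team size changes between regions while dynamic adjustment is on.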
Thanks Vladimir!
Could you move the thread to the OpenMP forum? I must have been blind while browsing through the forum list.
So OpenMP cannot set the maximum number of threads because it has lost count of how many cores the machine has?
What will be used then?
The runtime knows the machine's hardware concurrency, but with "OMP_DYNAMIC=true" set it also looks at the system-wide number of active threads in order to load-balance this particular application. In your case you might get oversubscription if, for example, you run two similar OpenMP applications on the machine that each expect to use the full hardware concurrency.
Re: OpenMP forum - Actually there are several forums where you can ask OpenMP questions, and none of them has OpenMP in the name :). You need to select the appropriate C++ or Fortran compiler forum for OpenMP questions.
--Vladimir
CPU oversubscription would be bad since we are latency sensitive. How can we fix/debug this issue?
Alois
Alois K. wrote:
How can we fix/debug this issue?
Just run one OpenMP process at a time. :)
--Vladimir
That would be nice but unfortunately we do have several processes doing OpenMP work from time to time. If the latency suffers due to oversubscription we have to fix it. What should we do?
Actually, the CPU load is checked at the beginning of every parallel region. So I think that getting such a warning from time to time is not a problem, as long as you do not use _one big_ parallel region per program.
--Vladimir
Hi Alois,
The message you've got means that the OpenMP runtime library failed to determine the machine load. This can happen for various reasons depending on the OS you are using (e.g. something is wrong with getloadavg on OS X, with NtQuerySystemInformation on Windows, or with reading the /proc file system on Linux). After that the library switches to the "thread limit" method, which tries to use all resources of the machine, and it will not attempt to determine the machine load any more.
To avoid oversubscription in this case, the only solution I can think of is to manually limit the number of threads for each process so that it only uses part of the machine. E.g. if only two OpenMP processes run simultaneously on the machine, you can give half of the resources to each; for three simultaneous processes, one third of the resources to each, etc.
This way you can sometimes get undersubscription when only a single process is active, but you will not get oversubscription when more processes are active. I cannot come up with a better solution for now.
Regards,
Andrey
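The static split described above can be sketched as follows. The helper names `threads_per_process` and `apply_static_thread_budget` are hypothetical, the number of concurrent processes is assumed to be known in advance, and the hardware thread count is taken from `std::thread::hardware_concurrency()`, which may report 0 on some systems. The OpenMP call is skipped when the code is built without OpenMP support.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: split the logical processors evenly among
// `concurrent_processes` OpenMP processes, never returning less than 1.
int threads_per_process(int logical_processors, int concurrent_processes) {
    if (logical_processors < 1) logical_processors = 1;      // unknown hardware count
    if (concurrent_processes < 1) concurrent_processes = 1;  // treat bad input as one process
    return std::max(1, logical_processors / concurrent_processes);
}

// Hypothetical helper: call once at startup, before the first parallel region.
// Returns the thread budget that was requested for this process.
int apply_static_thread_budget(int concurrent_processes) {
    int hw = static_cast<int>(std::thread::hardware_concurrency());
    int budget = threads_per_process(hw, concurrent_processes);
    assert(budget >= 1);
#ifdef _OPENMP
    // Sets the number of threads requested for subsequent parallel regions
    // that have no num_threads clause.
    omp_set_num_threads(budget);
#endif
    return budget;
}
```

For example, on a machine with 8 logical processors and two OpenMP processes, each process would request 4 threads. Setting the standard `OMP_NUM_THREADS` environment variable per process should achieve the same split without code changes.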
Thanks Andrey/Vladimir for the detailed response. I was thinking along the same lines: I don't see any solution other than limiting the threads per process and distributing the load accordingly.
_Kittur
If you're latency sensitive, it is likely that only one, or possibly a few, threads of the application require low latency. Here is a sketch of what you can do. Assume for the sake of this argument that you only require the main thread and at most one helper thread of an application to have low latency. Assume you wish to run multiple such processes and are willing to accept the responsibility of not running more such processes than half the number of logical processors. Assume each process is mostly compute bound in those first two threads.
Application startup has a warm-up period where the threads are NOT affinity pinned. Raise the priority of just the first two threads of the process and then run long enough for them to get relocated to two logical processors that are not doing the same thing. Once you are satisfied your threads have been relocated, pin just those two threads to the logical processor each is running on. Then exit the warm-up code.
During normal processing, use dynamic scheduling with a chunk size specified (which you will have to determine), together with nowait. Although all threads can be scheduled, the loop can complete using only the two higher-priority threads.
To reduce the number of unnecessary threads requested for the next parallel region, the threads of the prior region can determine whether (and how much) work each has performed. Then adjust a value to be used for the next parallel region.
This is just a starting point. You may have requirements not clearly stated.
Jim Dempsey
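The scheduling part of the sketch above (dynamic schedule with a chunk size, nowait, and feedback into the next region's thread count) might look roughly like this. Thread priorities and affinity pinning are OS specific and are not shown. The function names, the placeholder work in the loop, and the clamping policy are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: run one dynamically scheduled loop over `data` and
// return how many threads of the team executed at least one iteration.
int process_and_count_workers(std::vector<double>& data, int requested_threads, int chunk) {
    assert(requested_threads >= 1 && chunk >= 1);
    std::vector<int> did_work(requested_threads, 0);  // one flag per possible team member
    const long n = static_cast<long>(data.size());

    #pragma omp parallel num_threads(requested_threads)
    {
        int tid = 0;  // stays 0 when built without OpenMP
#ifdef _OPENMP
        tid = omp_get_thread_num();
#endif
        // Threads grab `chunk` iterations at a time; nowait lets a thread
        // leave the loop without waiting for the others at the loop's end.
        #pragma omp for schedule(dynamic, chunk) nowait
        for (long i = 0; i < n; ++i) {
            data[i] = data[i] * 2.0 + 1.0;  // placeholder for the real work
            did_work[tid] = 1;              // each thread only touches its own flag
        }
    }

    int workers = 0;
    for (int flag : did_work) workers += flag;
    return workers;
}

// Hypothetical policy: request as many threads next time as did work this
// time, clamped to [min_threads, max_threads].
int next_team_size(int workers, int min_threads, int max_threads) {
    return std::max(min_threads, std::min(workers, max_threads));
}
```

A caller would feed the result of `process_and_count_workers` into `next_team_size` and pass that value as `requested_threads` for the following region. Whether a per-thread flag is a good enough measure of "how much work has been performed" depends on the workload; counting chunks per thread would be a finer-grained alternative.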
Jim, that's a good starting point and well put, thx.
_Kittur
OK, first some background. I have Windows 8.1 Pro machines with 8 cores. On 4 of the cores a data acquisition system is running, which must process the data as fast as possible. On the other 4 cores the UI and other post-processing work is running. The complete logic is split across many processes which implement dedicated micro services (e.g. configuration, logging, task scheduler, ...).
If something goes wrong with NtQuerySystemInformation at some point I can follow up with Microsoft to determine the exact conditions. What is the exact call that could fail on Windows?
I could write a repro and check if this consistently fails in another process as well and then take a kernel dump to let the MS guys figure it out. I do not want to work around OS bugs unless it is unavoidable.
@Jim: That is a nice scheduling approach, but I have far too many threads that could be used by a latency-critical UI task, so it would still result in a blocked or stale UI. I would first like to completely understand the issue and then try to fix it. If that fails, I have to measure how badly OpenMP really affects the system. Once that is understood, I need to find a low-impact workaround without introducing too many changes.
Alois
Thanks for the sample. I will let it run in a loop to check if it fails at the same time as our application.
Alois
I have got the condition once again.
The test application reports this:
Illegal info 3: 343597826032...
Illegal info 3: 343597826032...
Illegal info 3: 343597826032...
Illegal info 3: 343597826112...
Illegal info 3: 343597826112...
Illegal info 3: 343597826032...
Illegal info 3: 343597826112...
Illegal info 3: 343597826192...
Illegal info 3: 343597826192...
Illegal info 3: 343597826272...
Illegal info 3: 343597826032...
Illegal info 3: 343597826032...
Illegal info 3: 343597826032...
Illegal info 3: 343597826032...
Illegal info 3: 343597825952...
Illegal info 3: 343597825952...
Illegal info 3: 343597825952...
Illegal info 3: 343597826032...
Illegal info 3: 343597826032...
I let it run in a loop every 2 seconds. Is this output helpful? I will try to get a full kernel dump tomorrow if the situation still persists.
Yours,
Alois Kraus