Steve
Thanks for the suggestion. I found that two OpenMP runtime functions take about 20% of the time (the remaining 10% seems to be spread across various smaller places):
__kmp_x86_pause
__kmp_wait_sleep
Is there anything I can do to minimize the impact of these two?
By the way, do you know how to get total clock ticks for a function including the calls it makes (inclusive time)?
Thanks a lot for any insight.
Tihomir
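A note on those two symbols: __kmp_x86_pause and __kmp_wait_sleep come from the Intel OpenMP runtime's spin-wait loop. Idle threads spin for KMP_BLOCKTIME milliseconds (200 by default) at barriers and at the end of parallel regions before going to sleep, and that spinning shows up as CPU time in a profiler. A minimal sketch of the usual mitigation, assuming the Intel runtime:

```shell
# Intel OpenMP runtime: idle threads spin-wait for KMP_BLOCKTIME ms
# (default 200) before sleeping. Setting it to 0 makes them yield the
# CPU immediately, at the cost of a slower wake-up at the next region.
export KMP_BLOCKTIME=0
```

The same setting is available at runtime from Fortran or C via kmp_set_blocktime, as discussed later in this thread.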
Actually, a bit more info on my situation. The two threads are completely symmetric and should take exactly the same time to finish, so there should be no waiting or pausing anywhere. Could I benefit from sending the threads to prespecified CPUs?
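On pinning threads to specific CPUs: with the Intel runtime, the KMP_AFFINITY environment variable can bind OpenMP threads to fixed logical processors, which keeps the OS from migrating the two symmetric threads between cores. A minimal sketch (the proclist values are an assumption for a two-core machine; adjust to your topology):

```shell
# Run two OpenMP threads and bind them to logical CPUs 0 and 1
# (KMP_AFFINITY is an Intel runtime extension).
export OMP_NUM_THREADS=2
export KMP_AFFINITY="granularity=fine,proclist=[0,1],explicit"
```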
Steve
I think there is a problem with the efficiency of OMP. I lose 20% of the CPU to some wait and pause routines. That means my second CPU is working at 60% efficiency. For an algorithm that can simply be run as two separate executables, this doesn't look good at all.
Could you advise me on where I can submit an example of this OMP-related problem?
Thank you
Tihomir
Tihomir,
Can you diagram the !$OMP directives of your application?
From my experience on a 4-core system, if you can get the parallel sections to perform a sizable amount of work for each thread, then the overhead is relatively small.
Since you can run two instances of your application, try the following "test". If you find it satisfactory, then you have a good starting point for adapting your application.
Make the PROGRAM simply perform:
program YourProgramName
  use omp_lib
  !$OMP PARALLEL
  call YourProgramNameRenamed(OMP_GET_THREAD_NUM())
  !$OMP END PARALLEL
end program YourProgramName

subroutine YourProgramNameRenamed(instance)
  integer :: instance
  ! your old program here,
  ! slightly modified to use "instance" for data set selection
  ! No OpenMP directives (yet)
  ...
end subroutine YourProgramNameRenamed
If you get satisfactory runtimes, then start inserting OpenMP directives into other sections.
Use a profiler:
- Intel VTune
- AMD CodeAnalyst (I use this one)
Don't go overboard by parallelizing every loop - work from the outside in if possible.
OpenMP is quite good with coarse granularity (long run times between context switches).
With my simulation application I can keep the 4 cores running at ~85%. There are some places I can still tweak to get a few more percent. At some point you reach diminishing returns.
I am looking at a new technique I am developing that will help an OpenMP application (as well as other threading techniques) use finer granularity without suffering undue overhead from operating system calls for thread management. When I get the code firmed up and have a suitable set of test data, I will write a white paper and post it on this forum (assuming Intel won't object).
Jim Dempsey
Tihomir,
If you believe there is a product problem, you can submit an issue to Intel Premier Support. However, it is not clear that the issue is in the OpenMP implementation, and I think you should spend some more time analyzing the application's behavior.
Hi,
Just a little update on this. One of the issues is that the chunk of code that is done in parallel is relatively small. In 1 minute I was starting 500 independent threads, and that seems to be the issue: if I group the work in chunks of 10, I start only 50 independent pairs of threads per minute and get much better efficiency. This is in agreement with Jim Dempsey's post.
A second issue that came up is that there is a difference between multithreading (HyperThreading) processors and dual-core processors. The two routines
__kmp_x86_pause
__kmp_wait_sleep
are not consuming CPU on the multithreading CPU, but they are on the dual-core one. Any insights into this difference?
Tihomir
SUBROUTINE KMP_SET_BLOCKTIME(msec)
INTEGER msec
Try using 0 for the block time; the threads will then go to sleep immediately instead of spin-waiting.