Re: OMP efficiency

tihomir · 10-12-2006

I am running an algorithm that is easy to split into two threads on a two-processor machine. The task manager shows that both processors are loaded 100% of the time. There is little communication between the two threads; once in a while they have to coordinate. For a computation that takes, say, a minute, there is a small information exchange about 10 times. I expected nearly a 50% reduction in computation time compared to running the algorithm on a single processor, but this didn't happen: I only got a 20% reduction. When I split the algorithm into two separate executables I do get a 50% reduction. Any ideas where the remaining 30% goes?

Steven_L_Intel1 · 10-12-2006

This is a job for the Intel Thread Profiler. It will show you exactly what is going on and when/where your threads are stalling. Last week I took some training using this and I was very impressed.

tihomir · 10-26-2006

Steve

Thanks for the suggestion. I found that two OMP functions take about 20% of the time (the remaining 10% seems to be spread across various smaller places):

__kmp_x86_pause

__kmp_wait_sleep

Is there anything I can do to minimize the impact of these two?

By the way, do you know how to get total clock ticks for the functions including the calls within the function?

Thanks a lot for any insight.

Tihomir

tihomir · 10-26-2006

Actually, a bit more info on my situation. The two threads are absolutely symmetric and should take exactly the same time to finish, so there should be no waiting or pausing anywhere. Could I benefit from sending the threads to a prespecified CPU?

Steven_L_Intel1 · 10-27-2006

Intel VTune can get you clock ticks per function. The Thread Profiler can show you where threads are waiting and the flow of control.
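For a quick check without a profiler, inclusive time for a routine (the routine plus everything it calls) can also be measured by hand. The sketch below uses the standard OpenMP routine omp_get_wtime, which reports wall-clock seconds rather than clock ticks; the routine name work and the loop counts are hypothetical placeholders, not anything from the original application.

```fortran
program time_inclusive
  use omp_lib
  implicit none
  double precision :: t0, elapsed, total
  integer :: i

  total = 0.0d0

  ! Time the calling loop: this includes everything work() does internally.
  t0 = omp_get_wtime()
  do i = 1, 1000
     total = total + work(i)
  end do
  elapsed = omp_get_wtime() - t0

  print '(a,f12.6,a)', 'inclusive wall time: ', elapsed, ' s'
  print *, 'checksum:', total   ! printed so the loop is not optimized away

contains

  ! Hypothetical stand-in for the routine being measured.
  function work(n) result(s)
    integer, intent(in) :: n
    double precision :: s
    integer :: k
    s = 0.0d0
    do k = 1, n
       s = s + sqrt(dble(k))
    end do
  end function work

end program time_inclusive
```

This only gives a coarse number for one call site; VTune remains the tool for a per-function breakdown.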

tihomir · 10-27-2006

Steve

I think there is a problem with the efficiency of OMP. I lose 20% of the CPU to some wait and pause routines. That means that my second CPU is working at 60% efficiency. For an algorithm that can simply be run as two separate executables, this doesn't look good at all.

Could you advise me on where I can submit an example of this OMP-related problem?

Thank you

Tihomir

jimdempseyatthecove · 10-27-2006

Tihomir,

Can you diagram the !$OMP directives of your application?

In my experience on a 4-core system, if you can get the parallel sections to perform a sizable amount of work for each thread, then the overhead is relatively small.

Since you can run two instances of your application try the following "test". If you find it satisfactory then you have a good starting point for adapting your application.

Make the PROGRAM simply perform

program YourProgramName
use omp_lib
!$OMP PARALLEL
call YourProgramNameRenamed(OMP_GET_THREAD_NUM())
!$OMP END PARALLEL
end program YourProgramName

subroutine YourProgramNameRenamed(instance)
integer :: instance
! your old program here
! slightly modified to use "instance" for data set selection
! No OpenMP directives (yet)
...
end subroutine YourProgramNameRenamed

If you get satisfactory runtimes then start inserting OpenMP directives into other sections.

Use a profiler:
Intel VTune
AMD CodeAnalyst (I use this one)

Don't go overboard by parallelizing every loop. Work from the outside in if possible.

OpenMP works quite well with coarse granularity (long run times between context switches).

With my simulation application I can keep the 4 cores running at ~85%. There are some places I can still tweak to get a few more percent. At some point you reach diminishing returns.

I am developing a new technique that will help an OpenMP application (as well as other threaded techniques) use finer granularity without suffering undue overhead from operating system calls for thread management. When I get the code firmed up and a suitable set of test data, I will write a white paper and place it on this forum (assuming Intel won't object).

Jim Dempsey

Steven_L_Intel1 · 10-27-2006

Tihomir,

If you believe there is a product problem, you can submit an issue to Intel Premier Support. However, it is not clear that the issue is the OpenMP implementation, and I think you should spend some more time analyzing the application behavior.

tihomir · 11-03-2006

Hi,

Just a little update on this. One of the issues is that the chunk of code that is done in parallel is relatively small. In 1 minute I started 500 independent threads, and that seems to be the issue. If I group the threads in chunks of 10, I start 50 independent pairs of threads in 1 minute and get much better efficiency. This is in agreement with Jim Dempsey's post.
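A minimal sketch of that grouping, under assumed numbers: 500 small work units are handled in batches of 10, so the parallel region is entered 50 times instead of 500. The names small_task, n_steps and batch are hypothetical placeholders, and the reduction is just one way to collect results without the two threads writing to neighbouring memory.

```fortran
program batch_parallel_regions
  use omp_lib
  implicit none
  integer, parameter :: n_steps = 500   ! hypothetical number of small work units
  integer, parameter :: batch   = 10    ! work units handled per parallel region
  double precision :: total
  integer :: first, step, tid, n_regions

  total = 0.0d0
  n_regions = 0

  ! One parallel region per batch, not per work unit.
  do first = 1, n_steps, batch
     !$omp parallel num_threads(2) private(tid, step) reduction(+:total)
     tid = omp_get_thread_num()
     do step = first, min(first + batch - 1, n_steps)
        total = total + small_task(step, tid)
     end do
     !$omp end parallel
     n_regions = n_regions + 1
     ! Any coordination between the two halves would go here, once per batch.
  end do

  print *, 'parallel regions entered:', n_regions
  print *, 'result:', total

contains

  ! Hypothetical stand-in for one thread's share of a single work unit.
  pure function small_task(step, tid) result(s)
    integer, intent(in) :: step, tid
    double precision :: s
    integer :: k
    s = 0.0d0
    do k = 1, 1000
       s = s + sin(dble(step + k + tid))
    end do
  end function small_task

end program batch_parallel_regions
```

The batch size trades fork/join overhead against how often the threads can exchange information, so it only applies where coordination can wait until the end of a batch.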

A second issue that came up is that there is a difference between multithreading processors and dual-core processors. The two routines

__kmp_x86_pause

__kmp_wait_sleep

are not actually consuming CPU on the multithreading CPU, but they are on the dual core. Any insights into this difference?

Tihomir

jimdempseyatthecove · 11-03-2006

This may be an oversight in the OpenMP library. Look up

FUNCTION KMP_SET_BLOCKTIME(msec)
INTEGER msec

Try using 0 for the block time.
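A minimal sketch of what that might look like, assuming the Intel runtime's omp_lib exposes KMP_SET_BLOCKTIME so that it can be invoked with CALL (check the omp_lib shipped with your compiler). The block time is how long an idle thread spin-waits before going to sleep; with 0 it should sleep right away. This is an Intel extension, so the sketch is not expected to link against other OpenMP runtimes that lack the kmp_ routines.

```fortran
program blocktime_zero
  use omp_lib
  implicit none
  integer :: tid

  ! Intel-specific extension (assumption: available through omp_lib).
  ! 0 ms: idle threads sleep immediately instead of spin-waiting.
  call kmp_set_blocktime(0)

  !$omp parallel private(tid)
  tid = omp_get_thread_num()
  !$omp critical
  print *, 'thread', tid, 'running'
  !$omp end critical
  !$omp end parallel

end program blocktime_zero
```

The same setting can be tried without recompiling by setting the KMP_BLOCKTIME environment variable to 0 before running the program.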