Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

Unable to exploit multicore processing after OpenMP program is compiled

skiffzzz
Beginner
975 Views
Dear all, I am using IVF 11.0.072 in MS Visual Studio 2008, and I have enabled "Generate Parallel Code (/Qopenmp)".

After I inserted the OpenMP directive (!$omp parallel do) on the loop, the build output reported "OpenMP DEFINED LOOP WAS PARALLELIZED". However, CPU usage is still only 50%, and I am using a dual-core CPU, so there is no speedup at all. I don't understand why this happens.
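A minimal sketch of the pattern I am using (placeholder array names, not my actual code):

```fortran
program omp_demo
  use omp_lib
  implicit none
  integer, parameter :: n = 1000000
  real :: a(n), b(n)
  integer :: i

  b = 1.0
  ! The directive the compiler reports as parallelized:
  !$omp parallel do
  do i = 1, n
     a(i) = 2.0 * b(i) + sin(real(i))
  end do
  !$omp end parallel do

  print *, 'threads available: ', omp_get_max_threads()
end program omp_demo
```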

Thanks so much.
Skiff
5 Replies
Alexander_C_Intel
Quoting - skiffzzz

Hi Skiff,

Are you sure that the loop you have parallelized is a hot loop, i.e., that your program spends most of its time there? If you optimize a piece of code that is not intensively executed, you may not see a significant improvement.

Also, what is the trip count of that loop? Loops with a low trip count are usually not good candidates for parallelization, because the thread startup overhead can outweigh the work.
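For example, you can tell the runtime to skip the parallel region when the trip count is too small by using the if clause. A sketch (the threshold 10000 is arbitrary, measure on your machine to pick a good value):

```fortran
program trip_count_demo
  implicit none
  integer, parameter :: n = 500   ! small trip count
  real :: a(n), b(n)
  integer :: i

  a = 0.0
  b = 1.0
  ! Run serially when the loop is too short to amortize
  ! the cost of waking and scheduling the worker threads.
  !$omp parallel do if (n >= 10000)
  do i = 1, n
     a(i) = a(i) + b(i)
  end do
  !$omp end parallel do

  print *, sum(a)
end program trip_count_demo
```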

-Alex
TimP
Honored Contributor III

Quoting - Alexander_C_Intel
Alex raises important points.
Did you look at the report stored in guide.gvs when you link with /Qopenmp-profile? It shows how much time each thread spends in the parallel loop and how much of that is accounted for by the various overhead categories. Use it to check whether your KMP_AFFINITY settings are working, particularly if you have a platform with multiple caches. You can also get this information in a graphic display by running Thread Profiler, or by importing guide.gvs into VTune.
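The build and run steps look roughly like this (Windows command line; option spellings from memory, so check the compiler documentation):

```
rem Build with OpenMP and the OpenMP profiling library:
ifort /Qopenmp /Qopenmp-profile myprog.f90 /exe:myprog.exe

rem Optionally pin threads before running; after the run
rem completes, look for guide.gvs in the working directory:
set KMP_AFFINITY=verbose,compact
myprog.exe
```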
You don't indicate whether you checked that the loop is vectorized in both the OpenMP and non-OpenMP cases. First, an omp parallel directive prevents the compiler from performing loop-interchange optimizations. Second, the compiler has been known to drop vectorization when parallelizing. People who want to brag about parallel speedup often defeat vectorization intentionally (or even set /Od) so as to get a lower-performing baseline for comparison.
If you have a NUMA platform (multiple-socket AMD or Nehalem), data locality must be considered. If the data are initialized in a separate loop from the main working loop, both loops should be parallelized with identical allocations of data to threads. Cases which perform well with schedule(guided) or schedule(dynamic) on a non-NUMA platform may instead benefit from explicitly allocating equal chunks of contiguous work to each thread.
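A sketch of what I mean by matching the two loops: give the initialization loop the same schedule(static) as the compute loop, so on a NUMA system each thread first-touches, and therefore places, the pages it will later work on.

```fortran
program first_touch_demo
  implicit none
  integer, parameter :: n = 4000000
  real, allocatable :: a(:), b(:)
  integer :: i

  allocate(a(n), b(n))

  ! Initialize in parallel with the same static schedule as the
  ! compute loop below, so each thread touches its own chunk first.
  !$omp parallel do schedule(static)
  do i = 1, n
     a(i) = 0.0
     b(i) = real(i)
  end do
  !$omp end parallel do

  ! Compute loop: identical schedule, so each thread works on
  ! the memory it placed during initialization.
  !$omp parallel do schedule(static)
  do i = 1, n
     a(i) = 2.0 * b(i)
  end do
  !$omp end parallel do

  print *, a(n)
end program first_touch_demo
```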
skiffzzz
Beginner

Quoting - Alexander_C_Intel
Thanks Alex,
I checked the amount of computation inside the loop and found that it does not account for the biggest workload. When I increased the amount of computation in the loop, CPU usage rose to 100%.
Now I am wondering how to judge whether a piece of code is intensively executed, and what trip count makes a loop a good candidate for parallelization. If a parallelizable loop's workload can't match that of the other, sequential loops in the code, does that mean the loop is not worth running in parallel?
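For now I am trying to estimate the loop's share of the run time by hand, roughly like this (a sketch, not my actual program):

```fortran
program hot_loop_timer
  use omp_lib
  implicit none
  integer, parameter :: n = 2000000
  real :: a(n)
  double precision :: t0, t1, t_loop, t_total
  integer :: i

  ! Time the rest of the program (stand-in work here).
  t0 = omp_get_wtime()
  a = 1.0
  t1 = omp_get_wtime()
  t_total = t1 - t0

  ! Time the candidate parallel loop separately.
  t0 = omp_get_wtime()
  !$omp parallel do
  do i = 1, n
     a(i) = sqrt(a(i)) + 0.5
  end do
  !$omp end parallel do
  t1 = omp_get_wtime()
  t_loop = t1 - t0
  t_total = t_total + t_loop

  print *, 'fraction of time in the loop: ', t_loop / t_total
end program hot_loop_timer
```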

skiffzzz
Beginner
Quoting - tim18
Hi Tim, can you tell me where I can find guide.gvs? I am using Windows XP and Visual Studio 2008. Do I have to install some other software to get this information?
Following Alex's advice, I found that when I increased the amount of computation in the loop, it finally ran in parallel.
TimP
Honored Contributor III
When you have linked with /Qopenmp-profile, guide.gvs is written to the current working directory when the run completes. It is a readable text file, or it can be plotted the old way by importing it into VTune. When linking with either /Qopenmp or /Qopenmp-profile, graphic data can be collected by the Thread Profiler add-on to VTune. None of those graphs tells you any more than the text file does, although they help you make PowerPoint slides quicker.