Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

Unable to exploit multicore processing after OpenMP program is compiled

skiffzzz
Beginner
975 Views
Dear all, I am using IVF 11.0.072 in MS Visual Studio 2008, and I have enabled "Generate Parallel Code (/Qopenmp)".

After I inserted the OpenMP directive (!$omp parallel do) on the loop, the build output reported "OpenMP DEFINED LOOP WAS PARALLELIZED". However, CPU usage is still only 50%, and I am using a dual-core CPU, so there is no speedup at all. I don't understand why this happens.
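A minimal sketch of the pattern I am using (placeholder array names, not my actual code):

```fortran
program omp_demo
  use omp_lib
  implicit none
  integer, parameter :: n = 1000000
  real :: a(n), b(n)
  integer :: i

  b = 1.0
  ! The directive the compiler reports as parallelized:
  !$omp parallel do
  do i = 1, n
     a(i) = 2.0 * b(i) + sin(real(i))
  end do
  !$omp end parallel do

  print *, 'threads available: ', omp_get_max_threads()
end program omp_demo
```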

Thanks so much.
Skiff
5 Replies
Alexander_C_Intel
Quoting - skiffzzz

Hi Skiff,

Are you sure that the loop you have parallelized is a hot loop, i.e., that your program spends most of its time there? If you optimize a piece of code that is not intensively executed, you may not see a significant improvement.

Also, what is the trip count of that loop? Loops with a low trip count are usually not good candidates for parallelization, because the thread startup overhead can outweigh the work.
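For example, you can tell the runtime to skip the parallel region when the trip count is too small by using the if clause. A sketch (the threshold 10000 is arbitrary, measure on your machine to pick a good value):

```fortran
program trip_count_demo
  implicit none
  integer, parameter :: n = 500   ! small trip count
  real :: a(n), b(n)
  integer :: i

  a = 0.0
  b = 1.0
  ! Run serially when the loop is too short to amortize
  ! the cost of waking and scheduling the worker threads.
  !$omp parallel do if (n >= 10000)
  do i = 1, n
     a(i) = a(i) + b(i)
  end do
  !$omp end parallel do

  print *, sum(a)
end program trip_count_demo
```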

-Alex
TimP
Honored Contributor III

Quoting - Alexander_C_Intel
Alex raises important points.
Did you look at the report stored in guide.gvs when you link with /Qopenmp-profile? It shows how much time each thread spends in the parallel loop and how much of that is accounted for by the various overhead categories. Use it to check whether your KMP_AFFINITY settings are working, particularly if you have a platform with multiple caches. You can also get this information in a graphic display by running Thread Profiler, or by importing guide.gvs into VTune.
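The build and run steps look roughly like this (Windows command line; option spellings from memory, so check the compiler documentation):

```
rem Build with OpenMP and the OpenMP profiling library:
ifort /Qopenmp /Qopenmp-profile myprog.f90 /exe:myprog.exe

rem Optionally pin threads before running; after the run
rem completes, look for guide.gvs in the working directory:
set KMP_AFFINITY=verbose,compact
myprog.exe
```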
You don't indicate whether you checked that the loop is vectorized in both the OpenMP and non-OpenMP cases. First, an omp parallel directive prevents the compiler from performing loop-interchange optimizations. Second, the compiler has been known to drop vectorization when parallelizing. People who want to brag about parallel speedup often defeat vectorization intentionally (or even set /Od) so as to get a lower-performing baseline for comparison.
If you have a NUMA platform (multiple-socket AMD or Nehalem), data locality must be considered. If the data are initialized in a separate loop from the main working loop, both loops should be parallelized with identical allocations of data to threads. Cases which perform well with schedule(guided) or schedule(dynamic) on a non-NUMA platform may instead benefit from explicitly allocating equal chunks of contiguous work to each thread.
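A sketch of what I mean by matching the two loops: give the initialization loop the same schedule(static) as the compute loop, so on a NUMA system each thread first-touches, and therefore places, the pages it will later work on.

```fortran
program first_touch_demo
  implicit none
  integer, parameter :: n = 4000000
  real, allocatable :: a(:), b(:)
  integer :: i

  allocate(a(n), b(n))

  ! Initialize in parallel with the same static schedule as the
  ! compute loop below, so each thread touches its own chunk first.
  !$omp parallel do schedule(static)
  do i = 1, n
     a(i) = 0.0
     b(i) = real(i)
  end do
  !$omp end parallel do

  ! Compute loop: identical schedule, so each thread works on
  ! the memory it placed during initialization.
  !$omp parallel do schedule(static)
  do i = 1, n
     a(i) = 2.0 * b(i)
  end do
  !$omp end parallel do

  print *, a(n)
end program first_touch_demo
```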
skiffzzz
Beginner

Quoting - Alexander_C_Intel
Thanks Alex,
I checked the amount of computation inside the loop and found that it does not account for the biggest workload. When I increased the amount of computation in the loop, CPU usage rose to 100%.
Now I am wondering how to judge whether a piece of code is intensively executed, and what trip count makes a loop a good candidate for parallelization. If a parallelizable loop's workload can't match that of the other, sequential loops in the code, does that mean the loop is not worth running in parallel?
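For now I am trying to estimate the loop's share of the run time by hand, roughly like this (a sketch, not my actual program):

```fortran
program hot_loop_timer
  use omp_lib
  implicit none
  integer, parameter :: n = 2000000
  real :: a(n)
  double precision :: t0, t1, t_loop, t_total
  integer :: i

  ! Time the rest of the program (stand-in work here).
  t0 = omp_get_wtime()
  a = 1.0
  t1 = omp_get_wtime()
  t_total = t1 - t0

  ! Time the candidate parallel loop separately.
  t0 = omp_get_wtime()
  !$omp parallel do
  do i = 1, n
     a(i) = sqrt(a(i)) + 0.5
  end do
  !$omp end parallel do
  t1 = omp_get_wtime()
  t_loop = t1 - t0
  t_total = t_total + t_loop

  print *, 'fraction of time in the loop: ', t_loop / t_total
end program hot_loop_timer
```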

skiffzzz
Beginner
Quoting - tim18
Hi Tim, can you tell me where I can find guide.gvs? I am using Windows XP and Visual Studio 2008. Do I have to install some other software to get this information?
Following Alex's advice, I found that when I increased the amount of computation in the loop, it finally ran in parallel.
TimP
Honored Contributor III
When you have linked with /Qopenmp-profile, guide.gvs is written to the current working directory when the run completes. It is a readable text file, or it can be plotted the old way by importing it into VTune. When linking with either /Qopenmp or /Qopenmp-profile, graphic data can be collected by the Thread Profiler add-on to VTune. None of those graphs tells you any more than the text file does, although they help you make PowerPoint slides quicker.