
Problems with parallel regions in OpenMP

Hello everyone, I have a problem with my code when using the Intel compiler. I have implemented a parallel algorithm using OpenMP. With g++ I get a good speedup, but with the Intel compiler, although the absolute execution time is better than with g++, I get no speedup at all: no matter how many threads I use, I always get the same times. When I compile, the following messages are displayed:

icc -openmp -fast aneto_jajaopt2_omp.cpp -o aneto_jajaopt2_omp -lm

aneto_jajaopt2_omp.cpp(193): warning #267: the format string requires additional arguments
printf("WARNING in %d: Could not set CPU Affinity with CPU[%d]...\\n", myid);
^

aneto_jajaopt2_omp.cpp(261): warning #267: the format string requires additional arguments
printf("WARNING in %d: Could not set CPU Affinity with CPU[%d]...\\n", myid);
^

aneto_jajaopt2_omp.cpp(361): warning #181: argument is incompatible with corresponding format string conversion
scanf("%ld", &num_elements);
^

ipo: remark #11001: performing single-file optimizations
ipo: remark #11005: generating object file /tmp/ipo_icc8tK5Qk.o
aneto_jajaopt2_omp.cpp(468): (col. 3) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
aneto_jajaopt2_omp.cpp(468): (col. 3) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
aneto_jajaopt2_omp.cpp(445): (col. 23) remark: LOOP WAS VECTORIZED.
aneto_jajaopt2_omp.cpp(446): (col. 28) remark: LOOP WAS VECTORIZED.

I have attached the file with my code.

aneto_jajaopt2_omp.cpp

As you can see, I have two parallel regions with OpenMP: one in the LocalRankingPhase function and another in the GlobalRankingPhase function. I printed the thread IDs with omp_get_thread_num() inside each region and saw that they were correct, but I can't understand why I don't get better times with more threads.

Thanks to all.
5 Replies

Hi, a question just out of curiosity: have you tried using the OpenMP run-time routines for measuring performance? I wonder if omp_get_wtick() or omp_get_wtime() gives the same answer as gettimeofday().

M.
TimP
Black Belt

omp_get_wtime() is likely based on the same system information as gettimeofday().
Another curiosity question: did you measure g++ performance only with optimizations off?

Hi. I have measured the g++ performance with -O3, and the Intel performance with -fast and with -O3. I haven't tried omp_get_wtime(). Thanks.
pbkenned1
Employee

Hello Hugo,
I'm not seeing your code scale with either g++ or icc.

For a performance issue, it's important to state what the OS is, what the machine is, and what compiler version you used.

I measured the performance on a 4 way Intel Core i5 box running SLES11 x86_64.

Did I run the program correctly?

My results with g++:

> g++ --version

g++ (GCC) 4.5.0 20090924 (experimental) [trunk revision 152147]

Copyright (C) 2009 Free Software Foundation, Inc.

> g++ -O3 -fopenmp aneto_jajaopt2_omp.cpp -o aneto_jajaopt2_omp

> export OMP_NUM_THREADS=1

> time ./aneto_jajaopt2_omp 1000000 1 100 1

2,

real 1m26.009s

user 1m25.973s

sys 0m0.036s

> export OMP_NUM_THREADS=4

> time ./aneto_jajaopt2_omp 1000000 4 100 1

3,

real 1m35.423s

user 2m3.196s

sys 0m0.044s

My results with icc:

> icc -V

Intel C Intel 64 Compiler Professional for applications running on Intel 64, Version 11.1 Build 20100806 Package ID: l_cproc_p_11.1.073

Copyright (C) 1985-2010 Intel Corporation. All rights reserved.

> icc -O3 -openmp aneto_jajaopt2_omp.cpp -o aneto_jajaopt2_omp -lm -wd267 -wd181

> export OMP_NUM_THREADS=1

> time ./aneto_jajaopt2_omp 1000000 1 100 1

2,

real 1m31.023s

user 1m29.410s

sys 0m0.012s

> export OMP_NUM_THREADS=4

> time ./aneto_jajaopt2_omp 1000000 4 100 1

2,

real 1m34.398s

user 2m5.608s

sys 0m0.844s

>

Thank you,
Patrick Kennedy
Intel Developer Support

jimdempseyatthecove
Black Belt

Hugo,

I have not attempted to run your code, however I have a few comments:


Inside localRankingPhase and globalRankingPhase you have code to (try to) set (pin) thread affinity to omp_get_thread_num(). This is performed inside a for loop, meaning on each iteration you are resetting the affinity to the current affinity (useless code after the first call). You are also not restoring the thread affinities to what they were before. I suggest that you move the pinning code to a parallel region in main, just following the omp_set_num_threads call, or later but prior to calling the function containing the first parallel region intended to be pinned:

#pragma omp parallel
{
    ... // do pinning here
}

A second issue with the code is that you assume "processor" omp_get_thread_num() is in your permitted "processor" list. What would happen to your code should the system administrator set a policy that only code with root privileges has permission to run on "processor" 0?

Your code, as written, assumes "processors" 0:act_num_threads-1 are available. While this may be the case on the preponderance of systems you test your code on, it is not necessarily the case on all systems. Therefore, after getting your code running well on your system, I suggest you enhance the code to pin threads 0:act_num_threads-1 relative to the positions of the available-processor bits: your OMP thread num 0 runs on the least significant permitted "processor", your 1 on the next available "processor", and so on.

Jim Dempsey
