icc -openmp -fast aneto_jajaopt2_omp.cpp -o aneto_jajaopt2_omp -lm
aneto_jajaopt2_omp.cpp(193): warning #267: the format string requires additional arguments
printf("WARNING in %d: Could not set CPU Affinity with CPU[%d]...\n", myid);
^
aneto_jajaopt2_omp.cpp(261): warning #267: the format string requires additional arguments
printf("WARNING in %d: Could not set CPU Affinity with CPU[%d]...\n", myid);
^
aneto_jajaopt2_omp.cpp(361): warning #181: argument is incompatible with corresponding format string conversion
scanf("%ld", &num_elements);
^
ipo: remark #11001: performing single-file optimizations
ipo: remark #11005: generating object file /tmp/ipo_icc8tK5Qk.o
aneto_jajaopt2_omp.cpp(468): (col. 3) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
aneto_jajaopt2_omp.cpp(468): (col. 3) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
aneto_jajaopt2_omp.cpp(445): (col. 23) remark: LOOP WAS VECTORIZED.
aneto_jajaopt2_omp.cpp(446): (col. 28) remark: LOOP WAS VECTORIZED.
I enclose my file with the code.
aneto_jajaopt2_omp.cpp
As you can see, I have two OpenMP parallel regions: one in the LocalRankingPhase function and another in the GlobalRankingPhase function. I printed the thread IDs with omp_get_thread_num() inside each region and they were correct, but I can't understand why the run time doesn't improve with more threads.
Thanks for all
M.
Another curiosity question: did you measure g++ performance only with optimizations off?
I'm not seeing your code scale with either g++ or icc.
It's important for a performance issue to state what the OS is, what the machine is, and what compiler version you used.
I measured the performance on a 4 way Intel Core i5 box running SLES11 x86_64.
Did I run the program correctly?
My results with g++:
> g++ --version
g++ (GCC) 4.5.0 20090924 (experimental) [trunk revision 152147]
Copyright (C) 2009 Free Software Foundation, Inc.
> g++ -O3 -fopenmp aneto_jajaopt2_omp.cpp -o aneto_jajaopt2_omp
> export OMP_NUM_THREADS=1
> time ./aneto_jajaopt2_omp 1000000 1 100 1
2,
real 1m26.009s
user 1m25.973s
sys 0m0.036s
> export OMP_NUM_THREADS=4
> time ./aneto_jajaopt2_omp 1000000 4 100 1
3,
real 1m35.423s
user 2m3.196s
sys 0m0.044s
My results with icc:
> icc -V
Intel C Intel 64 Compiler Professional for applications running on Intel 64, Version 11.1 Build 20100806 Package ID: l_cproc_p_11.1.073
Copyright (C) 1985-2010 Intel Corporation. All rights reserved.
> icc -O3 -openmp aneto_jajaopt2_omp.cpp -o aneto_jajaopt2_omp -lm -wd267 -wd181
> export OMP_NUM_THREADS=1
> time ./aneto_jajaopt2_omp 1000000 1 100 1
2,
real 1m31.023s
user 1m29.410s
sys 0m0.012s
> export OMP_NUM_THREADS=4
> time ./aneto_jajaopt2_omp 1000000 4 100 1
2,
real 1m34.398s
user 2m5.608s
sys 0m0.844s
>
Thank you,
Patrick Kennedy
Intel Developer Support
I have not attempted to run your code, however I have a few comments:
Inside localRankingPhase and globalRankingPhase you have code that tries to set (pin) the thread affinity to omp_get_thread_num(). This is performed inside a for loop, meaning that on each iteration you are resetting the affinity to the current affinity (useless code after the first call). You are also not restoring the thread affinities to what they were before. I suggest that you move the pinning code to a parallel region in main, just after the omp_set_num_threads call, or later but before calling the function that contains the first parallel region intended to be pinned.
#pragma omp parallel
{
... // do pinning here
}
A second issue with the code is that you assume "processor" omp_get_thread_num() is in your permitted "processor" list. What would happen to your code should the system administrator set a policy that only code with root privileges has permission to run on "processor" 0?
Your code, as written, assumes "processors" 0:act_num_threads-1 are available. While this may be the case on the preponderance of systems you test your code on, it is not necessarily the case on all systems. Therefore, after getting your code running well on your system, I suggest you enhance the code to pin threads 0:act_num_threads-1 relative to the bit positions of the available processors, e.g. your OpenMP thread 0 runs on the least significant permitted "processor", your thread 1 on the next available "processor", and so on.
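The two suggestions above (pin once, and pin relative to the permitted processors) could be sketched roughly as follows. This is only an illustration and not taken from the attached file: it assumes Linux with glibc (sched_getaffinity / sched_setaffinity), and the helper names nth_allowed_cpu, pin_to_nth_allowed_cpu and pin_all_threads are made up for the example. It also assumes the OpenMP runtime keeps reusing the same worker threads for later parallel regions, which is typical but not guaranteed.

```cpp
// Sketch only: Linux/glibc specific, helper names are hypothetical.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sched.h>
#include <cstdio>
#ifdef _OPENMP
#include <omp.h>
#endif

// Return the id of the n-th CPU (counting from 0) present in `allowed`.
// Wraps around when there are more threads than permitted CPUs.
// Returns -1 if the set is empty or n is negative.
int nth_allowed_cpu(const cpu_set_t &allowed, int n)
{
    const int count = CPU_COUNT(&allowed);
    if (count <= 0 || n < 0) return -1;
    int wanted = n % count;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &allowed)) {
            if (wanted == 0) return cpu;
            --wanted;
        }
    }
    return -1;
}

// Pin the calling thread to the n-th permitted CPU.
// Returns the chosen CPU id, or -1 on failure.
int pin_to_nth_allowed_cpu(const cpu_set_t &allowed, int n)
{
    const int cpu = nth_allowed_cpu(allowed, n);
    if (cpu < 0) return -1;
    cpu_set_t one;
    CPU_ZERO(&one);
    CPU_SET(cpu, &one);
    // pid 0 means "the calling thread" on Linux.
    return sched_setaffinity(0, sizeof(one), &one) == 0 ? cpu : -1;
}

// Call once from main, after omp_set_num_threads() and before the first
// parallel region that should run pinned.
void pin_all_threads()
{
    cpu_set_t allowed;
    CPU_ZERO(&allowed);
    // Read the permitted set before any thread narrows its own mask.
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0) {
        std::perror("sched_getaffinity");
        return;
    }
#ifdef _OPENMP
    #pragma omp parallel
    {
        const int myid = omp_get_thread_num();
        if (pin_to_nth_allowed_cpu(allowed, myid) < 0)
            std::printf("WARNING in %d: could not set CPU affinity\n", myid);
    }
#else
    // Built without OpenMP: only the single main thread exists.
    pin_to_nth_allowed_cpu(allowed, 0);
#endif
}
```

With something like this in place, the affinity calls inside the for loops of the two ranking functions could be dropped, and the warning is printed with as many arguments as the format string expects.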
Jim Dempsey