I have here what I understand to be a relatively simple OpenMP construct. The issue is that the program runs about 100-300x faster with 1 thread than with 2 threads. 87% of the program's time is spent in gomp_send_wait() and another 9.5% in gomp_send_post() (compiled with ICC). The program gives correct results, but I wonder if there is a flaw in the code that is causing some resource conflict, or if it is simply that the thread-creation overhead is drastically not worth it for a loop of this size. The thing is, I have seen online examples of much thinner loops being parallelized for a gain, which makes me think there is actually a flaw in the code. As you can see, I have tried to privatize everything that I can. I've been using 2 threads to test this (on an 8-processor testbed). I should also mention that this program is already parallelized with MPI at a different level.
[cpp]
double e_t_sum = 0.0;
double e_in_sum = 0.0;
int nthreads, tid;
[/cpp]
[cpp]
#pragma omp parallel for reduction(+ : e_t_sum, e_in_sum) shared(ee_t) private(tid, i, d_x, d_y, d_z, rr) firstprivate(V_in, t_x, t_y, t_z) lastprivate(nthreads)
for (i = 0; i < c; i++) {
    nthreads = omp_get_num_threads();
    tid = omp_get_thread_num();
    d_x = V_in.x - t_x;
    d_y = V_in.y - t_y;
    d_z = V_in.z - t_z;
    rr = d_x * d_x + d_y * d_y + d_z * d_z;
    ee_t... = energy(rr, V_in.q, V_in.q, V_in.s, V_in.s);  // expression truncated in the original post
    e_t_sum += ee_t;
    e_in_sum += ee_in;
} // end parallel for
[/cpp]
e_t += e_t_sum;
e_t -= e_in_sum;
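One way to tell whether this is plain fork/join and reduction overhead (rather than a bug) is to time just this loop with omp_get_wtime() at 1 and 2 threads. Below is a self-contained micro-benchmark sketch along those lines; the trip count and the loop body are illustrative stand-ins, not the real energy() call:

[cpp]
/* Hypothetical micro-benchmark: run the same reduction loop at 1 and 2
   threads and compare the wall-clock time, to see whether the parallel
   overhead dominates at this trip count.  All names here are illustrative. */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    const int c = 10000;           /* adjust to match the real trip count */
    double e_t_sum;

    for (int nt = 1; nt <= 2; nt++) {
        omp_set_num_threads(nt);
        e_t_sum = 0.0;
        double t0 = omp_get_wtime();

        #pragma omp parallel for reduction(+ : e_t_sum)
        for (int i = 0; i < c; i++) {
            double rr = (double)i * 0.001;
            e_t_sum += 1.0 / (1.0 + rr);   /* stand-in for energy() */
        }

        double t1 = omp_get_wtime();
        printf("%d thread(s): sum=%g, time=%g s\n", nt, e_t_sum, t1 - t0);
    }
    return 0;
}
[/cpp]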
There do appear to be some strange bits in this code that I'd like to ask about. Are nthreads or tid actually used in the parallel for section? Other than being assigned, they don't appear to be used in this fragment. It also looks like V_in and the temporary t_? variables are not modified in this fragment, yet they are declared firstprivate, which forces a private copy of each. That's probably not a problem for the t_? variables, but V_in looks like it might be big and may incur some expense to copy. If you don't need the private copies, it's probably a waste to create them.
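If that is the case, the read-only data can simply be shared instead of copied. The fragment below is a self-contained sketch of that idea; the struct, values, and loop body are illustrative stand-ins, not the poster's actual code:

[cpp]
/* Sketch: a large read-only structure does not need firstprivate --
   sharing it avoids one copy per thread.  Names are illustrative. */
#include <stdio.h>

typedef struct { double x, y, z, q, s; } particle;

int main(void)
{
    particle v = { 1.0, 2.0, 3.0, -1.0, 0.5 };   /* stands in for V_in */
    int c = 1000;
    double sum = 0.0;

    /* v is only read inside the loop, so shared(v) is safe and avoids the
       per-thread copy that firstprivate(v) would make. */
    #pragma omp parallel for reduction(+ : sum) shared(v)
    for (int i = 0; i < c; i++) {
        double d = v.x + v.y + v.z + (double)i;  /* loop-local, hence private */
        sum += d;
    }

    printf("sum = %g\n", sum);
    return 0;
}
[/cpp]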
Hello Masonk,
I agree with the two previous comments.
Pragmatic tricks:
use Intel MPI 3.2 for hybrid MPI/OpenMP and set environment variables as:
export I_MPI_PIN_DOMAIN=omp
export KMP_AFFINITY=verbose,compact
export OMP_NUM_THREADS=xx
and don't overload your machine: OMP_NUM_THREADS * (MPI ranks per node) <= cores in your hardware configuration.
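For example (numbers hypothetical): on an 8-core node you could run 2 MPI ranks with 4 OpenMP threads each, e.g. "export OMP_NUM_THREADS=4" followed by "mpirun -np 2 ./your_app", so that 2 * 4 = 8 does not exceed the core count.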
Programming advice:
use "default(none)" in OpenMP; it forces you to scope every variable explicitly as private or shared (and to ask yourself the right questions for performance improvement) ==> quite good automatic debugging :=)
Cheers.
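As an illustration of the default(none) suggestion, here is a small self-contained sketch (names and values are made up, not taken from the original program):

[cpp]
/* Sketch of default(none): every variable used inside the region must now be
   scoped explicitly, so nothing is shared by accident. */
#include <stdio.h>

int main(void)
{
    int c = 1000;
    double t_x = 0.5;
    double e_t_sum = 0.0;

    #pragma omp parallel for default(none) shared(c, t_x) reduction(+ : e_t_sum)
    for (int i = 0; i < c; i++) {
        double d_x = (double)i - t_x;   /* declared in the loop, so private */
        e_t_sum += d_x * d_x;
    }

    printf("e_t_sum = %g\n", e_t_sum);
    return 0;
}
[/cpp]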
...
Is your energy() function just a simple function, or are there cut-offs that make large differences in calculation times? If the cost per iteration varies significantly, you may need to use the schedule() clause.
Joe
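For reference, a minimal sketch of what a dynamic schedule could look like; the chunk size and the stand-in workload are illustrative only:

[cpp]
/* Sketch of schedule(dynamic): iterations are handed out in chunks as threads
   become free, which helps when per-iteration cost varies (e.g. cut-offs
   inside energy()).  Names and the workload are illustrative. */
#include <stdio.h>
#include <math.h>

int main(void)
{
    int c = 100000;
    double e_t_sum = 0.0;

    #pragma omp parallel for reduction(+ : e_t_sum) schedule(dynamic, 64)
    for (int i = 0; i < c; i++) {
        /* stand-in for energy(): cheap for most iterations, expensive for some */
        double v = (i % 100 == 0) ? sin((double)i) * cos((double)i)
                                  : (double)i * 1e-6;
        e_t_sum += v;
    }

    printf("e_t_sum = %g\n", e_t_sum);
    return 0;
}
[/cpp]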