Terrible OpenMP performance - a simple problem of overhead, or a sinister logic flaw?

masonk · ‎05-15-2009

Hi there. First post. Just to disclaim this: I am a pretend programmer. I'm a biophysics researcher. But, I make use of the marvelous hardware and software that Intel provides in order to do biophysics, and that's why I'm here. I hope I'm in the right place.

I have here what I understand to be a relatively simple OpenMP construct. The issue is that the program runs about 100-300x faster with 1 thread than with 2 threads. 87% of the run time is spent in gomp_send_wait() and another 9.5% in gomp_send_post() (compiled with ICC). The program gives correct results, but I wonder if there is a flaw in the code that is causing some resource conflict, or if the overhead of thread creation is simply not worth it for a loop of this size. The thing is, I have seen examples in online resources of much thinner loops that were parallelized for a gain. This makes me think that there is actually a flaw in the code. As you can see, I have tried to privatize everything that I can. I've been using 2 threads to test this (on an 8 processor testbed). Oh, and I should mention that this program is already running in MPI on a different level of parallelization.

[cpp]
double e_t_sum = 0.0;
double e_in_sum = 0.0;

int nthreads, tid;
[/cpp]

[cpp]
#pragma omp parallel for reduction(+ : e_t_sum, e_in_sum) shared(ee_t) private(tid, i, d_x, d_y, d_z, rr) firstprivate(V_in, t_x, t_y, t_z) lastprivate(nthreads)
for (i = 0; i < c; i++) {
    nthreads = omp_get_num_threads();
    tid = omp_get_thread_num();

    d_x = V_in.x - t_x;
    d_y = V_in.y - t_y;
    d_z = V_in.z - t_z;

    rr = d_x * d_x + d_y * d_y + d_z * d_z;

    ee_t = energy(rr, V_in.q, V_in.q, V_in.s, V_in.s);
    e_t_sum += ee_t;
    e_in_sum += ee_in;
} // end parallel for
[/cpp]

...

e_t += e_t_sum;
e_t -= e_in_sum;

TimP · ‎05-16-2009

If you are running "hybrid" MPI/OpenMP, the MPI standard, and some implementations, require the use of MPI_Init_thread() to request a thread support level and check which level is actually provided. You also need an appropriate affinity scheme. For example, if you have 2 MPI processes on a dual-CPU node, all threads of each process would be assigned to the cores of one CPU. If your MPI has no built-in provision for affinity under your threading model (current Intel MPI and HP MPI do have one), you must arrange non-conflicting affinities for each process yourself, or be limited to 1 MPI process per node. Arranging for data locality (and preferably vectorizability) of inner loops is likely more important than in plain OpenMP. In fact, you would likely gain more from vectorizing first, before adding threading to your MPI code.

robert-reed · ‎05-17-2009

There do appear to be some strange bits in this code that I'd like to ask about. Are nthreads or tid actually used in the parallel for section? Other than being assigned, they don't appear to be used in this fragment. It also looks like V_in and the temporary t_? variables are not modified in this fragment, yet they are declared firstprivate, which forces a private copy of each for every thread. That is probably not a problem for the t_? variables, but V_in looks like it might be big and may incur some expense to copy. If you don't need the private copies, it's probably a waste to create them.
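
A minimal sketch of that scoping, under loudly stated assumptions: the `Particle` struct, the `pair_energy_sum` name, the per-particle `others` array, and the body of `energy()` are all hypothetical stand-ins, since the original post shows none of them. The read-only inputs are simply left shared (no firstprivate copies), the per-iteration temporaries are declared inside the loop body so they are private automatically, and the unused nthreads/tid bookkeeping is dropped.

```cpp
#include <cassert>
#include <vector>

// Hypothetical particle record; the real type of V_in is not shown in the post.
struct Particle {
    double x, y, z;  // position
    double q, s;     // parameters handed to energy()
};

// Stand-in for the poster's energy(); the real function is not shown.
inline double energy(double rr, double q1, double q2, double s1, double s2) {
    assert(rr >= 0.0);  // rr is a squared distance
    return q1 * q2 / (rr + s1 * s2);
}

// Sums the energy between v_in and every particle in others.
double pair_energy_sum(const Particle& v_in, const std::vector<Particle>& others) {
    double e_t_sum = 0.0;
    const long n = static_cast<long>(others.size());

    // v_in, others and n are only read, so they stay shared (the default).
    // Only the accumulator needs special treatment, via reduction.
#pragma omp parallel for reduction(+ : e_t_sum)
    for (long i = 0; i < n; ++i) {
        // Declared inside the loop body, so each iteration has its own copy.
        const double d_x = v_in.x - others[i].x;
        const double d_y = v_in.y - others[i].y;
        const double d_z = v_in.z - others[i].z;
        const double rr = d_x * d_x + d_y * d_y + d_z * d_z;

        e_t_sum += energy(rr, v_in.q, others[i].q, v_in.s, others[i].s);
    }
    return e_t_sum;
}
```

Note that a per-iteration result such as ee_t should not be shared: every thread would write to the same variable. Here it has been folded directly into the reduction, so the question does not arise.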

Alain_D_Intel · ‎05-18-2009

Hello Masonk,

I agree with the two previous comments.

Pragmatic tricks:

Use Intel MPI 3.2 for hybrid MPI/OpenMP and set the environment variables as follows:
export I_MPI_PIN_DOMAIN=omp
export KMP_AFFINITY=verbose,compact
export OMP_NUM_THREADS=xx
and don't overload your machine: OMP_NUM_THREADS * (MPI processes per node) <= cores per node in your hardware configuration
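
As a worked instance of that rule, here is a small sketch. The function names are hypothetical, and the figure of 2 MPI processes per node is only an assumption for illustration; the 8 cores come from the testbed mentioned in the original post. With 2 processes on 8 cores, OMP_NUM_THREADS can be at most 4.

```cpp
#include <cassert>

// Largest OMP_NUM_THREADS that keeps ranks_per_node * threads <= cores_per_node.
// Returns 0 when there are more MPI processes than cores on the node.
int max_omp_threads(int ranks_per_node, int cores_per_node) {
    assert(ranks_per_node > 0 && cores_per_node > 0);
    return cores_per_node / ranks_per_node;  // integer division rounds down
}

// True when the chosen configuration asks for more threads than there are cores.
bool oversubscribed(int ranks_per_node, int omp_threads, int cores_per_node) {
    return ranks_per_node * omp_threads > cores_per_node;
}
```

For example, `max_omp_threads(2, 8)` gives 4, while `oversubscribed(2, 8, 8)` is true because 16 threads would compete for 8 cores.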

Programming advice:
Use "default(none)" in OpenMP. It forces you to scope every variable explicitly as private or shared (and to ask yourself the right questions for performance improvement) ==> quite a good automatic debugging aid :=) :=)
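
A minimal sketch of what default(none) looks like in practice. The function `sum_sq_dist` and its arguments are hypothetical, not taken from the original code. Every variable from the enclosing scope that the loop touches has to appear in a clause, otherwise the compiler rejects the pragma; that is the "automatic debug" effect.

```cpp
#include <cassert>

// Hypothetical example: sum of squared distances from (px, py) to n points.
double sum_sq_dist(const double* xs, const double* ys, int n, double px, double py) {
    assert(n >= 0);
    double total = 0.0;

    // With default(none), leaving any of xs, ys, n, px, py out of shared()
    // is a compile-time error instead of a silent scoping decision.
#pragma omp parallel for default(none) shared(xs, ys, n, px, py) reduction(+ : total)
    for (int i = 0; i < n; ++i) {
        // Variables declared inside the loop are private and need no clause.
        const double dx = xs[i] - px;
        const double dy = ys[i] - py;
        total += dx * dx + dy * dy;
    }
    return total;
}
```

The loop index `i` is declared in the for statement, so it is private without being listed.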

Cheers.

TimP · ‎05-18-2009

Alain gives excellent advice. My one reservation: Intel Thread Checker doesn't play well with default(none). If that were to be corrected, I would agree 100%.

joseph-krahn · ‎06-06-2009

Quoting - masonk

Hi there. First post. Just to disclaim this: I am a pretend programmer. I'm a biophysics researcher.
...

I wouldn't call you a "pretend" programmer. Most OpenMP is probably written by "non-programmers".

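Is your energy() function just a simple function, or are there cut-offs that make large differences in calculation times? If the cost per iteration varies significantly, you may need to add a SCHEDULE() clause, for example SCHEDULE(DYNAMIC), so that threads that draw cheap iterations don't sit idle.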
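
A sketch of how that might look, assuming a hypothetical cutoff inside the energy evaluation so that some iterations are far cheaper than others. The names `energy_with_cutoff` and `total_energy`, the placeholder formula, and the chunk size of 64 are all assumptions to be tuned, not values from the original code.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical cost model: pairs beyond the cutoff are skipped and cost almost nothing.
inline double energy_with_cutoff(double rr, double cutoff_sq) {
    if (rr > cutoff_sq) return 0.0;
    return std::exp(-rr) / std::sqrt(rr + 1.0);  // placeholder for the real work
}

double total_energy(const double* rr, int n, double cutoff_sq) {
    assert(n >= 0);
    double sum = 0.0;

    // schedule(dynamic, 64): idle threads grab the next chunk of 64 iterations,
    // which evens out the load when iteration cost is irregular.
#pragma omp parallel for schedule(dynamic, 64) reduction(+ : sum)
    for (int i = 0; i < n; ++i) {
        sum += energy_with_cutoff(rr[i], cutoff_sq);
    }
    return sum;
}
```

Dynamic scheduling adds some bookkeeping per chunk, so when every iteration costs about the same, the default static schedule is usually the cheaper choice.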

Joe