Dear all,
I want to know what techniques I can use inside an already parallelized OpenMP for loop to gain further performance.
I am working on a code (snippet shown below) in which a parallelized for loop calls a function that Intel VTune flags as a hotspot. How can I reduce the execution time of that function? For example, can I use #pragma simd (see the sketch after the snippet below), or will that make it slower?
There is also another for loop, inside the already parallelized OpenMP for loop, that shows up as a hotspot.
Please let me know what I can try in order to improve performance for both of the problems mentioned above.
#pragma omp parallel
{
    #pragma omp for
    for(;;)
    {
        some lines of statements and function calls;
        for(;;)             //hotspot mentioned in second question
        {
            some lines of statements;
        }
        function_call();    //hotspot mentioned in first question
        some lines of statements;
    }
}                           //End of pragma omp parallel
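For reference, here is a minimal sketch (loop bounds, array names, and the function name are placeholders, not my real code) of where I am thinking of placing the simd pragma:

#pragma omp parallel
{
    #pragma omp for
    for(int i = 0; i < n; ++i)
    {
        /* some statements and function calls */
        #pragma simd                            /* candidate for the inner hotspot loop */
        for(int j = 0; j < m; ++j)
        {
            out[i*m + j] = a[i*m + j] * b[j];   /* placeholder body */
        }
        hotspot_function(i);                    /* the hotspot function call */
        /* some more statements */
    }
}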
Achieving vectorization of an inner loop in a parallel region is certainly a likely way to optimize, but you haven't yet furnished enough information to guess whether it will succeed.
@Tim Prince
Sir,
I tried using #pragma simd to vectorize the time-consuming for loops inside the OpenMP parallel region, but the execution time increased instead of decreasing. That is why I want to know what else I can try to optimize this piece of code.
I am sharing the VTune results I got for both cases, that is, without simd and with simd.
The total execution time is 370 seconds without simd and 669 seconds with simd.
You will have to disclose more of the actual code. You only need to supply the control loops, inclusive of the #pragma omp directives. Also include calls to any functions that have serializing effects or side effects (random number generator, barrier, critical section, mutex, etc.).
Jim Dempsey
Dear sir,
Please refer to the code sample below.
After implementing this OpenMP parallelization I got a speedup of 9.49x (time reduced from 240 min to 25 min), and after using -O3, -xAVX, and KMP_AFFINITY=compact I got a final speedup of 33.81x (execution time: 7.1 min). After all of this, VTune still shows up to 40% potential gain, which is why I am trying to capture that gain with simd, but I am not getting any good result.
#pragma omp parallel shared(shared var list) private(private var list)
{
    #pragma omp for schedule(dynamic, chunk)
    for(;;)                                 //main for
    {
        get_Grid_Velocities();              //function call
        for(;;)                             //2nd for
        {
            Statements;
            if(condition)                   //main if condition
            {
                Statement;
                if(condition)
                {
                    Statements;
                }
                else
                {
                    if(condition)
                    {
                        Statements;
                    }
                    else
                    {
                        Statements;
                    }
                }                           //End of if-else
                Statements;
                for(;;)
                {
                    Statements;
                }
                Statements;
                if(condition)
                {
                    Statements with function call;
                }
                else
                {
                    Statements with function call;
                }
                Statements;
                get_NDT_Depth();            //function call
                get_NDT_Depth();            //function call
                //Tried pragma simd here but no use
                for(;;)                     //Hotspot
                {
                    Statement;
                    if(condition)
                    {
                        Statement;
                    }
                    else
                    {
                        Statement;
                    }
                    Statements;
                    if(condition)
                    {
                        Statement;
                    }
                    else
                    {
                        Statement;
                    }
                    Statements;
                    if(condition)
                    {
                        Statement;
                    }
                }                           //Hotspot for loop ends
                trace_Interp();             //Function call - hotspot
                #pragma simd
                for(;;)
                {
                    Statement;
                }
            }                               //main if condition
        }                                   //2nd for loop
        if(condition)
        {
            for(;;)
            {
            }
        }
        else
        {
            for(;;)
            {
            }
        }
        for(;;)
        {
            Statement;
            for(;;)
            {
                Statements;
            }
        }
    }                                       //main for loop
}                                           //end of omp shared

TraceInterp function:

void trace_Interp()
{
    Statements;
    for(condition)
    {
        if()
        {
        }
        else
        {
        }
    }
    //Tried pragma simd here but no use
    for(;;)                                 //Hotspot
    {
        tr_out = tr_in[int_samp_out-2]*inp_sp->coeff_sync[indx_sync][0]
               + tr_in[int_samp_out-1]*inp_sp->coeff_sync[indx_sync][1]
               + tr_in[int_samp_out  ]*inp_sp->coeff_sync[indx_sync][2]
               + tr_in[int_samp_out+1]*inp_sp->coeff_sync[indx_sync][3]
               + tr_in[int_samp_out+2]*inp_sp->coeff_sync[indx_sync][4]
               + tr_in[int_samp_out+3]*inp_sp->coeff_sync[indx_sync][5]
               + tr_in[int_samp_out+4]*inp_sp->coeff_sync[indx_sync][6]
               + tr_in[int_samp_out+5]*inp_sp->coeff_sync[indx_sync][7];
    }
}                                           //End of trace_Interp function
Your for loops with branches will not SIMDize unless the conditional statements are simple assignments, similar to:
if(A < B)
    B = A * k;
else
    B = C - A;
Note that the statements contain no division and are short enough that computing both paths in vector form with masked moves is more efficient than computing one path in scalar. The arrays must have contiguous (and aligned) access too.
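A minimal sketch, assuming plain float arrays A, B, C and a scalar k (all placeholder names, not your code), of the branchy loop form that the compiler can turn into masked vector blends:

void branchy_kernel(const float *A, float *B, const float *C, float k, int n)
{
    #pragma omp simd                /* or the Intel-specific #pragma simd you already tried */
    for(int i = 0; i < n; ++i)
    {
        if(A[i] < B[i])
            B[i] = A[i] * k;        /* both paths are cheap assignments, so    */
        else
            B[i] = C[i] - A[i];     /* a vector blend beats a scalar branch    */
    }
}

Both branches are evaluated in vector registers and the results are selected with a mask, so the loop stays fully vectorized.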
Your last for(;;) loop contains what amounts to a horizontal add. Try finessing the compiler into performing the multiplications in vector form from memory fetches and the horizontal add from registers:
{
    __declspec(align(64)) T temp[8];        // don't know what type your arrays are
    T* p_tr_in = &tr_in[int_samp_out-2];
    T* p_coeff_sync = &inp_sp->coeff_sync[indx_sync][0];
    for(int i = 0; i < 8; ++i)              // vectorizable product loop
    {
        temp[i] = p_tr_in[i] * p_coeff_sync[i];
    }
    tr_out = temp[0]                        // horizontal add
           + temp[1]
           + temp[2]
           + temp[3]
           + temp[4]
           + temp[5]
           + temp[6]
           + temp[7];
}
If that gives only a marginal improvement, then see whether you can extend the temp array size and produce more products per pass.
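As an alternative sketch (same placeholder names as your posted trace_Interp code, and assuming double-precision data), you can also express the 8-tap sum as a short reduction loop so the compiler emits the vector multiply plus horizontal add for you:

/* inside the hotspot loop of trace_Interp, replacing the unrolled 8-term expression */
double sum = 0.0;
#pragma omp simd reduction(+:sum)
for(int t = 0; t < 8; ++t)
{
    sum += tr_in[int_samp_out - 2 + t] * inp_sp->coeff_sync[indx_sync][t];
}
tr_out = sum;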
Jim Dempsey