Dear all,
I would like to know which techniques I can use inside an already parallel OpenMP for loop to gain performance.
I am working on a code (snippet shown below) in which a for loop (already parallelized) calls a function that Intel VTune reports as a hotspot. How can I reduce the execution time of that function? For example, can I use #pragma simd, or will it make things slower?
There is also another for loop inside the already parallelized OpenMP for loop that is a hotspot as well.
Please let me know what I can do to improve performance for both of the problems mentioned above.
#pragma omp parallel
{
    #pragma omp for
    for(;;)
    {
        some line of statements and function calls;
        for(;;) // hotspot mentioned in second question
        {
            some line of statements;
        }
        function call; // hotspot mentioned in first question
        some line of statements;
    }
} // End of pragma omp parallel
- Tags:
- Parallel Computing
Vectorizing an inner loop inside a parallel region is certainly a likely way to optimize, but you haven't yet provided enough information to guess whether it will succeed.
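For readers following along, here is a minimal sketch of what "vectorizing an inner loop inside a parallel region" can look like, assuming the inner loop has independent iterations and unit-stride access. The function `scale_rows` and all of its parameters are hypothetical and are not taken from the poster's code. The standard OpenMP 4.0 `#pragma omp simd` is used here in place of the Intel-specific `#pragma simd` mentioned in the thread.

```c
#include <stddef.h>

/* Hypothetical example: scale every row of a rows x cols matrix by a
   per-row factor. The outer loop is divided among threads; the inner
   loop is marked for vectorization within each thread. */
void scale_rows(float *restrict out, const float *restrict in,
                const float *restrict factor, int rows, int cols)
{
    #pragma omp parallel for schedule(static)
    for (int r = 0; r < rows; ++r) {
        const float f = factor[r];
        const float *src = in + (size_t)r * cols;
        float *dst = out + (size_t)r * cols;

        /* Independent iterations, contiguous access: a good SIMD candidate. */
        #pragma omp simd
        for (int c = 0; c < cols; ++c) {
            dst[c] = src[c] * f;
        }
    }
}
```

Whether this pays off in a real code depends on the loop body (branches, function calls, memory access pattern), which is why more detail about the actual loops is needed. The pragmas only take effect when OpenMP is enabled at compile time; otherwise the code still runs serially.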
@Tim Prince
Sir,
I tried using #pragma simd to vectorize these time-consuming for loops inside the OpenMP parallel region, but the execution time increased instead of decreasing. That is why I wanted to know what else I can try to optimize this piece of code.
I am sharing the VTune results that I got for both cases, that is, without simd and with simd.
The total execution time is 370 seconds without simd and 669 seconds with simd.
You will have to disclose more of the actual code. You only need to supply the control loops, including the #pragma omp... directives. Also include calls to any functions that have serializing effects or side effects (random number generator, barrier, critical section, mutex, etc.).
Jim Dempsey
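To illustrate why serializing constructs matter, here is a small sketch comparing a critical section inside a parallel loop with an OpenMP reduction. The functions `sum_critical` and `sum_reduction` are hypothetical and only stand in for whatever the real loop body does; nothing here is taken from the poster's code.

```c
/* Serializing version: every iteration must wait its turn to enter the
   critical section, so the threads mostly run one after another. */
double sum_critical(const double *x, int n)
{
    double total = 0.0;
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        #pragma omp critical
        total += x[i] * x[i];
    }
    return total;
}

/* Reduction version: each thread accumulates into a private copy of
   total, and the copies are combined once at the end of the loop. */
double sum_reduction(const double *x, int n)
{
    double total = 0.0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; ++i) {
        total += x[i] * x[i];
    }
    return total;
}
```

Both functions return the same value for the small integer-valued inputs used to check them; with general floating-point data the results may differ in the last bits because the order of additions changes. A function called from the loop may hide a similar lock internally, which is why such calls should be shown when asking for help.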
Dear sir,
Please refer to the code sample below.
After implementing this OpenMP parallelization I got a speedup of 9.49x (time reduced from 240 min to 25 min), and after using O3, xAVX and KMP_AFFINITY=compact I got a final speedup of 33.81x (execution time: 7.1 min). After all this I ran VTune, which shows up to 40% potential gain. That is why I am trying to achieve this gain with simd, but I am not getting any good result.
#pragma omp parallel shared(shared var list) private(private var list)
{
    #pragma omp for schedule(dynamic, chunk)
    for(;;) //main for
    {
        get_Grid_Velocities(); //function call
        for(;;) //2nd for
        {
            Statements;
            if(condition) // main if condition
            {
                Statement;
                if(condition)
                {
                    Statements;
                }
                else
                {
                    if(condition)
                    {
                        Statements;
                    }
                    else
                    {
                        Statements;
                    }
                } //End of if-else
                Statements;
                for(;;)
                {
                    Statements;
                }
                Statements;
                if(condition)
                {
                    Statements with function call;
                }
                else
                {
                    Statements with function call;
                }
                Statements;
                get_NDT_Depth(); //function call
                get_NDT_Depth(); //function call
                //Tried pragma simd here but no use
                for(;;) //Hotspot
                {
                    Statement;
                    if(condition)
                    {
                        Statement;
                    }
                    else
                    {
                        Statement;
                    }
                    Statements;
                    if(condition)
                    {
                        Statement;
                    }
                    else
                    {
                        Statement;
                    }
                    Statements;
                    if(condition)
                    {
                        Statement;
                    }
                } //Hotspot for loop ends
                trace_Interp(); //Function call - hotspot
                #pragma simd
                for(;;)
                {
                    Statement;
                }
            } //main if condition
        } //2nd for loop
        if(condition)
        {
            for(;;)
            {
            }
        }
        else
        {
            for(;;)
            {
            }
        }
        for(;;)
        {
            Statement;
            for(;;)
            {
                Statements;
            }
        }
    } //main for loop
} //end of omp shared

trace_Interp function:

void trace_Interp()
{
    Statements;
    for(condition)
    {
        if()
        {
        }
        else
        {
        }
    }
    //Tried pragma simd here but no use
    for(;;) //Hotspot
    {
        tr_out = tr_in[int_samp_out-2]*inp_sp->coeff_sync[indx_sync][0]
               + tr_in[int_samp_out-1]*inp_sp->coeff_sync[indx_sync][1]
               + tr_in[int_samp_out]*inp_sp->coeff_sync[indx_sync][2]
               + tr_in[int_samp_out+1]*inp_sp->coeff_sync[indx_sync][3]
               + tr_in[int_samp_out+2]*inp_sp->coeff_sync[indx_sync][4]
               + tr_in[int_samp_out+3]*inp_sp->coeff_sync[indx_sync][5]
               + tr_in[int_samp_out+4]*inp_sp->coeff_sync[indx_sync][6]
               + tr_in[int_samp_out+5]*inp_sp->coeff_sync[indx_sync][7];
    }
} //End of trace_Interp function
Your for loops with branches will not SIMDize unless the conditional statements are simple assignments similar to
if(A < B)
    B = A * k;
else
    B = C - A;
Note that the statements contain no division and are short enough that computing both paths in vector form with a masked move is more efficient than computing one path in scalar. The arrays must also have contiguous (and aligned) access.
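As a sketch of the pattern described above, the loop below applies that if/else to whole arrays. The function name `update_branchy` and the array names are hypothetical, the element type is assumed to be float, and the standard `#pragma omp simd` is used in place of the Intel-specific `#pragma simd`. Whether a given compiler actually vectorizes it should be checked in its vectorization report.

```c
/* Hypothetical example of a branch simple enough to vectorize:
   both sides are cheap arithmetic with no division. */
void update_branchy(float *restrict b, const float *restrict a,
                    const float *restrict c, float k, int n)
{
    #pragma omp simd
    for (int i = 0; i < n; ++i) {
        /* Compute both paths, then select per element. A vectorizing
           compiler can turn the selection into a masked move or blend. */
        const float if_path = a[i] * k;
        const float else_path = c[i] - a[i];
        b[i] = (a[i] < b[i]) ? if_path : else_path;
    }
}
```

The arrays are walked with unit stride, matching the contiguous-access requirement. Alignment is not enforced in this sketch.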
Your last for(;;) loop has what amounts to a horizontal add. Try coaxing the compiler into performing the multiplications in vector form from memory fetches and the horizontal add from registers:
{
    _declspec(align(64)) T temp[8]; // don't know what type your arrays are
    T* p_tr_in = &tr_in[int_samp_out-2];
    T* p_coeff_sync = &inp_sp->coeff_sync[indx_sync][0];
    for(int i = 0; i < 8; ++i) //Hotspot
    {
        temp[i] = p_tr_in[i]*p_coeff_sync[i];
    }
    for(;;) //Hotspot
    {
        tr_out = temp[0]
               + temp[1]
               + temp[2]
               + temp[3]
               + temp[4]
               + temp[5]
               + temp[6]
               + temp[7];
    }
}
If that gives a marginal improvement, then see if you can extend the size of the temp array and the number of products computed.
Jim Dempsey
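A self-contained sketch of the suggestion above, under some assumptions: the element type is taken to be float (the thread does not say), the eight coefficients for one `indx_sync` are assumed to be stored contiguously, and the helper name `interp8` with its parameter `coeff_row` is hypothetical. The caller is assumed to guarantee that indices `int_samp_out-2` through `int_samp_out+5` are valid.

```c
/* Hypothetical helper: 8-tap weighted sum of consecutive samples.
   Step 1 forms the eight products in a loop the compiler can vectorize;
   step 2 adds them up (the "horizontal add"). */
float interp8(const float *tr_in, int int_samp_out, const float *coeff_row)
{
    /* First of the 8 consecutive input samples used by the filter. */
    const float *p_tr_in = &tr_in[int_samp_out - 2];
    float temp[8];

    #pragma omp simd
    for (int i = 0; i < 8; ++i) {
        temp[i] = p_tr_in[i] * coeff_row[i];
    }

    /* Same left-to-right order of additions as the original expression. */
    return temp[0] + temp[1] + temp[2] + temp[3]
         + temp[4] + temp[5] + temp[6] + temp[7];
}
```

In the poster's code this would be called as something like `tr_out = interp8(tr_in, int_samp_out, inp_sp->coeff_sync[indx_sync]);`, assuming `coeff_sync[indx_sync]` yields a pointer to eight floats. Whether it beats the original expression has to be measured.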
