Possible opportunities within OpenMP parallel for loop

Londhe__Ashutosh · ‎06-03-2015

Dear all,

I would like to know which techniques I can use inside an already parallelized OpenMP for loop to gain further performance.

I am working on a code (snippet shown below) in which a for loop (already parallelized) calls a function that Intel VTune reports as a hotspot. How can I reduce the execution time of that function? For example, can I use #pragma simd, or will that make it slower?

There is also another for loop inside the already parallelized OpenMP for loop, and it is a hotspot as well.

Please let me know what I can do to improve performance in both of the cases mentioned above.

#pragma omp parallel
{
    #pragma omp for
    for(;;)
    {
        some line of statements and function calls;

        for(;;) // hotspot mentioned in second question
        {
            some line of statements;
        }

        function call; // hotspot mentioned in first question

        some line of statements;
    }
} // End of pragma omp parallel

TimP · ‎06-04-2015

Achieving vectorization of an inner loop in a parallel region is certainly a likely way to optimize. You haven't yet furnished enough information for anyone to guess whether it will succeed.
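As a rough illustration of that pattern (threads on the outer loop, vector lanes on the inner one), here is a minimal sketch. The function scale_rows, its arguments, and the row-major float layout are assumptions made up for the example; they are not taken from the code under discussion. It uses the OpenMP 4.0 form #pragma omp simd on the inner loop.

```c
#include <stddef.h>

/* Hypothetical kernel: multiply every row of a row-major matrix by a
   per-row factor. Rows are independent, so they are shared among threads;
   the unit-stride inner loop is the vectorization candidate. */
void scale_rows(float *a, const float *scale, int rows, int cols)
{
    #pragma omp parallel for schedule(static)
    for (int r = 0; r < rows; ++r) {
        float *row = a + (size_t)r * cols;   /* start of row r */
        const float s = scale[r];            /* loop-invariant in the inner loop */

        #pragma omp simd
        for (int c = 0; c < cols; ++c) {
            row[c] *= s;                     /* contiguous, branch-free access */
        }
    }
}
```

Whether this helps in a real code depends on the inner loop body: the sketch works because the accesses are contiguous and there are no branches or function calls inside the simd loop.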

Londhe__Ashutosh · ‎06-04-2015

@Tim Prince

Sir,

I tried using #pragma simd to vectorize these time-consuming for loops inside the OpenMP parallel region, but the execution time increased instead of decreasing. That is why I wanted to know what else I can try to optimize this piece of code.

I am sharing the VTune results that I got for both cases, that is, without simd and with simd.

Total execution time is 370 secs. without simd and 669 secs. with simd.

jimdempseyatthecove · ‎06-08-2015

You will have to disclose more of the actual code. You only need to supply the control loops, inclusive of the #pragma omp.... Also include calls to any functions that have serializing effects/side effects (random number generator, barrier, critical section, mutex, etc.).

Jim Dempsey

Londhe__Ashutosh · ‎06-08-2015

Dear sir,

Please refer to the code sample below.

After implementing this OpenMP parallelization I got a speedup of 9.49x (time reduced from 240 min to 25 min), and after using O3, xAVX and KMP_AFFINITY=compact I got a final speedup of 33.81x (execution time: 7.1 min). After all this I ran VTune, which shows up to 40% potential gain. That is why I am trying to achieve this gain with simd, but I am not getting any good result.

#pragma omp parallel shared(shared var list) private(private var list)
{

#pragma omp for schedule(dynamic, chunk)
for(;;)   //main for
{
       get_Grid_Velocities();   //function call

       for(;;)   //2nd for
       {
            Statements;

            if(condition) // main if condition
            {
                Statement;
                if(condition)
                {
                       Statements;
                }
                else
                {
                      if(condition)
                      {
                           Statements;
                      }
                      else
                      {
                           Statements;
                      }
                 } //End of if-else

                 Statements;

                 for(;;)
                 {
                      Statements;
                 }

                 Statements;  
                 if(condition)
                 {
                      Statements with function call;
                 }
                 else
                 {
                      Statements with function call;
                 }

                Statements;

                get_NDT_Depth();  //function call

                get_NDT_Depth();  //function call

                //Tried pragma simd here but no use 
                for(;;)      //Hotspot
                {
                       Statement;

                       if(condition)
                       {
                            Statement;
                       }
                       else
                       {
                            Statement;
                       }
                       Statements;

                       if(condition)
                       {
                            Statement;
                       }
                       else
                       {
                             Statement;
                       }

                       Statements;
                       if(condition)
                       {
                           Statement;                                
                       }
                }//Hotspot for loop ends

                trace_Interp();   //Function call - hotspot

                #pragma simd
                for(;;)
                {
                     Statement;
                }

            } //main if condition
       } //2nd for loop

       if(condition)
       {
            for(;;)
            {

            }
       }
       else
       {
            for(;;)
            {

            }
       }

       for(;;)
       {
            Statement;
            for(;;)
            {
                 Statements;
            }
       }

} //main for loop
} //end of omp parallel


trace_Interp function:
void trace_Interp()
{
        Statements;

        for(condition)
        { 
                if()
                {
                }
                else
                {
                }
        }

        //Tried pragma simd here but no use
        for(;;)  //Hotspot
        {
                
                tr_out = tr_in[int_samp_out-2]*inp_sp->coeff_sync[indx_sync][0]
                       + tr_in[int_samp_out-1]*inp_sp->coeff_sync[indx_sync][1]
                       + tr_in[int_samp_out]*inp_sp->coeff_sync[indx_sync][2]
                       + tr_in[int_samp_out+1]*inp_sp->coeff_sync[indx_sync][3]
                       + tr_in[int_samp_out+2]*inp_sp->coeff_sync[indx_sync][4]
                       + tr_in[int_samp_out+3]*inp_sp->coeff_sync[indx_sync][5]
                       + tr_in[int_samp_out+4]*inp_sp->coeff_sync[indx_sync][6]
                       + tr_in[int_samp_out+5]*inp_sp->coeff_sync[indx_sync][7];
        }
}//End of trace_Interp function

jimdempseyatthecove · ‎06-09-2015

Your for loops with branches will not SIMDize unless the conditional statements are simple assignments similar to

if(A < B)
  B = A * k;
else
  B = C - A;

Note that the statements contain no division and are short enough that computing both paths in vector form with a masked move is more efficient than computing one path in scalar. The array accesses must also be contiguous (and aligned).
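To make the if/else case above concrete, here is a minimal sketch of such a loop written so that both paths are computed and one is selected per element. The function name select_update, the float type, and the restrict qualifiers (a promise that the arrays do not overlap) are assumptions for illustration only.

```c
/* Hypothetical loop of the form: if (A < B) B = A * k; else B = C - A;
   Both candidate values are cheap and free of side effects, so a compiler
   can evaluate them for all lanes and blend the results with a mask. */
void select_update(float *restrict b, const float *restrict a,
                   const float *restrict c, float k, int n)
{
    #pragma omp simd
    for (int i = 0; i < n; ++i) {
        const float if_true  = a[i] * k;     /* taken when a[i] <  b[i] */
        const float if_false = c[i] - a[i];  /* taken when a[i] >= b[i] */
        b[i] = (a[i] < b[i]) ? if_true : if_false;
    }
}
```

If either branch contained a function call, a division, or a store to some other location, this rewrite would no longer be a safe or obviously profitable substitute for the branchy scalar loop.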

Your last for(;;) loop has what amounts to a horizontal add. Try finessing the compiler into performing the multiplications in vector from memory fetches and the horizontal add from registers:

{
  __declspec(align(64)) T temp[8]; // don't know what type your arrays are
  T* p_tr_in = &tr_in[int_samp_out-2];
  T* p_coeff_sync = inp_sp->coeff_sync[indx_sync]; // first of the 8 coefficients in this row
  for(int i = 0; i < 8; ++i)  //Hotspot
  {
    temp[i] = p_tr_in[i] * p_coeff_sync[i];
  }
  for(;;)  //Hotspot
  {
    tr_out = temp[0]
           + temp[1]
           + temp[2]
           + temp[3]
           + temp[4]
           + temp[5]
           + temp[6]
           + temp[7];
  }
}

If that gives marginal improvement, then see if you can extend the temp array size and product production.
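For reference, a self-contained sketch of the same idea as a plain C function. The float type, the function name interp8, and passing the coefficient row as a flat pointer are assumptions, since the real types were not posted. The pairwise sum is one possible way to shorten the dependency chain; it can round differently from the original left-to-right sum.

```c
#define NTAPS 8

/* Hypothetical 8-tap interpolation: the products are written to a small
   temporary array (vectorizable), then reduced in a second step. */
float interp8(const float *tr_in, const float *coeff, int int_samp_out)
{
    const float *p = &tr_in[int_samp_out - 2];  /* first of the 8 input samples */
    float temp[NTAPS];

    #pragma omp simd
    for (int i = 0; i < NTAPS; ++i) {
        temp[i] = p[i] * coeff[i];              /* element-wise products */
    }

    /* Pairwise reduction of the 8 products (the "horizontal add"). */
    return ((temp[0] + temp[1]) + (temp[2] + temp[3]))
         + ((temp[4] + temp[5]) + (temp[6] + temp[7]));
}
```

The caller must guarantee that indices int_samp_out-2 through int_samp_out+5 are valid in tr_in, exactly as in the original expression.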

Jim Dempsey