parallel_for:blocked_range 1D or 2D

akhal · 07-01-2011

Hi,
I am kind of a newbie with Intel TBB and am trying to parallelize a problem that worked well with OpenMP but doesn't show any speedup with TBB, even though the loop iterations are independent. I thought a 2D blocked_range might help; it does show a speedup, but the results of the calculation are wrong. My code is as follows:


/*-----Serial Version-----*/
for(k=0; k<size; k++)
{
    for(i=k+1; i<size; i++)
    {
        s[i][k] = s[i][k]/s[k][k];
        for(j=k+1; j<size; j++)
            s[i][j] -= s[i][k]*s[k][j];
    }
}
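
Reading the serial loop nest as an in-place LU factorization without pivoting (an assumption; the thread never names the algorithm), a self-contained version might look like the sketch below. The names `Matrix` and `lu_serial` are made up for illustration and are not from the original post.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// In-place LU factorization without pivoting (assumed reading of the loop nest).
// Afterwards the strict lower triangle holds the multipliers (L, unit diagonal
// implied) and the upper triangle, including the diagonal, holds U.
void lu_serial(Matrix& a) {
    const std::size_t size = a.size();
    for (std::size_t k = 0; k < size; ++k) {
        for (std::size_t i = k + 1; i < size; ++i) {
            assert(a[k][k] != 0.0);            // no pivoting, so the pivot must be nonzero
            a[i][k] /= a[k][k];                // multiplier for row i
            for (std::size_t j = k + 1; j < size; ++j)
                a[i][j] -= a[i][k] * a[k][j];  // eliminate below the pivot
        }
    }
}
```

For a fixed `k`, each row `i` only reads row `k` and writes row `i`, which is why the `i` loop is the natural one to hand to a parallel construct, while the `k` loop has to stay ordered.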

/*OpenMP version (which shows considerable speedup) */ 
#pragma omp parallel default(shared) private(k)
 for(k=0; k {
#pragma omp for private(i,j) schedule(static)
 for(i=k+1; i {
 a = a/a;
 for(j=k+1; j a1 = a1 - a1*a1;
 }
 }

/* TBB version (1D blocked_range) */
task_scheduler_init TBBinit(nthreads); 
for(int k=0; k parallel_for(blocked_range(k, size, (size-k)/nthreads), my_class(a2));
/* setting grainsize to that values reduced time but still its multiple of serial exection time:( */

class my_class
{
double** my_a; 

 
public:
 my_class(a[size][size]):my_a(a){}
 
 void operator() (const blocked_range& r) const
 {

double** a2 = my_a;
 int k = r.begin(); 
 for(int i=k+1; i!=size; i++)
 {
 a2 = a2/a2; 
 for(j=k+1; j!=size; j++)
 a2 = a2 - a2*a2;
 }  
 }
}; //This 1-D gives so poor performance

/*----- I tried 2-D range as follows-------*/
for(int k=0; k parallel_for(blocked_range2d(k,size,(size-k)/nthreads,k,size,(size-k)/nthreads), my_class2d(a3));
//Class body
class my_class2d
{
    double** my_a;

public:
    my_class2d(a[size][size]):my_a(a){}

    void operator() (const blocked_range2d& r) const
    {
        double** a3 = my_a;
        int k = r.rows().begin();
        int end = r.rows().end(); //or r.cols().end()

        for(int i=k+1; i!=end; i++)
        {
            a3 = a3/a3;
            for(j=k+1; j!=end; j++)
                a3 = a3 - a3*a3;
        }
    }
};
//But this 2D attempt gives wrong results

Is this structure even parallelizable with TBB? If yes, should it use a 1D range or a 2D range? My 1D range example gives correct results but is far slower than even the serial version, and the 2D version is fast but gives wrong results. Any help?

Kirill_R_Intel · 07-06-2011

Akhal,

Could you please provide a complete code sample, with input data and expected results? I'll run and analyze it on my side.

Regards,
Kirill

akhal · 07-07-2011

I have put my code in that private post now. I would have attached my code file, but I don't know how to attach it, so I pasted my whole code. It is very simple code, but long, because I am trying multiple versions of the same TBB code. I have put my problems as comments in the TBB functor classes. You can run the code and also look at the TBB functor classes (with comments) to see what I did wrong. Thanks anyway

akhal · 07-08-2011

I have already tried this tbb::parallel_for, but in this example it behaves weirdly: I can't use r.end() or r.rows().end() and r.cols().end(), so I have to pass another parameter for the loop end. This is also strange to me, as it works in other cases. Speedup is of prime importance in this example, and I have put a lot of time into it but couldn't figure out why I can't get any speedup...

akhal · 07-11-2011

Anybody there??

robert-reed · 07-11-2011

Quoting akhal

Anybody there??

Hello. I haven't been able to pay much attention to the Intel TBB forum recently, my life filled with other tasks. But I couldn't miss your plea for help, and decided to take a look. That code is definitely newbie.

[cpp]/* TBB version (1D blocked_range) */
task_scheduler_init TBBinit(nthreads);                                                                   
for(int k=0; k(k, size, (size-k)/nthreads),  my_class(a2));
/* setting grainsize to that values reduced time but still its multiple of serial exection time:( */

class my_class
{
double** my_a;                                                                                                                                 
public:
  my_class(a[size][size]):my_a(a){}                                                                                                              
  void operator() (const blocked_range& r) const
  {
    double** a2 = my_a;
    int k = r.begin();                                                                                                         
    for(int i=k+1; i!=size; i++)
      {
           a2 = a2/a2;  
           for(j=k+1; j!=size; j++)
                a2 = a2 - a2*a2;
       }       
  }
}; //This 1-D gives so poor performance

[/cpp]

I see this code has taken a novel approach to setting grain size. Normal policy is to avoid setting a specific number of threads, because if you run on a machine with more HW threads available you artificially limit performance. (There is a task_scheduler_init method to return the number of threads.)

But that's just an aside. The main issue is the function operator: the call partitions the range of k, but the function body does not honor that partition. The function should at least have:

[cpp]    int k = r.begin();
    int ke = r.end();
    for(int i=k+1; i!=ke; i++)
      {
[/cpp]

Otherwise, the first pool thread starts at k and goes to size, then the second pool thread starts at some number larger than k and goes to size, and so on. In an instant the code has multiple threads working on the same values of k, interfering with each other, which guarantees the wrong result and takes a lot of time to produce it. I can't even imagine how one would employ a blocked_range2d on this problem, and since I need to get back to other pressing work, I'll refrain from taking the opportunity now. :-)
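
A minimal sketch of a body that honors its sub-range is given below, assuming the loop nest is one step of an in-place LU factorization (the subscripts were lost in the posted code, so this is a guess). It differs from the snippet above in that the range covers the rows `k+1 .. size` and `k` is passed to the body as a member. `RowUpdate`, `IndexRange`, and `lu_chunked` are hypothetical names; `IndexRange` is only a stand-in for `tbb::blocked_range<int>` so that the sketch compiles without TBB.

```cpp
#include <cassert>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Stand-in for tbb::blocked_range<int>: only begin() and end() are used here.
struct IndexRange {
    int first, last;
    int begin() const { return first; }
    int end() const { return last; }
};

// Hypothetical body for one elimination step k. Within a step the rows are
// independent of each other, so every chunk may run on a different thread.
class RowUpdate {
    Matrix* a_;
    int size_, k_;
public:
    RowUpdate(Matrix& a, int k)
        : a_(&a), size_(static_cast<int>(a.size())), k_(k) {}

    template <class Range>
    void operator()(const Range& r) const {
        Matrix& a = *a_;
        for (int i = r.begin(); i != r.end(); ++i) {  // only the rows of this chunk
            a[i][k_] /= a[k_][k_];
            for (int j = k_ + 1; j != size_; ++j)
                a[i][j] -= a[i][k_] * a[k_][j];
        }
    }
};

// Driver that hands out the chunks one after another. With TBB, the loop over
// chunks would be replaced by
//   tbb::parallel_for(tbb::blocked_range<int>(k + 1, size), RowUpdate(a, k));
void lu_chunked(Matrix& a, int grain) {
    assert(grain > 0);
    const int size = static_cast<int>(a.size());
    for (int k = 0; k < size; ++k)
        for (int b = k + 1; b < size; b += grain) {
            const int e = b + grain < size ? b + grain : size;
            RowUpdate(a, k)(IndexRange{b, e});
        }
}
```

Because every row is touched by exactly one chunk per step, the result does not depend on the chunk size, which is the property a `parallel_for` body needs.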

akhal · 07-12-2011

Leave blocked_range2d aside (I only tried it because I didn't get any performance gain with blocked_range). For blocked_range, I have written in the comments that if I use r.begin() and r.end() (and not size) to limit the loops, I get wrong results. So in fact it works the other way round from what you suggest: it gives correct (but slow) results when using "size", and wrong results (and even memory access violation errors) when using r.end() to limit the loops in the function operator body... That's weird to me too

robert-reed · 07-12-2011

I don't know what correct results represent in your algorithm (looks like some kind of matrix factorization/diagonalization code?) but I can assure you that partitioning a TBB parallel_for without limiting the upper bound in the kernel will generate incorrect results: it will not produce the same results as the serial implementation. It can't, because it will be re-executing portions of the inner loops rather than just spreading the work among the threads.
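
The re-execution can be made concrete with a small counting sketch. It emulates chunking sequentially (real TBB splits ranges recursively, so the exact chunk boundaries are an assumption) and counts how often each row is visited when the body does or does not stop at the end of its chunk. `visits_per_row` is a hypothetical helper, not a TBB call.

```cpp
#include <cassert>
#include <vector>

// Count how often each row index is visited when [first, last) is cut into
// chunks of `grain` rows and every chunk is handed to the loop body.
// honor_end == true : the body loops over [chunk_begin, chunk_end)
// honor_end == false: the body loops over [chunk_begin, last), ignoring chunk_end
std::vector<int> visits_per_row(int first, int last, int grain, bool honor_end) {
    assert(grain > 0 && 0 <= first && first <= last);
    std::vector<int> visits(last, 0);
    for (int b = first; b < last; b += grain) {
        const int e = b + grain < last ? b + grain : last;
        const int stop = honor_end ? e : last;
        for (int i = b; i != stop; ++i)
            ++visits[i];
    }
    return visits;
}
```

With 8 rows and a grain of 2, a body that stops at its chunk end visits every row once, while a body that always runs to `last` visits the rows 1, 1, 2, 2, 3, 3, 4, 4 times. In a real run those repeated visits would also happen concurrently on different threads.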

Looking more closely at the serial implementation, there doesn't seem to be any forward references that should require the elaborate overcomputation of the Intel TBB sample, but even the OpenMP code doesn't look right:

[cpp]/*OpenMP version (which shows considerable speedup)   */
  #pragma omp parallel default(shared) private(k)
  for(k=0; k
[/cpp]

Here I see an OpenMP parallel region created outside the outer loop, but with no pragma (like single or master) to limit those threads from all executing the outer loop. This makes me skeptical that this represents a correct parallelization of the serial code.
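
One possible reading of the OpenMP structure, offered as a sketch rather than a verdict on the posted code: OpenMP requires a worksharing `for` to be encountered by every thread of the team, and the construct ends in an implicit barrier unless `nowait` is given. Under that reading, every thread runs the outer `k` loop with its own private `k`, the `omp for` divides the `i` iterations among the threads, and the barrier keeps the `k` steps in order. The function below assumes the loop nest is an in-place LU factorization; `lu_openmp` is a hypothetical name.

```cpp
#include <cassert>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Assumed LU kernel with the parallel region hoisted outside the k loop.
// Compiled without OpenMP support, the pragmas are ignored and this runs serially.
void lu_openmp(Matrix& a) {
    const int size = static_cast<int>(a.size());
    #pragma omp parallel default(shared)
    for (int k = 0; k < size; ++k) {       // k is declared inside the region, so it is private
        assert(a[k][k] != 0.0);            // no pivoting, so the pivot must be nonzero
        #pragma omp for schedule(static)
        for (int i = k + 1; i < size; ++i) {
            a[i][k] /= a[k][k];
            for (int j = k + 1; j < size; ++j)
                a[i][j] -= a[i][k] * a[k][j];
        }
        // implicit barrier at the end of the omp for: step k is complete
        // on all threads before any thread starts step k + 1
    }
}
```

Hoisting the region this way avoids creating a new parallel region for every `k`, which may be part of why the OpenMP version showed a speedup.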