Intel® oneAPI Threading Building Blocks

Parallel_Scan taking more time than serial

Singh_Jasdeep
Beginner

I am running the same computation with parallel_scan and serially. The serial version is actually faster than the parallel_scan version.

The code I am using is:

#include <iostream>
#include <stdlib.h>
#include <time.h>
#include "tbb/task_scheduler_init.h"
#include "tbb/blocked_range.h"
#include "tbb/parallel_scan.h"
#include "tbb/tick_count.h"
#include "tbb/compat/thread"
using namespace std;
using namespace tbb;

// Body for parallel_scan: computes the running sum of x into y.
template <class T>
class Body
{
    T reduced_result;
    T* const y;
    const T* const x;

public:

    // Initializers are listed in member declaration order.
    Body( T y_[], const T x_[] ) : reduced_result(0), y(y_), x(x_) {}

    T get_reduced_result() const { return reduced_result; }

    // Called for both passes: the pre-scan only accumulates,
    // the final scan also stores the running sum into y.
    template<typename Tag>
    void operator()( const blocked_range<int>& r, Tag )
    {
        T temp = reduced_result;
        //cout<<"id of thread is \t"<<this_thread::get_id()<<endl;
        for( int i=r.begin(); i<r.end(); ++i )
        {
            temp = temp + x[i];
            if( Tag::is_final_scan() )
            {
                y[i] = temp;
                //cout<<i<<","<<y[i]<<endl;
            }
        }
        reduced_result = temp;
    }

    // Splitting constructor: the new body starts with an empty sum.
    Body( Body& b, split ) : reduced_result(0), y(b.y), x(b.x)
    {
        cout << " output of split is \t " << endl;
    }

    // Merge the sum of the preceding subrange (a) into this body.
    void reverse_join( Body& a )
    {
        reduced_result = a.reduced_result + reduced_result;
        // cout<< " output of reduced_result now is " << reduced_result << endl;
    }

    void assign( Body& b )
    {
        reduced_result = b.reduced_result;
        // cout<<"final value assigned"<<endl;
    }
};


// Runs the scan with parallel_scan, prints the elapsed time,
// and returns the total sum.
template<class T>
T DoParallelScan( T y[], const T x[], int n )
{
    Body<T> body(y,x);
    tick_count t1,t2;
    t1=tick_count::now();
    parallel_scan( blocked_range<int>(0,n), body, auto_partitioner() );
    t2=tick_count::now();
    cout<<"Time Taken for parallel scan is \t"<<(t2-t1).seconds()<<endl;
    return body.get_reduced_result();
}


// Runs the same scan in a plain loop, prints the elapsed time,
// and returns the total sum.
template<class T1>
T1 SerialScan( T1 y[], const T1 x[], int n )
{
    tick_count t3,t4;

    t3=tick_count::now();
    T1 temp = 0;

    for( int i=0; i<n; ++i )
    {
        // cout<<"id of thread is \t"<<this_thread::get_id()<<endl;
        temp = temp + x[i];
        y[i] = temp;
    }
    t4=tick_count::now();
    cout<<"Time Taken for serial scan is \t"<<(t4-t3).seconds()<<endl;
    return temp;
}


int main()
{
    task_scheduler_init init1(4);

    int y1[1000],x1[1000];

    for(int i=0;i<1000;i++)
        x1[i]=i+1;

    cout<<fixed;

    cout<<"\n serial scan output is \t"<<SerialScan(y1,x1,1000)<<endl;

    cout<<"\n parallel scan output is \t"<<DoParallelScan(y1,x1,1000)<<endl;

    return 0;
}

Please help me find where I am going wrong.

14 Replies
RafSchietekat
Black Belt

Try different grainsize values (blocked_range parameter, 1 by default, works with auto_partitioner as well as simple_partitioner)?

(Added) Do you see any difference if you don't evaluate blocked_range::end() inside the loop?
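To make the grainsize suggestion concrete, here is a rough model of how a range gets chopped up. `RangeModel` and `count_leaf_chunks` are hypothetical stand-ins, not TBB types; the sketch only assumes that a range is divisible while it holds more than `grainsize` elements and that it is split at its midpoint, as `blocked_range` does. With `auto_partitioner` the scheduler may stop splitting earlier, so treat the counts as an upper bound.

```cpp
#include <cstddef>

// Stand-in for tbb::blocked_range<int>: a half-open range [begin, end)
// that may be split while it holds more than `grainsize` elements.
struct RangeModel {
    int begin;
    int end;
    std::size_t grainsize;

    std::size_t size() const { return static_cast<std::size_t>(end - begin); }
    bool is_divisible() const { return grainsize < size(); }
};

// Counts the leaf subranges produced when every divisible range is
// split at its midpoint until nothing is divisible any more.
std::size_t count_leaf_chunks(const RangeModel& r) {
    if (!r.is_divisible()) return 1;
    int middle = r.begin + (r.end - r.begin) / 2;
    RangeModel left{r.begin, middle, r.grainsize};
    RangeModel right{middle, r.end, r.grainsize};
    return count_leaf_chunks(left) + count_leaf_chunks(right);
}
```

Under this model, 1000 elements with the default grainsize of 1 can end up as 1000 single-element chunks, while a grainsize of 250 gives 4 chunks. In the posted code the grainsize would be passed as the third constructor argument, for example `blocked_range<int>(0, n, 250)`.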

Singh_Jasdeep
Beginner

I have tried different grain sizes. The serial version takes only 3 usec, while the parallel version takes at least 703 usec. Please check whether my code is written correctly so that we can find what is going wrong.

RafSchietekat
Black Belt

The main issue here is problem size: try again with a lot more data, but don't get your hopes up too far because memory bandwidth might be a bottleneck.
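For a sense of scale, the following sketch times a plain serial scan for an arbitrary element count, so the 1000-element case can be compared with much larger inputs. `ScanTiming` and `timed_serial_scan` are names made up for this illustration, and it uses `std::partial_sum` and `std::chrono` rather than TBB, so the numbers are only a baseline, not a TBB measurement.

```cpp
#include <chrono>
#include <cstddef>
#include <numeric>
#include <vector>

// Result of one timed run: the last prefix sum and the elapsed seconds.
struct ScanTiming {
    long long last;
    double seconds;
};

// Fills x with 1, 2, ..., n, computes the inclusive prefix sum into y
// serially, and reports how long the scan itself took.
ScanTiming timed_serial_scan(std::size_t n) {
    std::vector<long long> x(n), y(n);
    std::iota(x.begin(), x.end(), 1LL);

    auto t0 = std::chrono::steady_clock::now();
    std::partial_sum(x.begin(), x.end(), y.begin());
    auto t1 = std::chrono::steady_clock::now();

    long long last = n ? y.back() : 0LL;
    return { last, std::chrono::duration<double>(t1 - t0).count() };
}
```

Calling `timed_serial_scan(1000)` and then something like `timed_serial_scan(100000000)` should show how little work the original test does per run: there is one addition per element, so scheduling overhead dominates small inputs and memory traffic tends to dominate large ones.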

Singh_Jasdeep
Beginner

Thanks for replying.

I have also increased the problem size, but the result is the same as before: the serial version takes about 0.6 sec and the parallel version takes 4.2 sec. I am stuck. My algorithm is of this type and I want to use parallel_scan for it, but it is not proving beneficial. If you have better code whose performance you have checked, it would be really helpful.

RafSchietekat
Black Belt

Did you remove end() and if() from the loop? Yes, that would mean two separate loops.
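One way to read "two separate loops" is to overload `operator()` on the tag type, so that the pre-scan pass never tests the tag inside the loop. The sketch below is an assumption about what that could look like, not code from this thread. `PreScanTag`, `FinalScanTag`, and `IntRange` are stand-ins so that it compiles without TBB; in real code they would be `tbb::pre_scan_tag`, `tbb::final_scan_tag`, and `tbb::blocked_range<int>`, and the body would also need the splitting constructor, `reverse_join`, and `assign` from the original post.

```cpp
#include <vector>

// Stand-ins for the TBB tag and range types (assumed equivalents:
// tbb::pre_scan_tag, tbb::final_scan_tag, tbb::blocked_range<int>).
struct PreScanTag   { static bool is_final_scan() { return false; } };
struct FinalScanTag { static bool is_final_scan() { return true; } };
struct IntRange {
    int b, e;
    int begin() const { return b; }
    int end() const { return e; }
};

template <class T>
class ScanBody {
    T sum;
    T* const y;
    const T* const x;

public:
    ScanBody(T* y_, const T* x_) : sum(0), y(y_), x(x_) {}
    T get_sum() const { return sum; }

    // Pre-scan pass: accumulate only, never write to y.
    void operator()(const IntRange& r, PreScanTag) {
        T temp = sum;
        for (int i = r.begin(), end = r.end(); i < end; ++i)
            temp += x[i];
        sum = temp;
    }

    // Final pass: accumulate and store each running total.
    void operator()(const IntRange& r, FinalScanTag) {
        T temp = sum;
        for (int i = r.begin(), end = r.end(); i < end; ++i) {
            temp += x[i];
            y[i] = temp;
        }
        sum = temp;
    }
};
```

Both loops read the range end once, before iterating, which covers the other half of the suggestion.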

Singh_Jasdeep
Beginner

I have removed the if from the loop. It worked, and the time has now dropped to almost 2 sec. Can you please tell me how to remove end()? If I do not use end(), how do I bound the loop? If I am not wrong, you are talking about r.end(). Thanks.

RafSchietekat
Black Belt

[cpp]

for( int i=r.begin(), end = r.end(); i != end; ++i )

[/cpp]

(Corrected.)

Singh_Jasdeep
Beginner

When I replace my for loop with this one, the program throws an exception at run time and terminates.

Singh_Jasdeep
Beginner

It gives this exception: "Assertion h!=small_local_task || p.origin ==this failed on line 617 of file z:\itt\branchtbb41\tbb\1.01src\tbb\scheduler.h"

Singh_Jasdeep
Beginner

Also, in the for loop you gave, i will never be equal to end, which is why it raised the exception. Can you please suggest an alternative?

RafSchietekat
Black Belt

Sorry, I was on my way out and in a hurry when I wrote that code. You should now be able to see the corrected version.

Singh_Jasdeep
Beginner

No, it is still the same code that you wrote earlier.

RafSchietekat
Black Belt

Please check again: "for( int i=r.begin(); i<r.end(); ++i )" (your version) -> "for( int i=r.begin(), end=i<r.end(); i!=end; ++i )" (my earlier mistake) -> "for( int i=r.begin(), end=r.end(); i!=end; ++i )" (what it should be). (You can keep "<" instead of "!=" if you prefer.)
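To spell out why the mistaken version misbehaves: in `int i=r.begin(), end=i<r.end();` the initializer of `end` is the comparison `i<r.end()`, so `end` receives a bool converted to int (0 or 1) instead of the range end. The two helper functions below are hypothetical and only mirror those declarations with plain ints.

```cpp
// Mirrors the mistaken declaration: `end` is initialized from the
// comparison, so it holds 1 when begin < range_end and 0 otherwise.
int mistaken_end(int begin, int range_end) {
    int i = begin, end = i < range_end;
    (void)i;  // i is only needed to mirror the original declaration
    return end;
}

// Mirrors the corrected declaration: `end` holds the real range end.
int hoisted_end(int begin, int range_end) {
    int i = begin, end = range_end;
    (void)i;
    return end;
}
```

For a subrange such as [500, 1000), `mistaken_end` yields 1, so a loop running from 500 with the condition `i != end` never meets its bound and walks past the arrays. That would be consistent with the crash reported earlier in the thread, though the exact assertion text is not explained by this sketch.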

Singh_Jasdeep
Beginner

Oops, sorry, I missed that. I have checked it now: it is correct, and it works. We have finally achieved a 2x speedup. Thanks :)
