Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

parallel_reduce strange behavior

ahmedg
Beginner
Hello everyone,

I was testing some very basic examples from the Intel TBB book, and the behavior is very strange.

I ran the following code (which just does a parallel multiply, a sum, and a minimum).

The problems:
---------------
1) On my laptop (Core 2 Duo, 2.4 GHz), the summing parallel_reduce is faster when I construct the scheduler with task_scheduler_init::deferred and call init.initialize(1) than with the automatic initialization, which uses both cores!
2) I ran the same code on my PC (Xeon, 8 cores, 3.0 GHz):
a) the summing result is different!
b) on 8 cores it is much slower than on 2 cores, whether with automatic, simple, or manual grain size!
Laptop -> Linux Ubuntu, 32-bit, TBB 2.2
PC -> Fedora, 64-bit, TBB 2.1
How come?
Many thanks for your help.
#include "tbb/task_scheduler_init.h"
#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"
#include "tbb/parallel_reduce.h"
#include <iostream>
#include <cmath>
using namespace std;
using namespace tbb;
const int dim = 100000000;
void Foo( double *x , double *y , double *z );
void Foo( double *x , double *y , double *z )
{
    *z = *x * *y * sqrt(*x) * sqrt(*y) * log10(*x * *y) / 3 ; // any function
}
class VectorMultiply
{
private:
    double *const a , *const b , *const c ;
public:
    VectorMultiply( double *x , double *y , double *z ) : a(x) , b(y) , c(z)
    {}
    void operator()( const blocked_range<size_t>& r ) const
    {
        double *m , *n , *l ;
        m = a ; n = b ; l = c ;
        for( size_t i=r.begin(); i!=r.end(); ++i )
        {
            // l[i] = m[i] * n[i] * sqrt(m[i]) * sqrt(n[i]) * log10(m[i]*n[i]) * m[i] * n[i] * sqrt(m[i]) * sqrt(n[i]) * log10(m[i]*n[i]) / 3 ;
            // Foo( &a[i] , &b[i] , &c[i] ) ;
            c[i] = a[i] + b[i] ;
            // c[i] = a[i] * b[i] * sqrt(a[i]) * sqrt(b[i]) * log10(a[i] * b[i]) / 3 ;
            // c[i] = a[i] * b[i] * sqrt(a[i]) * sqrt(b[i]) * log10(a[i]*b[i]) * a[i] * b[i] * sqrt(a[i]) * sqrt(b[i]) * log10(a[i]*b[i]) / 3 ;
        }
    }
};
class FindSum
{
private:
    double * my_a ;
public:
    double sum ;
    FindSum() : my_a(NULL) , sum(0)
    {}
    FindSum( double * x ) : my_a(x) , sum(0)
    {}
    void operator()( const blocked_range<size_t>& r )
    {
        for ( size_t i = 0 ; i != r.end() ; i++ )
            sum += my_a[i] ;
    }
    FindSum( FindSum& x , split ) : my_a(x.my_a) , sum(0)
    {}
    void join( const FindSum& y )
    {
        sum += y.sum ;
    }
    double result()
    {
        return sum ;
    }
    ~FindSum()
    {}
};
class FindMin
{
private:
    const double * const my_a ;
public:
    double value_of_min ;
    long index ;
    FindMin( const double * a ) : my_a(a) , value_of_min(100000000) , index(-1)
    {}
    FindMin( FindMin& y , split) : my_a(y.my_a) , value_of_min(100000000) , index(-1)
    {}
    void join( const FindMin& y )
    {
        if ( y.value_of_min < value_of_min )
        {
            value_of_min = y.value_of_min ;
            index = y.index ;
        }
    }
    void operator()( const blocked_range<size_t>& range )
    {
        const double *a = my_a ;
        double value;
        for ( size_t i = range.begin() ; i != range.end() ; i++ )
        {
            value = a[i] ;
            if ( value < value_of_min )
            {
                value_of_min = value ;
                index = i ;
            }
        }
    }
    double result()
    {
        return value_of_min ;
    }
};
class Vector
{
private:
    double *a , *b , *c ;
    double min ;
public:
    Vector()
    {
        a = new double [dim] ;
        b = new double [dim] ;
        c = new double [dim] ;
    }
    Vector( double *x , double *y , double *z ) : a(x) , b(y) , c(z)
    {}
    void Fooo()
    {
        for ( int i = 0 ; i < dim ; i++ )
            a[i] = i ;
        for ( int i = 0 ; i < dim ; i++ )
            b[i] = i ;
    }
    void multiply()
    {
        parallel_for( blocked_range<size_t>(0,dim,1000) , VectorMultiply(a,b,c) ) ;
        // parallel_for( blocked_range<size_t>(0,dim) , VectorMultiply(a,b,c) , auto_partitioner() ) ;
    }
    void sum()
    {
        cout << endl << "Start summing" << endl ;
        FindSum temp(c) ;
        parallel_reduce( blocked_range<size_t>(0,dim,1000) , temp ) ;
        cout << endl << "Sum -> " << temp.result() ;
    }
    void minimum()
    {
        cout << endl << "Start searching for the minimum value" << endl ;
        FindMin mini(c) ;
        parallel_reduce( blocked_range<size_t>(0,dim,1000) , mini ) ;
        cout << endl << "Min -> " << mini.result() ;
    }
    void print()
    {
        for ( int i = 0 ; i < 20 ; i++ )
            cout << c[i] << "," ;
    }
};
int main()
{
    task_scheduler_init init ; //(task_scheduler_init::deferred) ;
    // init.initialize(1) ;
    Vector a ;
    a.Fooo() ;
    a.multiply() ;
    a.print() ;
    a.sum() ;
    a.minimum() ;
    return 0 ;
}
ARCH_R_Intel
Employee
Tip for this forum: before pasting code, click on the "yellow highlighter" icon. It will bring up a window that you can paste the code into and not lose indentation. Don't forget to select "C++" before pasting the code.

The problem in the code is a tiny error in class FindSum: the loop in operator() starts at 0 instead of at r.begin(). Here is the correction:

    for ( size_t i = r.begin() ; i != r.end() ; i++ )
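
To see why starting at 0 matters, here is a minimal serial sketch that needs no TBB. `Range`, `sum_from_zero`, `sum_subrange`, and `reduce_in_chunks` are hypothetical names made up for this illustration: `Range` stands in for `blocked_range<size_t>`, and `reduce_in_chunks` only imitates the way a reduction applies the body to subranges and then joins the partial sums. It is not how the TBB scheduler actually splits work.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for tbb::blocked_range<size_t>: the half-open range [first, last).
struct Range {
    std::size_t first, last;
    std::size_t begin() const { return first; }
    std::size_t end() const { return last; }
};

// Loop as written in the question: every subrange is summed starting from index 0.
double sum_from_zero(const std::vector<double>& a, const Range& r) {
    double s = 0;
    for (std::size_t i = 0; i != r.end(); ++i)
        s += a[i];
    return s;
}

// Corrected loop: only the elements that belong to the subrange are summed.
double sum_subrange(const std::vector<double>& a, const Range& r) {
    double s = 0;
    for (std::size_t i = r.begin(); i != r.end(); ++i)
        s += a[i];
    return s;
}

// Serial imitation of a reduction: cut [0, n) into chunks, apply the body
// to each chunk, and add up the partial results (the "join" step).
template <typename Body>
double reduce_in_chunks(const std::vector<double>& a, std::size_t chunk, Body body) {
    double total = 0;
    for (std::size_t lo = 0; lo < a.size(); lo += chunk) {
        std::size_t hi = std::min(lo + chunk, a.size());
        total += body(a, Range{lo, hi});
    }
    return total;
}
```

For the data {1, 2, 3, 4} and a chunk size of 2, `reduce_in_chunks(a, 2, sum_subrange)` gives 10, while `reduce_in_chunks(a, 2, sum_from_zero)` gives 13, because the second chunk adds elements 0 and 1 again. With a chunk size of 1 the faulty loop gives 20. So with the faulty loop the result depends on how the range happens to be split, and every subrange repeats work that other subranges already did, which would fit both the different sums and the slowdown on more cores.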

By the way, a little more performance might be obtained by using a local temporary variable to accumulate the sum, like this:

[cpp]    double tmp = sum;
    for ( size_t i = r.begin(); i != r.end() ; i++ )
        tmp += my_a[i];
    sum = tmp;[/cpp]

The reason it might help is that sometimes a compiler cannot tell that my_a and this->sum are not aliases for the same location, and thus has to be conservative about optimization. By accumulating the sum in a non-address-taken local temporary, you make clear that the location being updated inside the loop is not aliased to my_a.
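
Put together, a reduction body using the local accumulator could look roughly like the sketch below. `FindSumLocal` and `Range` are hypothetical names for illustration only. `operator()` is written as a template so the sketch compiles without TBB, and the splitting constructor that `parallel_reduce` requires is left out for the same reason (it would be added exactly as in the original FindSum).

```cpp
#include <cstddef>

// Hypothetical stand-in for tbb::blocked_range<size_t>, used only so that
// this sketch can be compiled and tried without TBB installed.
struct Range {
    std::size_t first, last;
    std::size_t begin() const { return first; }
    std::size_t end() const { return last; }
};

class FindSumLocal {
    const double* my_a;
public:
    double sum;

    explicit FindSumLocal(const double* a) : my_a(a), sum(0) {}

    // Accepts any range type that has begin() and end().
    template <typename R>
    void operator()(const R& r) {
        // Start from the current sum: the same body object may be applied
        // to several subranges before it is joined.
        double tmp = sum;
        for (std::size_t i = r.begin(); i != r.end(); ++i)
            tmp += my_a[i];   // updates a local, which cannot alias my_a
        sum = tmp;            // write back once, after the loop
    }

    // Merge the partial sum of another body into this one.
    void join(const FindSumLocal& y) { sum += y.sum; }
};
```

Initializing `tmp` from `sum` rather than from 0 is what keeps the result correct when one body object processes more than one subrange. Whether the local temporary actually speeds things up depends on the compiler, so it is worth measuring.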
