Solved: How different was your

Shayan_Y_ · ‎11-10-2015

I am trying to speed up the code in serial via TBB, I have implemented the code via tbb:parallel_reduce. But the serial one has better performance.Maybe I have done mistake that you can help me to fix it :)

/*Serial version*/

#include <time.h>
#include <iomanip>
#include <stdlib.h>
#include <iostream>
#include <tbb/tbb.h>
#include <cmath>
#include <omp.h>
#include <vector>
#include "tbb/blocked_range.h"
#include <tbb/parallel_reduce.h>
#include <limits>
int const size=1000000000;

void calculate(){    
        
	std::vector<int> array;

        array.resize(size);
	
	srand(1);
        for (int i=0; i<size; i++)
        {
                array=rand()%1000;
        }

	int min=1000000;
	int min_index=-1;
	for(int i=0;i<size;i++)
	{
		if(array<min)
		{
			min=array;
			min_index=i;
		}
	}
	std::cout<<"Min="<<min<<"\n";
	std::cout<<"Index="<<min_index<<"\n";
}

void main()
{
	double timer=0.;    

	timer-=omp_get_wtime();//Start timer
    
	calculate();

	timer+=omp_get_wtime();//End timer
    
    
	std::cout.setf(std::ios::fixed);
	std::cout<<"\nTime:"<<std::setprecision(15)<<timer<<"\n";
	
}

Parallel version:

#include <time.h>
#include <iomanip>
#include <stdlib.h>
#include <iostream>
#include <tbb/tbb.h>
#include <cmath>
#include <omp.h>
#include <vector>
#include "tbb/blocked_range.h"
#include <tbb/parallel_reduce.h>
#include <limits>
int const size=1000000000;

class Min
{
public:
int min_value;
int min_index;
std::vector<int> &my_a;

void operator() ( const tbb::blocked_range<int>& r){

for(int i=r.begin();i!=r.end();i++)
{
       if(my_a<min_value){
                min_value=my_a;
               min_index=i;
    }
}
}
Min(std::vector<int> &a):my_a(a),min_value(100000),min_index(-1){
}

Min(Min &x,tbb::split):min_value(x.min_value),min_index(x.min_index),my_a(x.my_a){

}
void join(Min& y){
if(y.min_value<min_value)
min_value=y.min_value;
min_index=y.min_index;
}
};

void calculate(){    
        
std::vector<int> array;

        array.resize(size);

srand(1);
        for (int i=0; i<size; i++)
        {
                array=rand()%1000;
        }
/*
        for (int i=0; i<size; i++)
        {
                std::cout<<array<<"\n";
        }*/

Min a(array);
/*
        for (int i=0; i<size; i++)
        {
                std::cout<<a.my_a<<"\n";
        }
*/


tbb::task_scheduler_init init(tbb::task_scheduler_init::deferred);
        init.initialize(16);

tbb::parallel_reduce(tbb::blocked_range<int>(0, size, 100000),a);

std::cout<<"Min="<<a.min_value<<"\n";
std::cout<<"Index="<<a.min_index<<"\n";
}

int main(int argc, char const *argv[])
{



double timer=0.;      
timer-=omp_get_wtime();//Start timer
    
calculate();

timer+=omp_get_wtime();//End timer
    
    
std::cout.setf(std::ios::fixed);
std::cout<<"\nTime:"<<std::setprecision(15)<<timer<<"\n";
return 0;
}

Vladimir_P_1234567890 · ‎11-11-2015

As you can see from VTune profile most of the work is to fill the vector. So it is better to time just second "for" for single thread case and parallel_reduce for parallel case.

--Vladimir

View solution in original post

Vladimir_P_1234567890 · ‎11-11-2015

As you can see from VTune profile most of the work is to fill the vector. So it is better to time just second "for" for single thread case and parallel_reduce for parallel case.

--Vladimir

jimdempseyatthecove · ‎11-14-2015

A problem you are overlooking is your dummy work load is using rand(). rand has a critical section.

I suggest you rework your test program whereby you prefill a temp array with the rand()'s prior to your timed section. Then run your same code with the replacement of call to rand() with tempRand

Jim Dempsey

Shayan_Y_ · ‎11-15-2015

I have done the changes, But know I will face a new problem. My result is not the same as my serial code. Does the rand() function thread safe?

Vladimir_P_1234567890 · ‎11-15-2015

did you pass the same buffer to serial and parallel versions?

--Vladimir

xian_x_ · ‎11-27-2015

How different was your parallel result against the serial code? My guess is the minimum value would be the same and the index was different? rand() is called in serial with the same seed, so I'd expect the same input, which might have multiple minimum values, and the parallel code ended up printing out one minimum with different index. These were just my speculations...

jimdempseyatthecove · ‎11-28-2015

While rand() is thread safe, for multi-threaded purposes it requires that rand() internally use a critical section when updating the seed (per call). When multiple threads call rand(), the sequence of the random numbers returned is based on the order of and which thread obtains the mutex to the critical section. This is generally non-deterministic and this section is run in serial. Therefor, when making test code for checking performance of serial verses parallel, it is advisable to not use a thread serializing function such as rand(), unless rand() is expressly required by the actual code running your function. Then in that case, you are advised to use a multi-threaded performance oriented (non-critical section) version of a random number generator, or pre-create a pool of random numbers that your threads can select from in a manner that is thread-safe without use of critical section or interlocked instruction sequences.

Your version of your operator() to produce min has all threads writing to min_value and min_index. I suggest you look at the parallel_reduce examples and derive your min function from those examples. IOW each thread should produce a localized min, then up-propagate the result for reduction.

Jim Dempsey

Shayan_Y_ · ‎11-30-2015

Yes I have edited my code and now I get the correct result.

Shayan_Y_ · ‎11-30-2015

jimdempseyatthecove wrote:

While rand() is thread safe, for multi-threaded purposes it requires that rand() internally use a critical section when updating the seed (per call). When multiple threads call rand(), the sequence of the random numbers returned is based on the order of and which thread obtains the mutex to the critical section. This is generally non-deterministic and this section is run in serial. Therefor, when making test code for checking performance of serial verses parallel, it is advisable to not use a thread serializing function such as rand(), unless rand() is expressly required by the actual code running your function. Then in that case, you are advised to use a multi-threaded performance oriented (non-critical section) version of a random number generator, or pre-create a pool of random numbers that your threads can select from in a manner that is thread-safe without use of critical section or interlocked instruction sequences.

Your version of your operator() to produce min has all threads writing to min_value and min_index. I suggest you look at the parallel_reduce examples and derive your min function from those examples. IOW each thread should produce a localized min, then up-propagate the result for reduction.

Jim Dempsey

Yes, I completely forget that. The value of max or min in parallel code and serial is the same but the index is different( due to non deterministic nature of the code )

RafSchietekat · ‎11-30-2015

jimdempseyatthecove wrote:

rand() is thread safe

Maybe on Windows, but not in general.

jimdempseyatthecove · ‎12-03-2015

>>Maybe on Windows, but not in general.

When writing a multi-threaded application, one should link in a multi-threaded (thread-safe) library. When using a thread-safe library, one should be assure that the "standard" API are also thread-safe. One expects malloc for example to be thread-safe, same with math library functions rand(). This is not to say rand() has to be made thread-safe with locks. It just has to be thread-safe (for the multi-threaded library).

Jim Dempsey

RafSchietekat · ‎12-03-2015

Maybe on Windows, but not in general (as I already said).

(2015-12-04 Added) Apparently also on Linux (?), but then I would still at least segregate this with #ifdef/#else/#error. The most recent claim I found that FreeBSD and OS X do not implement thread-safe rand() was from 2011 (not sure, can't find it again), and the current man page in OS X still does not assure thread-safety; the designated thread-safe alternative is rand_r(). Besides, it's a rather poor source of random numbers and seems to be deprecated, with a wealth of alternatives to choose from.

(2015-12-05 Added) The claim from 2011 about FreeBSD and OS X mentioned above can be found here. It also claims that Linux acquires a mutex, but I'd rather see the signed contract on that, and the first thing Google just found for me about [linux rand] is a man page stating what looks like the opposite ("The function rand() is not reentrant or thread-safe, since it uses hidden state that is modified on each call."), although it muddles that by saying "In order to get reproducible behavior in a threaded application", which would also apply to an implementation protected by a mutex, not just to avoid, e.g., sometimes getting the same return value in different threads. Still, reason enough for me to stay away from rand() just for its thread-safety issues.

jimdempseyatthecove · ‎12-05-2015

Then someone might ask: What else is not thread-safe in a supposedly thread-safe library? Is there an official list? From your contradictory references it appears not.

Jim Dempsey

RafSchietekat · ‎12-05-2015

~~printf().~~ Documentation (still!) seems to be spotty, so I try not to make assumptions.

(Correction) Apparently printf() is or should be safe after all.

(Added) There are other functions as well with both non-reentrant and reentrant versions, although I haven't found any others yet where the non-reentrant version could be made thread-safe by just adding a mutex (with rand() you would only lose reproducibility and some performance, not functionality). Here's a list for ARM.

(Added) This list for POSIX, apparently last updated in 2013 (less than three years ago), seems to be what you were asking for. It contains rand(). But also system(), which has even less to lose than rand().

(Added) Some environments offer printf_unlocked() etc. for better performance.

(Added) And here's a higher-level overview, differentiating different causes.

Serial Vs. Parall