Getting started

Shayan_Y_ · 09-02-2015

Hi

I am a newbie with Linux (I have Ubuntu 14.04) and TBB. I want to write code with the TBB library and run it on my computer, but I have no idea how to start.

Can anyone help me with a step-by-step tutorial or something similar?

Vladimir_P_1234567890 · 09-02-2015

Tutorial

https://www.threadingbuildingblocks.org/intel-tbb-tutorial

--Vladimir

Shayan_Y_ · 09-02-2015

Thanks.

My first problem: how can I be sure that I have TBB on my system? I have installed Parallel Studio XE 2016, and from its directory I ran this command in my terminal: "source iccvars.sh intel64".

My first example is

#include <tbb/tbb.h>
#include <cstdlib>   // srand, rand
#include <ctime>     // time
#include <iostream>
#include <vector>
#include <omp.h>     // omp_get_wtime; needs OpenMP enabled when compiling and linking

using namespace std;
using namespace tbb;

// Body for parallel_for: averages each element with its two neighbours.
class Average {
public:
	float* input;
	float* output;
	void operator( )( const blocked_range<int>& range ) const {
		for( int i=range.begin(); i!=range.end( ); ++i )
		{
			output[i] = (input[i-1]+input[i]+input[i+1])/3.0f;
		}
	}
};

void ParallelAverage( float* output, float* input, size_t n ) {
	Average avg;
	avg.input = input;
	avg.output = output;
	// The two end points have only one neighbour each.
	output[0]=(input[0]+input[1])/2.0f;
	parallel_for( blocked_range<int>( 1, n-1, 100 ), avg );
	output[n-1]=(input[n-1]+input[n-2])/2.0f;
}

void SerialAverage( float* output, float* input, size_t n ) {
	output[0] = (input[0]+input[1])/2.0f;
	for( size_t i=1; i<n-1; ++i )
	{
		output[i] = (input[i-1]+input[i]+input[i+1])/3.0f;
	}
	output[n-1] = (input[n-1]+input[n-2])/2.0f;
}

int main()
{
	task_scheduler_init init;
	srand ( time(NULL) );
	const size_t length=1000000;
	// Heap-allocated: three arrays of 1,000,000 floats can overflow the default stack.
	vector<float> input(length), output(length), temp(length);
	double serial_timer = 0.;
	double compute_timer = 0.;
	for(size_t i=0;i<length;i++)
		input[i]=rand()%100;

	compute_timer -= omp_get_wtime();
	ParallelAverage(output.data(),input.data(),length);
	compute_timer += omp_get_wtime();
	cout << "Calculation time in parallel = " << compute_timer<<endl;

	serial_timer -= omp_get_wtime();
	SerialAverage(temp.data(),input.data(),length);
	serial_timer += omp_get_wtime();
	cout<<"Calculation time in serial="<< serial_timer<<endl;

	return 0;
}

But averaging each item with its previous and next neighbours is much faster in the serial loop than in the parallel one.

Shayan_Y_ · 09-02-2015

(Removed second copy)

RafSchietekat · 09-02-2015

(Please remove the second copy of your posting. Well, edit it down to "(Removed second copy)" or something.)

TBB has to get out of the starting blocks first (even task-based toolkits have to start up some worker threads), so your timings will be useless in such a small benchmark. Try a few ParallelAverage executions, and only then start the timer. Tip: decrease variance by averaging out several executions.

What's the difference in timing, if any, if you "hoist" end() out of the loop: "for( int i=range.begin(), i_end=range.end(); i!=i_end; ++i )"? This might help the compiler unroll or even auto-vectorise the loop, which can make a huge difference. Alternatively, try accepting the blocked_range by value: maybe the compiler can then figure it all out by itself. Please give times for the original, both variants, and perhaps the combination: supposedly you should have one "slow" execution and 2 or 3 that are close to each other and a lot faster.

Also, you should probably use a far larger grainsize for such a simple Body, especially if you have a lot of hardware parallelism on your machine (which makes the auto_partitioner pick a fairly small average effective grainsize). Please also provide timings for what you find here.

Then there's telling the compiler to vectorise, using "#pragma simd" (instead of merely hoping that it might). Same request about timings...