Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

New to Intel TBB - Question regarding performance.

dreamthtr
Beginner
Hi, I'm new to Intel TBB and am trying to get more performance out of my app.

My application runs a simulation of gravity between particles of different mass in a 2D field. The simplest way to calculate this is to take each particle and compute the forces exerted on it by every one of the other particles.

Anyway, I have a loop that used to look like this:

[cpp]for(int i = 0; i < CANTIDAD; i++){
    for(int j = 0; j < CANTIDAD; j++){
        if(i != j){
            oM[i].calculaPhys(&oM[j]);
            veces++;
        }
    }
}[/cpp]


I modified that code to use a parallel_for like this:

[cpp]class ApplyPhys {
    objectMass *const my_Obj;
public:
    ApplyPhys( objectMass obj[] ) : my_Obj(obj) {}
    void operator()( const blocked_range<size_t>& r ) const {
        objectMass *obj = my_Obj;
        for( size_t i = r.begin(); i != r.end(); i++ ){
            funcTest(&obj[i]);
        }
    }
};

void funcTest(objectMass *obj){
    for (int i = 0; i < CANTIDAD; i++){
        if(obj != &oM[i]){
            if(obj->mass != 0){
                veces++;               // shared counter, incremented from several threads at once
                obj->calculaPhys(&oM[i]);
            }
            if( fabs(obj->x - oM[i].x) < 0.5 && fabs(obj->y - oM[i].y) < 0.5 ){
                obj->mass += oM[i].mass;
                oM[i].mass = 0;        // writes another particle's mass: a data race under parallel_for
            }
        }
    }
}[/cpp]



[cpp]parallel_for(blocked_range<size_t>(0,CANTIDAD,1), ApplyPhys(oM), auto_partitioner());[/cpp]



Anyway, I'm getting worse performance using parallel_for than with the serial approach.

What could I do to increase performance?
24 Replies
matteocilk_com
Beginner
Quoting - Raf Schietekat
I would just use a parallel_for(blocked_range2d) without first folding the domain along the diagonal, speculating that it is cheaper to do the work twice than to suffer all the synchronisation barriers implied by the fancy algorithm shown above. Has that algorithm's performance been explored for different combinations of problem size, number of cores, amount of data/work per pair? Or is that superfluous and can reasoning alone decide? Should I be embarrassed for not seeing the light as well as jealous for not having discovered it myself, or can I breathe easy? :-)

Should my suspicions have any merit, I would expect the main concern to instead be how to recycle data between tasks in the same thread, i.e., perhaps just subdividing down the middle of the longest side in blocked_range2d might not be optimal in this case, maybe it should be a blocked_range on only one dimension anyway, with a carefully chosen strip width (based on cache size, not number of elements and cores, i.e., using a simple_partitioner), and evaluated by sweeping a perpendicular row/column of accumulators along its length.

So, am I mistaken, or not really, or really not? :-)

Raf,

You are making it more complicated than it is, I think. Of course there is synchronization overhead---the
way you minimize it is by stopping the recursion at a larger base case. For example, I have modified
rect() to recur only when (i > 32 && j > 32) instead of (i > 1 && j > 1) and used this more realistic
force:

#define N 30000

double force[N];
double pos[N][3];
double mass[N];

const double G = 6.673e-11; // gravitational constant

static inline double f(int i, int j)
{
    double dx = pos[i][0] - pos[j][0];
    double dy = pos[i][1] - pos[j][1];
    double dz = pos[i][2] - pos[j][2];
    double r2 = dx * dx + dy * dy + dz * dz;
    return G * (mass[i] * mass[j]) / r2;
}

static inline void basecase(int i, int j)
{
    if (i != j) {
        double t = f(i, j);
        force[i] += t;
        force[j] -= t;
    }
}

I then compared interact() against this simple sequential routine:

void sequential(void)
{
    for (int i = 0; i < N; ++i)
        for (int j = i; j < N; ++j)
            basecase(i, j);
}

For the given value of N = 30000, both sequential() and interact() take about 4.3 seconds on a machine built by an Intel competitor, using gcc/cilk++ -O3. interact() scales linearly up to 16 cores (the maximum I have at hand).

While I have not coded the same example using TBB, I have no doubt that a straightforward TBB implementation would solve the problem just as well. The TBB overheads are pretty low, and if they turned out to be too high, just make 32 bigger. A larger base case forgives many sins (at the expense of parallelism).

(Your concerns would probably be justified if you were to code this example using pthreads. I agree that life would suck in this case.)

RafSchietekat
Valued Contributor III
"interact() scales linearly up to 16 cores (the maximum I have at hand)."
That would seem like an acceptable outcome... sorry, I had missed any previous statement regarding scalability, and I am a bit surprised by this.
robert-reed
Valued Contributor II
About two months ago I suggested that Matteo's algorithm might also work using TBB parallel_invoke, a new parallel construct in the open source release. I've been working on an example and have some great results that I intend to share in a blog series I've started. My goal is to build a case study using the tools of Intel Parallel Studio in the development and discovery process.