Community
cancel
Showing results for 
Search instead for 
Did you mean: 
hoodibaba
Beginner
85 Views

TBB program taking almost same time as serial code

Hello All,

I I am very new to TBB. I start using the TBB for optimising my code performance. I am using parallel_reduce as I have to cumalative sum.

My problem is that the distance function takes almost same time when I do the parallel cumalative sum. its takes aroud 0.04 seconds to compute it. I guess with the TBB, its should reduce the processing time.


My serial code is like this:


float Matcher::Distance( float *out1, float *out2, int size ) {

float cumulativeDistance = 0.0;

for ( int i=0; i < size; i++ ) {
cumulativeDistance += (out1-out2)*(out1-out2);
}
return cumulativeDistance;
}

My parallel code version is :

#define GRAIN 10000000
#include
#include
#include
#include
using namespace tbb;
class Distancesum{

private:
float * my_out_1;
float * my_out_2;
public:
float cumulativeDistance;
void operator()(const blocked_range&r) {
float *out_1 =my_out_1;
float *out_2 =my_out_2;
float sum=cumulativeDistance;
const int end= r.end();

for (int i=r.begin(); i!= end; i++ ) {
sum += (my_out_1-my_out_2)*(my_out_1-my_out_2);
}
cumulativeDistance=sum;
}

Distancesum( Distancesum& x,split):my_out_1(x.my_out_1),my_out_2(x.my_out_2),cumulativeDistance(0) {}
void join( const Distancesum& y) { cumulativeDistance=y.cumulativeDistance;}
Distancesum(float *out1, float*out2):
my_out_1(out1),my_out_2(out2),cumulativeDistance(0)
{}
};

float Matcher::Distance( float *out1, float *out2, int size ) {

Distancesum ds(out1,out2);
parallel_reduce(blocked_range(0,size,GRAIN),ds);
return ds.cumulativeDistance;
}

_______________________________________________________________________________________

Ths function Distance is called from some other functions many times.

Few conclusions I make from this code are as follows:
1) When I use auto_partioner or affinity_partioner for automatic grainsize, it takes more time. When I use GRAIN size and no partitioner, it gives good result. I dont know why?!!!

2) I have gone through the post of Alexey Kukanov about Why a simple test can get parallel slowdown, and tried to make the sum as local variable but this doesn't help me to gain the optimized performance. it still takes same time as serial code

Am I doing something wrong? What are the facors which are hindering the perfomance in my code.

Thanks








0 Kudos
20 Replies
RafSchietekat
Black Belt
78 Views

auto_partitioner is the default nowadays, so it seems to come down to grainsize; perhaps somebody else has the time now to analyse this further. You might try an explicit simple_partitioner with and without grainsizes (without defaults to 1) and see how it compares. Please provide some values for size and compute times.

You should probably use += in the join(), though.

(Added) Also try more realistic grainsizes, like 10000 or so.

Andrey_Marochko
New Contributor III
78 Views

I guess auto_partitioner works well because in the beginning it unconditionally divides the iteration space into 4 * P chunks, where P is the current concurerncy level. That is it always creates at least some parallel slack.

In contrast simple partitioner strictly abides by the grain size specified. And in case of the huge value in the example it just do not divide the iteration space at all.
RafSchietekat
Black Belt
78 Views

Andrey, I think the reported problem was that auto_partitioner didn't work well.

Maybe the reduction could at the same time count the number of chunks it executes, to get an idea of the parallel overhead in various cases.
ARCH_R_Intel
Employee
78 Views

What is the typical value of parameter size to method Distance? I ask the inner loopin Distancesum::operator() performs two loads permultiple and add, and so cache effectscan trump parallelism.Is there a way the code can be restructured to reduce the memory bandwidth requirements? Can you say more about the algorithm in the higher levels of the call tree?

hoodibaba
Beginner
78 Views

Hello All,

The value of size is 200 but the function Distance() is called like 200 times.
Also when I use simple_partioner (), it takes almost thrice the time its taking now with specified grain size.

Is the value of size causing this problem?

ARCH_R_Intel
Employee
78 Views

If size is 200, then there is not enough work to amortize scheduling overheads for parallel execution. When the grainsize specified in the blocked_range exceeds size, there is no parallelism and thus low scheduling overhead.

In this sort of situation, I start looking up the call tree to find a higher position from which to parallelize. Can the 200 calls to Distance be done in parallel?
hoodibaba
Beginner
78 Views

Hello Arch,

Thanks for the reply. I changed the GRAINsize to 200 and it shows the same computation time as was showing with 100000.

I need to ask one more thing. How can I convert the following loop into parallel:

for (int i = displacementRow; indexRows < rowsGrid;i += displacementRow, indexRows++)

I am not able to figure it out.

Thanks

ARCH_R_Intel
Employee
78 Views

A blocked_range(begin,end,grainsize) is never split if end-begin<=grainsize. Hence in your example, if size=200 and GRAINSIZE=200, there will be no parallelism. As a practical matter, 200 is probably too small to amortize overheads anyway. Raf's suggestion of 10,000 for a grainsize would be my first guess of a suitable grain size.

When a loop has an indexform that is difficult to parallelize, I first think about how to rewrite it so that the for part looks like "for(k=begin; k

[cpp]if( indexRowsbody
    }
    indexRows=indexRows0+tripcount;    // Compute last value of indexRows
}
[/cpp]
I've been deliberately mechanical in my transformation to make the steps easier to follow. Simplifications based on circumstance are likely possible. Here is what I did:

  • Check if the loop executes and iterations at all. If not, skip the rest of the logic. You can skip this step if both the tripcount calculation is guaranteed to not overflow and you do not need the "last value" computation at the end.
  • Compute the number of times the loop body will execute ("tripcount").
  • Save the value of indexRows for the zeroth iteration of the loop. This value is used inside the loop body to construct the value that indexRows should have on the kth iteration.
  • Loop tripcount times.
  • Inside the loop body, construct the proper values of indexRows and i for the kth iteration. The reconstructions are based on the observation that the values in successive loop iterations form an arithmetic series.Notice that the reconstructed values are local to the loop body.
  • Execute the loop body.
  • After the loop executes, construct the last value that indexRows would have after the original loop executed. You can of course skip this step if the value is not used.
Given a loop in the revised form, it is easy to rewrite it using parallel_for. If the overhead of the index calculations iside the loopis an issue, use a form of parallel_for that takes a blocked_range, and manually optimize the serial for loop that runs over the subrange.

hoodibaba
Beginner
78 Views

Hello Arch,

Thanks for the explanation. I have few queries which are as follows:

1) What si the exact relation of Grain size and the parameter size. Correct me if I am wrong. I understand in a way that we need as many thread as the number of iteration. Am i right in thinking this way?

2) I did not understand the stepo 6 as why I need to calculate the value of i?

3) My original code seems like this:

int indexRows = 0;
for (int i = displacementRow;
indexRows < rowsGrid;
i += displacementRow, indexRows++) {

int indexCols = 0;
for (int j = displacementCol;
indexCols < colsGrid;
j += displacementCol, indexCols++) {

BODY

}
}
Value of DisplacementRow is constant and is 20 and Value of displacementCol is 19 and that is also constant.
Row grid is 11 and colsgrid is 9.

So now according to your change in code for making it look like code which can be used by parallel_for, it seems I need to change the algorithm for both loop.

What is your opinion and approch in such scenario?
ARCH_R_Intel
Employee
78 Views

When using TBB, it's usually best if the code ignores how many threads there are on the system. It's best to pick the grainsize solely based on providing enough work per grain to amortize parallel scheduling overhead.

I calculated i in line 6 because your original loop computed variable i. If i is not used by the loop body, you don't need to caculate it.

Your latest example has more context, so much can be simplified. Here is a sketch of how I would do it.

[cpp]#include "tbb/parallel_for.h"
#include "tbb/blocked_range2d.h"

class Functor {
public:
    void operator()( tbb::blocked_range2d r ) const {
        int rowsEnd = r.rows().end();
        int colsEnd = r.cols().end();
        for( int indexRows=r.rows().begin(); indexRows(0,rowsGrid,0,colsGrid), Functor() );
}[/cpp]

You probably need on the order of 10000 instructions to happen for a single execution of BODY to get good speedup. If BODY has fewer than that number of instructions, the parallel scheduling overheads may swamp the parallelism.

I lifted r.rows().end() and r.cols().end() into local temporary variables so that the compiler can better optimize the code.This step is not essential, but may improve code quality.As a rule of thumb, compilers are excellent at optimizing code involvinglocal variables whose address is not taken (that is not passedby reference or by address). In this case, putting those two variables into local variables helps the compiler do trip count calculations. For a loop with only 20 iterations, it may not be worth the trouble. I point it out only as a general piece of advice.

hoodibaba
Beginner
78 Views

Hello Arch,

Thanks for the very descriptive reply. I am now able to understand the concept. But in the code of yours, you are using displacementRow in Functor() class without defining it in the scope. Same with DisplacementCol; because in the function Example
(), we are not passing the values of DisplacementCol and DisplacementRow.

Also If I am not wrong, Can I call some functions like Foo1(i,j), Foo2(i,j) in the BODY?

Thanks,





ARCH_R_Intel
Employee
78 Views

If displacementRow and displacementCol are not at file scope, pass them via fields in Functor. Here is the previous example extended with this technique:
[cpp]class Functor {
    const int m_displacementRow; 
    const int m_displacementCol;
public:
    Functor( int displacementRow, int displacementCol ) :
        m_displacementRow(displacementRow),
        m_displacementCol(displacementCol)
    {
    }
    void operator()( tbb::blocked_range2d r ) const {
        int displacementRow = m_displacementRow;
        int displacementCol = m_displacementCol;
        int rowsEnd = r.rows().end();
        int colsEnd = r.cols().end();
        for( int indexRows=r.rows().begin(); indexRows(0,rowsGrid,0,colsGrid), Functor(displacementRow,displacementCol) );
}
[/cpp]
A variation is to use m_displacementRow and m_displacementCol directly inside operator() instead of loading them into local temporaries. In this example, use local temporaries might improve the quality of the code.

If your compiler supports C++1x lambda expressions (VS2010, Intel >=11.0 compiler, recent versions of gcc), you can use a lambda expression instead of the hand-coded functor, like this:
[cpp]void Example( int rowsGrid, int colsGrid, int displacementRow, int displacementCol ) {
    tbb::parallel_for( 
        tbb::blocked_range2d(0,rowsGrid,0,colsGrid), 
        [&](tbb::blocked_range2d r ) {
            int rowsEnd = r.rows().end();
            int colsEnd = r.cols().end();
            for( int indexRows=r.rows().begin(); indexRows
It's probably good to understand the explicit Functor form first before using lambda expressions. The Intel compiler documentation says more about lambda expressions. See also the Wikipedia entry on the subject.
hoodibaba
Beginner
78 Views

Hello Arch,

Thanks for the valuable suggestions.I updated the code but still you haven't answered my previous question. May be you overlooked it.

I asked that can I call different functions with almost same arguements in the BODY.

For eg.:
[bash]int indexRows = 0;
for (int i = displacementRow;
indexRows < rowsGrid;
i += displacementRow, indexRows++) {

int indexCols = 0;
for (int j = displacementCol;
indexCols < colsGrid;
j += displacementCol, indexCols++) {

float *a = something;
float distance_one= Distance1(a, i,j);
float distance_two= Distance2(a, i,j)
;
}
}[/bash]
I think it woould work but if I change this code to parallel like this:
[bash]class Functor {
const int m_displacementRow;
const int m_displacementCol;
public:
float distance_one;
float dstance_two;
float *my_a;
float minimum;

Functor( int displacementRow, int displacementCol ) :
m_displacementRow(displacementRow),
m_displacementCol(displacementCol)
{
}
void operator()( tbb::blocked_range2d r ) const {

int displacementRow = m_displacementRow;
int displacementCol = m_displacementCol;
int rowsEnd = r.rows().end();
int colsEnd = r.cols().end();
for( int indexRows=r.rows().begin(); indexRows int i = indexRows*displacementRow;
for( int indexCols=r.cols().begin(); indexCols int j = indexCols*displacementCol;
distance_one = Distance1(my_a,i,j);
distance_two= Distance2(my_a,i,j,8);
minimum = min(distance_one,distance_two)
}
}
}
};[/bash]

I am asking because my Distance1 and Distance2 function are calling the original Distancesum funtion which we tried to parallize earlier sum+=( out1-out2 ) * (out1-out2).

Both the Distance 1 and Distance 2 function involves calling of Distancesum function.

Also As I am new to TBB, I am still going throught the lambda functions. Their implimentations seems to be easy but I think firstly I should be able to use parallel_for in generic way.

hoodibaba
Beginner
78 Views

Hello Arch,

I am able to solve the problem and i converted the code to TBB code. But when I run my code, I see strange behaviour that when the program executes, it only uses one CPU and not both the CPU which it should use.

The snapshot of the system monitois as follows:



here in this graph,
My execution starte from 30 and lasts up to 0 on horizontal axis. Here you can see from the graph that when one CPU executes ,other one executes margially like 20 % or so...Cant I utilize both CPU so that both CPU executes by almost 80-90 %?

Can you please tell If I am forgetting to set the parameter or something else?

Thanks


hoodibaba
Beginner
78 Views

Here is the graph when I run my code in Serial mode.



In this graph, its almost same as that of parallel execution by TBB. One surprising thing in this graph is that in some case CPU 2 or CPU1 reaches 100% but in case of parallel execution, never CPU usage went to reach 100%.

How can I improve on it or if anyone can tell what can be wrong if my strategy?



ARCH_R_Intel
Employee
78 Views

Would it be possible for you to publish the complete code (see Add Files button in the response editor) here. We're overlooking something, and it would be easier to find the problem with the complete piece of code.

- Arch
hoodibaba
Beginner
78 Views

Part.cpp

Hello Arch,

Happy new Year!!!

Please find the link for the code. I cant attach the full code but the code which I put here is part of the module and onnly this part eeds to be spped up. So other part of the module is irrevalenet to be be pasted here.

I tried a lot ..but still now I have no clue why I am not able to get sppedup.

Let me know if you need any other information.



RafSchietekat
Black Belt
78 Views

"mutable GridMatcher & GM;"
Isn't that ill-formed?

And what are LocalDistance and GlobalDistance doing as mutable member variables? They should be local to the block that initially sets them inside the operator, and you can even declare them const.

And what is getDescriptor doing creating objects on the heap in an inner loop?

I have no idea what those static affinity_partitioner objects are meant to accomplish, and I even wouldn't be surprised if they're slowing things down.

"void join( const Distancesum& y) { cumulativeDistance=y.cumulativeDistance;}" won't work unless it's changed to "+=".
ARCH_R_Intel
Employee
78 Views

The mutable members that should be local variables not only cause a correctness problem, they also create severe cache line thrashing.

In addition to Raf's comments, I'll add that the line:

[cpp]parallel_for( blocked_range2d(0,rowsGrid,GRAIN_CS,0,colsGrid,GRAIN_CS), Functor(displacementRow,displacementCol, (*this)),ap);[/cpp]
specifies that the Cartesian rectange [0,rowsGrid)x[0,colsGrid] should be divided into chunks that fit in squares of size GRAIN_CSxGRAIN_CS. Given that GRAIN_CS is defined to be 100000, there is only going to be one chunk unless a dimension of the grid is greater than 105. That seems huge to me. You might try leaving out the grain size specification completely and see what happens. The default auto_partitioner logic is fairly good at dishing out chunks of appropriate size. [We changed the default partitioner from simple_partitioner to auto_partitioner since the TBB book was written.]

To estimate a good chunk size, or at least a starting point for experiments with the chunk size, make a ballpark estimate (say nearest power of ten) of the number of instructions it takes to execute one iteration of the inner loop:
[cpp]int j = (indexCols+1)*displacementCol;
float * out1 = GM.image1.getDescriptor(i - 1, j - 1, 0); // Get the descriptor of the image1.
float GlobalDistance = GM.ComputeOnPoint(out1, i - 1, j - 1);
float LocalDistance = GM.RecurseOnPoint(out1, 8, i - 1, j - 1);
GM.SimilarityMask[indexRows][indexCols] = min(LocalDistance, GlobalDistance);
delete[] out1;[/cpp]

Suppose your estimate is 1000 instructions. Since we want about 100,000 instructions per chunk, we need about a 10x10 subgrid. So set CS_GRAIN=10. Note that CS_GRAIN=1000 is the highest plausible grain size, because that would result in 1000x1000 subgrids. Assuming it takes at least one instruction per grid point, that's >=1000,000 instructions, plenty big for a chunk.

The GRAIN for the Distancesum looks about right, though it's probably worth trying the parallel_reduce without an explicit grain size. It may be tough to get any speedup out of Distancesum because of memory bandwidth and chunksize issues. Specifically, the compiled code for "(out_1-out_2)*(out_1-out_2)" must perform two loads and two arithmetic operations. Depending upon the size of the iteration space, one of three things will happen:
  • The data does not fit in outer level cache. A single core might consume all available memory bandwidth.
  • The data fits in inner level cache. Alas given a 32K L1 cache, that's an iteration space of 2K at best, which is not a big enough chunk size to amortize scheduling overheads. Besides, the data is unlikely to reside in the right L1 cache anyway.
  • The data fits in a shared outer level cache. You might get some speedup in this case.
In general, for a calculation like this, it's faster to consume the values as they are produced, before they can leave L1. Of course that often requires breaking clean boundaries in the code.
hoodibaba
Beginner
8 Views

Hello All,

Thansk Raf for pointing out the mistake in Distancesum and Thanks Arch for the concepts. This discussion is providing me with more and more knowledge.

@Raf..yes the getDescriptor is creating an array with predefined algorithm. Its all together different part and its just get some values from some other place where they are calculated.

Also I have defined Global Distance and Local DIstance as local variables. But still no effect on speedup.

@Arch

I used mutable because I want to make changes to the constant members which belong to Gridmatcher class.Also you said that Distancesum may not give up sppedup because of bandwidth and caches issues. Can you suggest some solution to this issue?

@All

While I am figuring out why I am not getting the speedup, I tried to debug the code and found a surprise in DistanceSum code
The code is as foolows:
[bash]class Distancesum {

private:
float* my_out_1;
float* my_out_2;
public:
float cumulativeDistance;
void operator()(const blocked_range& r) {
float *out_1 = my_out_1;
float *out_2 = my_out_2;
float sum=cumulativeDistance;
const int end= r.end();

for (int i=r.begin(); i!= end; ++i )
sum += (out_1-out_2)*(out_1-out_2); // Parallelize using paralle_reduce
cumulativeDistance = sum; // Assign back the local variable to original variable
}

Distancesum(Distancesum& x, split) : my_out_1(x.my_out_1),my_out_2(x.my_out_2), cumulativeDistance(0) {

} // Assign 0.0 to cumulative Distance
void join(const Distancesum& y) {
cumulativeDistance +=y.cumulativeDistance;
} // Assign back to the original variable
Distancesum(float out_1[], float out_2[]):
my_out_1(out_1), my_out_2(out_2), cumulativeDistance(0)
{}
};


float GridMatcher::Distance( float *out1, float *out2, int size ) {

Distancesum ds(out1,out2);
parallel_reduce(blocked_range(0,size,GRAIN),ds);
return ds.cumulativeDistance;
}
[/bash]
When I do the debugging and place a breakpoint at line marked by **** , my breakpoint never gets hit. I want to know whats happening at deeper level in join. If I put the breakpoint inside Operaotr(), its gets hit and working fine.


[bash] void join(const Distancesum& y) {
**** cumulativeDistance +=y.cumulativeDistance;
} [/bash]
And even If I rewrite the defination of join as
[bash] void join(const Distancesum& y) {} [/bash]
My code still works fine.

SO I am not able to figure out how join is working.

Can anyone see into my code and tell me if I have done somethign wrong?
Thanks

Reply