Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Error in a simple program

sviiya
Beginner
1,004 Views
Hello,
I am trying to convert simple recursive programs to TBB and am stuck with merge sort. The idea is to split the array into halves and then merge them serially. I cannot see where I am going wrong; probably I missed some control flow. Can anyone spot the bug or conceptual mistake, please?

16 Replies
Dmitry_Vyukov
Valued Contributor I
Quoting - sviiya
Hello,
I am trying to convert simple recursive programs to TBB and am stuck with merge sort. The idea is to split the array into halves and then merge them serially. I cannot see where I am going wrong; probably I missed some control flow. Can anyone spot the bug or conceptual mistake, please?


The link is broken.

vu64
Beginner

parallel_for( blocked_range<int>( x, y, (y-x)/2 ), merge_sort( ob ) );

When using task_scheduler_init to initialize only one thread for debugging, I see something suspicious about the ranges passed to the body object. Perhaps rewriting this line with parallel_invoke would make the code a bit clearer.
Debugging for errors is a little tedious; I will have to spend more time to spot the error in your code.

sviiya
Beginner
@Dmitry Vyukov: The link works now; I had fixed another bug earlier.


Quoting - vu64

parallel_for( blocked_range<int>( x, y, (y-x)/2 ), merge_sort( ob ) );

When using task_scheduler_init to initialize only one thread for debugging, I see something suspicious about the ranges passed to the body object. Perhaps rewriting this line with parallel_invoke would make the code a bit clearer.
Debugging for errors is a little tedious; I will have to spend more time to spot the error in your code.


Thanks for the suggestion of using a single thread, it didn't occur to me. :)

I tried using parallel_invoke, but the problem is that parallel_invoke takes no arguments. Even lambdas won't work in a simple fashion, as local variables can't be passed. One suggestion I saw was to use 'bind' from the Boost library, but that complicates the task even further, and I am looking for a simple, generic solution.

The initial idea was to use parallel_reduce, but the problem there is that the operator overload comes into the picture only once the grain size is reached (or so it looked), which leaves out the merging of larger segments. From the O'Reilly book on TBB, which presents a parallel_for-based quicksort, I seriously considered the same idea of rearranging elements while splitting, with 'join' perhaps used to merge the subsequences. Again, from a simplicity point of view, even that 'simple' quicksort is too complicated for me to understand, with all those templates and random iterators. In other words, even if I implemented it, most beginners would not understand the solution!

Hence I thought that using parallel_for to process the two halves concurrently, just as a typical textbook recursive algorithm does, would be straightforward to understand and implement.

I believe the ranges are being processed concurrently. Even having taken into account that the ranges are half-open intervals, I am stumped by this bug. :(

vu64
Beginner

I found another error. When you merge the two sorted parts, you declare
int j = mid;
but above, mid is (r.end() - r.begin())/2,
while j should be r.begin() + (r.end() - r.begin())/2.

sviiya
Beginner
Quoting - vu64

I found another error. When you merge the two sorted parts, you declare
int j = mid;
but above, mid is (r.end() - r.begin())/2,
while j should be r.begin() + (r.end() - r.begin())/2.


Thanks vu64.

I found a version with which I can use merge sort if the input size is 2^n. The program is attached. This innocuous program does not work if the input size is not 2^n (e.g. 129). Weird!


vu64
Beginner

I still think using parallel_invoke is clearer. It's like using cilk_spawn in Cilk++. To pass functions with state to parallel_invoke, you can use function objects.
sviiya
Beginner
Quoting - vu64

I still think using parallel_invoke is clearer. It's like using cilk_spawn in Cilk++. To pass functions with state to parallel_invoke, you can use function objects.

Thanks vu64. I have been able to use functors easily, and it works for all input sizes with practically no change. (The program is given below.) But I am unable to fathom the 'bug' in the original version. Using functors made the code cleaner, though I suspect it adds some overhead (which is not an issue for this example).

parallel_for is acting in some unexpected (yet consistent) way that I am unable to understand. :(

Thanks again for the advice and the discussion,
-S

[cpp]#include <iostream>
#include <vector>
#include <tbb/tbb.h>

#define N  9999999

using namespace std;
using namespace tbb;

void merge(int beg, int mid, int end, int *A) {
    vector<int> tmp;
    int i = beg;
    int j = mid;

    while ( ( i < mid ) && ( j < end ) ) {
        if ( A[i] < A[j] ) {
            tmp.push_back( A[i] );
            i++;
        } else {
            tmp.push_back( A[j] );
            j++;
        }
    }

    while ( i < mid ) {
        tmp.push_back( A[i] );
        i++;
    }

    while ( j < end ) {
        tmp.push_back( A[j] );
        j++;
    }

    for ( int t = 0; t < (int) tmp.size(); t++ ) {
        A[ beg + t ] = tmp[t];
    }
}

void par_ms(int beg, int end, int *A);

class Functor {
    public:
        int beg;
        int end;
        int *A;

        Functor() {}
        Functor(int _b, int _e, int _A[]) : beg(_b), end(_e), A(_A) {}

        void operator()() const   // const: parallel_invoke takes the functors by const reference
        {
            par_ms(beg, end, A);
        }
};


void par_ms(int beg, int end, int *A)
{
    if ( beg + 1 == end ) {
        return;
    }

    int mid = beg + (end - beg)/2;

    Functor frst_half(beg, mid, A);
    Functor secd_half(mid, end, A);

    parallel_invoke( frst_half, secd_half );

    merge( beg, mid, end, A);

    return;
}

int main()
{
    task_scheduler_init init(-1);

    static int A[N];   // static: N ints would overflow the stack

    for ( int i = 0; i < N; i++ ) {
        A[i] = N - i;
    }

    par_ms(0, N, A);

    for ( int i = 0; i < 10; i++ ) {
        cout << i << " " << A[i] << endl;
    }
    for ( int i = N-10; i < N; i++ ) {
        cout << i << " " << A[i] << endl;
    }

    return 0;
}
[/cpp]

sviiya
Beginner
Finally I got to the heart of the matter.

I presumed parallel_for would divide a given interval into halves with a grainsize of (y-x)/2. E.g., [10,20) would become [10,15) + [15,20).

Then I was given a clue for the case N=3, where the grainsize is 1 (= 3/2, integer division). This gives 3 intervals, while merge works on 2 intervals. So I thought the problem could be fixed by addressing this anomaly of grainsize 1. But that was not to be.

Consider this case:
[0,15) should be split into [0,7) + [7,14) + [14,15) (which is also wrong, but could be fixed with a grainsize of 8),
but what I actually see is: [0,7) + [7,11) + [11,15).

Agreed, parallel_for is meant for embarrassingly parallel problems, for which it makes no difference how the ranges are split. My perception of parallel_for was that with a grainsize of k, an interval [0,N) would be divided as [0,k) + [k,2k) + ... + [m*k, N), where the last interval could hold fewer than k values. But that does not seem to be what happens. I believe I read it to be so somewhere, but I cannot recall the exact source.

Using functors actually solved the problem, as the range was then definitely divided into only two subranges.

The code used for debugging is given below:

[cpp]#include <iostream>
#include <vector>
#include <tbb/tbb.h>

#define N 15

using namespace std;
using namespace tbb;

bool done = true;

void merge(int beg, int mid, int end, int *A) {
    vector<int> tmp;
    int i = beg;
    int j = mid;

    while ( ( i < mid ) && ( j < end ) ) {
        if ( A[i] < A[j] ) {
            tmp.push_back( A[i] );
            i++;
        } else {
            tmp.push_back( A[j] );
            j++;
        }
    }

    while ( i < mid ) {
        tmp.push_back( A[i] );
        i++;
    }

    while ( j < end ) {
        tmp.push_back( A[j] );
        j++;
    }

    for ( int t = 0; t < (int) tmp.size(); t++ ) {
        A[ beg + t ] = tmp[t];
    }
}

bool par_ms(int beg, int end, int *A);

class merge_sort {
    private:
        int *A;

    public:
        merge_sort( int _a[] ) : A( _a ) {}

        void operator()( const blocked_range<int> &r ) const {
            par_ms( r.begin(), r.end(), A );
        }
};

bool par_ms(int beg, int end, int *A)
{
    cout << "PAR_MS: " << beg << " " << end << endl;
    if ( beg + 1 == end ) {
        return true;
    }

    merge_sort ob(A);

    int half = (end - beg)/2;

    if ( done ) {
        done = false;
        parallel_for( blocked_range<int>( beg, end, half ), merge_sort( ob ) );
    }

    merge( beg, beg + half, end, A);

    return true;
}

int main()
{
    task_scheduler_init init(-1);

    int A[N];

    for ( int i = 0; i < N; i++ ) {
        A[i] = N - i;
        cout << i << " " << A[i] << endl;
    }
    cout << endl;

    par_ms(0, N, A);

    cout << endl;
    for ( int i = 0; i < N; i++ ) {
        cout << i << " " << A[i] << endl;
    }

    return 0;
}
[/cpp]

vu64
Beginner
The function-object syntax is a little verbose. If you use a compiler which supports lambdas (Intel C++, Visual C++ 2010; gcc 4.5 seems to as well, though I don't know its exact status), the code will be much shorter. As for performance, function objects are often inlined, and the overhead is insignificant.

sviiya
Beginner
Quoting - vu64
The function-object syntax is a little verbose. If you use a compiler which supports lambdas (Intel C++, Visual C++ 2010; gcc 4.5 seems to as well, though I don't know its exact status), the code will be much shorter. As for performance, function objects are often inlined, and the overhead is insignificant.


I have tried using parallel_invoke with lambdas (with which I am not very familiar) but am having problems. To illustrate, the following code does not compile:

[cpp]auto f = [](int x, int y) -> void { cout << x + y << endl; };

f(12,24);

parallel_invoke(
f(20, 64),
f(10, 12)
);
[/cpp]

This is despite the fact that the return type is explicitly declared void. The statement "f(12,24);", however, gives the correct output when the parallel_invoke statement is removed.

The following variant also fails for the same reason when uncommented:

[cpp][](int x, int y) -> void { cout << x + y << endl; } (10,20);

/*
parallel_invoke(
[](int x, int y) -> void { cout << x + y << endl; } (10,20),
[](int x, int y) -> void { cout << x + y << endl; } (10,20)
);
*/
[/cpp]

What could be going wrong?

sviiya
Beginner
Quoting - sviiya
The following variant also fails for the same reason when uncommented:

[cpp][](int x, int y) -> void { cout << x + y << endl; } (10,20);

/*
parallel_invoke(
[](int x, int y) -> void { cout << x + y << endl; } (10,20),
[](int x, int y) -> void { cout << x + y << endl; } (10,20)
);
*/
[/cpp]

What could be going wrong?


OK, I could not figure out what was wrong with the above snippet, but I found a better alternative. Now the code looks perfect! :D

[cpp]parallel_invoke(
    [&](){ par_ms(beg, mid, A); },
    [&](){ par_ms(mid, end, A); }
);
[/cpp]

sviiya
Beginner
The code, now in its simplest form (or can it be simplified further?), stands as:
[cpp]#include <iostream>
#include <vector>
#include <tbb/tbb.h>

#define N  9999999

using namespace std;
using namespace tbb;

void merge(int beg, int mid, int end, int *A) {
    vector<int> tmp;
    int i = beg;
    int j = mid;

    while ( ( i < mid ) && ( j < end ) ) {
        if ( A[i] < A[j] ) {
            tmp.push_back( A[i] );
            i++;
        } else {
            tmp.push_back( A[j] );
            j++;
        }
    }

    while ( i < mid ) {
        tmp.push_back( A[i] );
        i++;
    }

    while ( j < end ) {
        tmp.push_back( A[j] );
        j++;
    }

    for ( int t = 0; t < (int) tmp.size(); t++ ) {
        A[ beg + t ] = tmp[t];
    }
}

void par_ms(int beg, int end, int *A)
{
    if ( beg + 1 == end ) {
        return;
    }

    int mid = beg + (end - beg)/2;

    parallel_invoke(
        [&](){ par_ms(beg, mid, A); },
        [&](){ par_ms(mid, end, A); }
    );

    merge( beg, mid, end, A);

    return;
}

int main()
{
    task_scheduler_init init(-1);

    static int A[N];   // static: N ints would overflow the stack

    for ( int i = 0; i < N; i++ ) {
        A[i] = N - i;
    }

    par_ms(0, N, A);

    for ( int i = 0; i < 10; i++ ) {
        cout << i << " " << A[i] << endl;
    }
    for ( int i = N-10; i < N; i++ ) {
        cout << i << " " << A[i] << endl;
    }

    return 0;
}[/cpp]

vu64
Beginner
An alternative is to replace the last two while loops and the for loop in merge with std::copy.

sviiya
Beginner
Quoting - vu64
An alternative is to replace the last two while loops and the for loop in merge with std::copy.


Yes, that's true. I had not used copy before; I just looked it up, and it looks perfect. :) BTW, I was trying to keep the code as similar as possible to the standard merge-sort pseudocode, but it is good to know the better way of doing it.

I also wrote a quicksort, with the merge function replaced by partition. I found that the code performs horribly compared to parallel_sort, which is likewise based on quicksort (though implemented in a different way). It took about 13 times longer on 8 cores (with CPU usage at ~10%) and less than 6 times longer when run on a single thread! Not a happy conclusion, the two quicksorts varying so much in performance; I was expecting much less difference. The difference must be due to overhead, but somehow there should be 'some' overhead in the other case too... Quite a thought-provoking exercise this proved to be!
Alexey-Kukanov
Employee
Quoting - sviiya

Yes, that's true. I had not used copy before; I just looked it up, and it looks perfect. :) BTW, I was trying to keep the code as similar as possible to the standard merge-sort pseudocode, but it is good to know the better way of doing it.

I also wrote a quicksort, with the merge function replaced by partition. I found that the code performs horribly compared to parallel_sort, which is likewise based on quicksort (though implemented in a different way). It took about 13 times longer on 8 cores (with CPU usage at ~10%) and less than 6 times longer when run on a single thread! Not a happy conclusion, the two quicksorts varying so much in performance; I was expecting much less difference. The difference must be due to overhead, but somehow there should be 'some' overhead in the other case too... Quite a thought-provoking exercise this proved to be!

I think the issue is granularity. It is known that mergesort does not perform well for sizes below some threshold, and also that parallelism overhead starts to outweigh useful work below some threshold. The TBB implementation of parallel_sort takes this into account.
sviiya
Beginner

I think the issue is granularity. It is known that mergesort does not perform well for sizes below some threshold, and also that parallelism overhead starts to outweigh useful work below some threshold. The TBB implementation of parallel_sort takes this into account.

Yep, that's true! The following is the code, and it now takes a similar time. I had to change the minimum size from the default 1 to 20000. Thanks for reminding me of the minimum grain size. :)

[cpp]#include <iostream>
#include <algorithm>
#include <cstdlib>
#include <tbb/tbb.h>

#define N 20000000

using namespace std;
using namespace tbb;

int partition(int beg, int end, int *A) {
    int j = beg-1;

    int p = rand()%(end-beg-1);
    swap(A[beg+p], A[end-1]);

    for ( int i=beg; i < end; i++ ) {
        if ( A[i] <= A[end-1] ) {
            j++;
            swap( A[i], A[j] );
        }
    }

    return j;
}

void par_qs(int beg, int end, int *A)
{
    if ( beg + 20000 >= end ) {
        sort(A+beg, A+end);
        return;
    }

    int piv = partition( beg, end, A);

    parallel_invoke(
        [&](){ par_qs(beg, piv, A); },
        [&](){ par_qs(piv+1, end, A); }
    );

    return;
}

int main()
{
    task_scheduler_init init(-1);

    static int A[N];   // static: N ints would overflow the stack

    for ( int i = 0; i < N; i++ ) {
        A[i] = rand()%N;
    }

    par_qs(0, N, A);

    return 0;
}
[/cpp]