Yes, simple_partitioner only

xian_x_ · ‎11-03-2015

Hi there,

From the document, tbb::parallel_sort is not stable but deterministic, which given the same input, we shall expect the same results. However I had this simple program to sort a vector of int pairs based on the comparison of the first element:

#include <iostream>
#include <stdlib.h>
#include <vector>
#include <map>
#include <tbb/tbb.h>

using namespace std;
using namespace tbb;

int main(int argc, char *argv[])
{
    const int n     = 1e+5;
    const int max_k = 1e+4;

    srand(0);
    vector< pair< int, int > > A(n), B(n);
    for( int i = 0; i < n; i++ )
    {
        A = std::make_pair( rand() % max_k + 1, i );
        B = A;
    }

    // sort array A
    parallel_sort( A.begin(), A.end(),  [=]( const pair< int, int >&a, const pair< int, int >&b )
                                          {
                                                return a.first < b.first;
                                          });

    // sort array B
    parallel_sort( B.begin(), B.end(),  [=]( const pair< int, int >&a, const pair< int, int >&b )
                                          {
                                                return a.first < b.first;
                                          });

    // check the deterministic
    bool verbose = false;
    int non_equal_count = 0;

    for( int i = 0; i < n; i++ )
    {
        if( A.second != B.second )
        {
            non_equal_count++;
            if( verbose )
                cout<<" not matching: A["<<i<<"]: "<<A.second<<"; B["<<i<<"]: "<<B.second<<endl;
        }
    }

    cout<<" tbb parallel sort a vector of "<<n<<" elements twice : ";
    if( !non_equal_count  )
    {
        cout<<" getting exactly the same results "<<endl;
    }
    else
    {
        cout<<" getting "<<non_equal_count<<" different elements "<<endl;
    }

    return 1;
}

The results are unfortunately not deterministic. I'm not sure which part I've missed, and hope someone can spot some problems in my code or point out how to use the function properly.

To build the code: icpc -std=gnu++0x -I/opt/intel/tbb/include -L/opt/intel/tbb/lib/intel64/gcc4.4/ -ltbb -o tbb_sort tbb_sort.cpp

Here's tbb version and other related platform information:

TBB: VERSION       4.4
TBB: INTERFACE VERSION   9000
TBB: BUILD_DATE       Tue Jul 28 15:23:27 UTC 2015
TBB: BUILD_HOST       nntpdsd52-161 (x86_64)
TBB: BUILD_OS       SUSE Linux Enterprise Server 10 (x86_64)
TBB: BUILD_KERNEL   Linux 2.6.16.60-0.85.1-smp #1 SMP Thu Mar 17 11:45:06 UTC 2011
TBB: BUILD_GCC       g++ (GCC) 4.6.1
TBB: BUILD_COMPILER   ICC: Intel(R) C Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 15.0.3.187 Build 20150407
TBB: BUILD_LIBC   2.4
TBB: BUILD_LD
TBB: BUILD_TARGET   intel64 on cc4.6.1_libc2.4_kernel2.6.16.21_cpp11
TBB: BUILD_COMMAND   icpc -DDO_ITT_NOTIFY -O2 -g -DUSE_PTHREAD -strict-ansi -fPIC -D__TBB_BUILD=1 -w1 -Werror -std=c++0x -D_TBB_CPP0X -I../../src -I../../src/rml/include -I../../include -I.
TBB: TBB_USE_DEBUG   0
TBB: TBB_USE_ASSERT   0
TBB: DO_ITT_NOTIFY   1
TBB: ALLOCATOR   scalable_malloc
TBB: RML   private
TBB: Tools support   disabled
TBBmalloc: VERSION       4.4
TBBmalloc: INTERFACE VERSION   9000
TBBmalloc: BUILD_DATE       Tue Jul 28 15:24:08 UTC 2015
TBBmalloc: BUILD_HOST       nntpdsd52-161 (x86_64)
TBBmalloc: BUILD_OS       SUSE Linux Enterprise Server 10 (x86_64)
TBBmalloc: BUILD_KERNEL   Linux 2.6.16.60-0.85.1-smp #1 SMP Thu Mar 17 11:45:06 UTC 2011
TBBmalloc: BUILD_GCC       g++ (GCC) 4.6.1
TBBmalloc: BUILD_COMPILER   ICC: Intel(R) C Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 15.0.3.187 Build 20150407
TBBmalloc: BUILD_LIBC   2.4
TBBmalloc: BUILD_LD
TBBmalloc: BUILD_TARGET   intel64 on cc4.6.1_libc2.4_kernel2.6.16.21_cpp11
TBBmalloc: BUILD_COMMAND   icpc -DDO_ITT_NOTIFY -O2 -g -DUSE_PTHREAD -strict-ansi -w1 -Werror -std=c++0x -D_TBB_CPP0X -I../../src -I../../src/rml/include -I../../include
TBBmalloc: TBB_USE_DEBUG   0
TBBmalloc: TBB_USE_ASSERT   0
TBBmalloc: DO_ITT_NOTIFY   1
TBBmalloc: huge pages   not requested

Many thanks in advance for any input.

Cheers,

Xiao-Xian

RafSchietekat · ‎11-03-2015

That claim of determinism can not be made for the current implementation.

One solution would be to use a simple_partitioner instead of an auto_partitioner, another, perhaps slightly more efficient, to implement the parallel base case in terms of range splitting down to a deterministic recursive base case (the same as would be obtained from the use of simple_partitioner).

I don't see a workaround other than to modify the source code.

(Added) Of course, even the solutions proposed above still depend on determinism of std::sort(), and in another thread somebody claimed to have seen a parallel implementation (I would still like to see that confirmed or debunked!), so that would have to be disclosed in the claim. A way around that problem is to provide a full implementation even for the recursive base case, or perhaps to have a recursive base case that is smaller than the parallel base case and to implement it in terms of std::stable_sort(), or perhaps even to have a recursive base case of size 1 if benchmark results support that solution. (BTW, I think I presented a proposal for a stable parallel sort on this forum some time ago.)

Vladimir_P_1234567890 · ‎11-04-2015

hello.

There is definitely a discrepancy between the current documentation and the current implementation, thanks for catching this! We will discuss in the team what we need to correct or improve. Meanwhile as workaround you can change auto_partitioner to simple_partifioner in the “<tbbdir>\include\tbb\parallel_sort.h” file like Raf just mentioned.

--Vladimir

xian_x_ · ‎11-07-2015

Thank you guys! I tried "simple_partitioner" and now it is deterministic! I hope I get the rationale right here: the "auto_partitioner" is adapting chunk size based on runtime balance while "simple_partitioner" is using a fixed chunk?

Btw I tested the the code on a larger data ( around 300 millions ), the version with "simple_partitioner" is roughly 5% slower, which is still pretty impressive. I do hope that the future TBB release would offer "deterministic" versions for sorting.

Many thanks,

Xiao-Xian

RafSchietekat · ‎11-07-2015

Yes, simple_partitioner only looks at grainsize, so the chunking would be deterministic.

I'm curious about that 5% difference. What happens if you restore the use of auto_partioner and try the following instead (only the second change is relevant):

This should make no difference other than to resolve a curious inconsistency with the rest of TBB:

bool is_divisible() const {return grainsize<size;}

This would only improve over the use of simple_partioner if the problem was with parallel overhead:

//! Body class for the parallel base case.
/** @ingroup algorithms */
template<typename RandomAccessIterator, typename Compare>
struct quick_sort_body {
    void operator()( quick_sort_range<RandomAccessIterator,Compare>& range ) const {
        //SerialQuickSort( range.begin, range.size, range.comp );
        if (range.is_divisible()) {
            quick_sort_range<RandomAccessIterator,Compare> right(range, tbb::split());
            operator() (range);
            operator() (right);
        } else { // recursive base case
            std::sort( range.begin, range.begin + range.size, range.comp );
        }
    }
};

parallel_sort returns nondeterministic results