Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.
This community is designed for sharing of public information. Please do not share Intel or third-party confidential information here.
2421 Discussions

parallel_sort returns nondeterministic results


Hi there, 

From  the document, tbb::parallel_sort is not stable but deterministic, which given the same input, we shall expect the same results. However I had this simple program to sort a vector of int pairs based on the comparison of the first element:

#include <iostream>
#include <stdlib.h>
#include <vector>
#include <map>
#include <tbb/tbb.h>

using namespace std;
using namespace tbb;

int main(int argc, char *argv[])
    const int n     = 1e+5;
    const int max_k = 1e+4;

    vector< pair< int, int > > A(n), B(n);
    for( int i = 0; i < n; i++ )
        A = std::make_pair( rand() % max_k + 1, i );
        B = A;

    // sort array A
    parallel_sort( A.begin(), A.end(),  [=]( const pair< int, int >&a, const pair< int, int >&b )
                                                return a.first < b.first;

    // sort array B
    parallel_sort( B.begin(), B.end(),  [=]( const pair< int, int >&a, const pair< int, int >&b )
                                                return a.first < b.first;

    // check the deterministic
    bool verbose = false;
    int non_equal_count = 0;

    for( int i = 0; i < n; i++ )
        if( A.second != B.second )
            if( verbose )
                cout<<" not matching: A["<<i<<"]: "<<A.second<<"; B["<<i<<"]: "<<B.second<<endl;

    cout<<" tbb parallel sort a vector of "<<n<<" elements twice : ";
    if( !non_equal_count  )
        cout<<" getting exactly the same results "<<endl;
        cout<<" getting "<<non_equal_count<<" different elements "<<endl;

    return 1;

The results are unfortunately not deterministic. I'm not sure which part I've missed, and hope someone can spot some problems in my code or point out how to use the function properly. 

To build the code:  icpc -std=gnu++0x -I/opt/intel/tbb/include -L/opt/intel/tbb/lib/intel64/gcc4.4/ -ltbb -o tbb_sort tbb_sort.cpp

Here's tbb version and other related platform information:

TBB: VERSION        4.4
TBB: BUILD_DATE        Tue Jul 28 15:23:27 UTC 2015
TBB: BUILD_HOST        nntpdsd52-161 (x86_64)
TBB: BUILD_OS        SUSE Linux Enterprise Server 10 (x86_64)
TBB: BUILD_KERNEL    Linux #1 SMP Thu Mar 17 11:45:06 UTC 2011
TBB: BUILD_GCC        g++ (GCC) 4.6.1
TBB: BUILD_COMPILER    ICC: Intel(R) C Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version Build 20150407
TBB: BUILD_LD        
TBB: BUILD_TARGET    intel64 on cc4.6.1_libc2.4_kernel2.6.16.21_cpp11
TBB: BUILD_COMMAND    icpc -DDO_ITT_NOTIFY -O2 -g -DUSE_PTHREAD -strict-ansi -fPIC -D__TBB_BUILD=1 -w1 -Werror -std=c++0x -D_TBB_CPP0X -I../../src -I../../src/rml/include -I../../include -I.
TBB: ALLOCATOR    scalable_malloc
TBB: RML    private
TBB: Tools support    disabled
TBBmalloc: VERSION        4.4
TBBmalloc: BUILD_DATE        Tue Jul 28 15:24:08 UTC 2015
TBBmalloc: BUILD_HOST        nntpdsd52-161 (x86_64)
TBBmalloc: BUILD_OS        SUSE Linux Enterprise Server 10 (x86_64)
TBBmalloc: BUILD_KERNEL    Linux #1 SMP Thu Mar 17 11:45:06 UTC 2011
TBBmalloc: BUILD_GCC        g++ (GCC) 4.6.1
TBBmalloc: BUILD_COMPILER    ICC: Intel(R) C Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version Build 20150407
TBBmalloc: BUILD_LIBC    2.4
TBBmalloc: BUILD_LD        
TBBmalloc: BUILD_TARGET    intel64 on cc4.6.1_libc2.4_kernel2.6.16.21_cpp11
TBBmalloc: BUILD_COMMAND    icpc -DDO_ITT_NOTIFY -O2 -g -DUSE_PTHREAD -strict-ansi -w1 -Werror -std=c++0x -D_TBB_CPP0X -I../../src -I../../src/rml/include -I../../include
TBBmalloc: TBB_USE_DEBUG    0
TBBmalloc: TBB_USE_ASSERT    0
TBBmalloc: DO_ITT_NOTIFY    1
TBBmalloc: huge pages    not requested


Many thanks in advance for any input.





0 Kudos
4 Replies
Black Belt

That claim of determinism can not be made for the current implementation.

One solution would be to use a simple_partitioner instead of an auto_partitioner, another, perhaps slightly more efficient, to implement the parallel base case in terms of range splitting down to a deterministic recursive base case (the same as would be obtained from the use of simple_partitioner).

I don't see a workaround other than to modify the source code.

(Added) Of course, even the solutions proposed above still depend on determinism of std::sort(), and in another thread somebody claimed to have seen a parallel implementation (I would still like to see that confirmed or debunked!), so that would have to be disclosed in the claim. A way around that problem is to provide a full implementation even for the recursive base case, or perhaps to have a recursive base case that is smaller than the parallel base case and to implement it in terms of std::stable_sort(), or perhaps even to have a recursive base case of size 1 if benchmark results support that solution. (BTW, I think I presented a proposal for a stable parallel sort on this forum some time ago.)



There is definitely a discrepancy between the current documentation and the current implementation, thanks for catching this! We will discuss in the team what we need to correct or improve. Meanwhile as workaround you can change auto_partitioner to simple_partifioner in the “<tbbdir>\include\tbb\parallel_sort.h” file like Raf just mentioned. 



Thank you guys! I tried "simple_partitioner" and now it is deterministic! I hope I get the rationale right here: the "auto_partitioner" is adapting chunk size based on runtime balance while "simple_partitioner" is using a fixed chunk? 

Btw I tested the the code on a larger data ( around 300 millions ), the version with "simple_partitioner" is roughly 5% slower, which is still pretty impressive. I do hope that the future TBB release would offer "deterministic" versions for sorting. 

Many thanks,



Black Belt

Yes, simple_partitioner only looks at grainsize, so the chunking would be deterministic.

I'm curious about that 5% difference. What happens if you restore the use of auto_partioner and try the following instead (only the second change is relevant):

This should make no difference other than to resolve a curious inconsistency with the rest of TBB:

bool is_divisible() const {return grainsize<size;}

This would only improve over the use of simple_partioner if the problem was with parallel overhead:

//! Body class for the parallel base case.
/** @ingroup algorithms */
template<typename RandomAccessIterator, typename Compare>
struct quick_sort_body {
    void operator()( quick_sort_range<RandomAccessIterator,Compare>& range ) const {
        //SerialQuickSort( range.begin, range.size, range.comp );
        if (range.is_divisible()) {
            quick_sort_range<RandomAccessIterator,Compare> right(range, tbb::split());
            operator() (range);
            operator() (right);
        } else { // recursive base case
            std::sort( range.begin, range.begin + range.size, range.comp );