- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi there,
From the document, tbb::parallel_sort is not stable but deterministic, which given the same input, we shall expect the same results. However I had this simple program to sort a vector of int pairs based on the comparison of the first element:
#include <iostream> #include <stdlib.h> #include <vector> #include <map> #include <tbb/tbb.h> using namespace std; using namespace tbb; int main(int argc, char *argv[]) { const int n = 1e+5; const int max_k = 1e+4; srand(0); vector< pair< int, int > > A(n), B(n); for( int i = 0; i < n; i++ ) { A = std::make_pair( rand() % max_k + 1, i ); B = A; } // sort array A parallel_sort( A.begin(), A.end(), [=]( const pair< int, int >&a, const pair< int, int >&b ) { return a.first < b.first; }); // sort array B parallel_sort( B.begin(), B.end(), [=]( const pair< int, int >&a, const pair< int, int >&b ) { return a.first < b.first; }); // check the deterministic bool verbose = false; int non_equal_count = 0; for( int i = 0; i < n; i++ ) { if( A.second != B.second ) { non_equal_count++; if( verbose ) cout<<" not matching: A["<<i<<"]: "<<A.second<<"; B["<<i<<"]: "<<B.second<<endl; } } cout<<" tbb parallel sort a vector of "<<n<<" elements twice : "; if( !non_equal_count ) { cout<<" getting exactly the same results "<<endl; } else { cout<<" getting "<<non_equal_count<<" different elements "<<endl; } return 1; }
The results are unfortunately not deterministic. I'm not sure which part I've missed, and hope someone can spot some problems in my code or point out how to use the function properly.
To build the code: icpc -std=gnu++0x -I/opt/intel/tbb/include -L/opt/intel/tbb/lib/intel64/gcc4.4/ -ltbb -o tbb_sort tbb_sort.cpp
Here's tbb version and other related platform information:
TBB: VERSION 4.4
TBB: INTERFACE VERSION 9000
TBB: BUILD_DATE Tue Jul 28 15:23:27 UTC 2015
TBB: BUILD_HOST nntpdsd52-161 (x86_64)
TBB: BUILD_OS SUSE Linux Enterprise Server 10 (x86_64)
TBB: BUILD_KERNEL Linux 2.6.16.60-0.85.1-smp #1 SMP Thu Mar 17 11:45:06 UTC 2011
TBB: BUILD_GCC g++ (GCC) 4.6.1
TBB: BUILD_COMPILER ICC: Intel(R) C Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 15.0.3.187 Build 20150407
TBB: BUILD_LIBC 2.4
TBB: BUILD_LD
TBB: BUILD_TARGET intel64 on cc4.6.1_libc2.4_kernel2.6.16.21_cpp11
TBB: BUILD_COMMAND icpc -DDO_ITT_NOTIFY -O2 -g -DUSE_PTHREAD -strict-ansi -fPIC -D__TBB_BUILD=1 -w1 -Werror -std=c++0x -D_TBB_CPP0X -I../../src -I../../src/rml/include -I../../include -I.
TBB: TBB_USE_DEBUG 0
TBB: TBB_USE_ASSERT 0
TBB: DO_ITT_NOTIFY 1
TBB: ALLOCATOR scalable_malloc
TBB: RML private
TBB: Tools support disabled
TBBmalloc: VERSION 4.4
TBBmalloc: INTERFACE VERSION 9000
TBBmalloc: BUILD_DATE Tue Jul 28 15:24:08 UTC 2015
TBBmalloc: BUILD_HOST nntpdsd52-161 (x86_64)
TBBmalloc: BUILD_OS SUSE Linux Enterprise Server 10 (x86_64)
TBBmalloc: BUILD_KERNEL Linux 2.6.16.60-0.85.1-smp #1 SMP Thu Mar 17 11:45:06 UTC 2011
TBBmalloc: BUILD_GCC g++ (GCC) 4.6.1
TBBmalloc: BUILD_COMPILER ICC: Intel(R) C Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 15.0.3.187 Build 20150407
TBBmalloc: BUILD_LIBC 2.4
TBBmalloc: BUILD_LD
TBBmalloc: BUILD_TARGET intel64 on cc4.6.1_libc2.4_kernel2.6.16.21_cpp11
TBBmalloc: BUILD_COMMAND icpc -DDO_ITT_NOTIFY -O2 -g -DUSE_PTHREAD -strict-ansi -w1 -Werror -std=c++0x -D_TBB_CPP0X -I../../src -I../../src/rml/include -I../../include
TBBmalloc: TBB_USE_DEBUG 0
TBBmalloc: TBB_USE_ASSERT 0
TBBmalloc: DO_ITT_NOTIFY 1
TBBmalloc: huge pages not requested
Many thanks in advance for any input.
Cheers,
Xiao-Xian
Link Copied
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
That claim of determinism can not be made for the current implementation.
One solution would be to use a simple_partitioner instead of an auto_partitioner, another, perhaps slightly more efficient, to implement the parallel base case in terms of range splitting down to a deterministic recursive base case (the same as would be obtained from the use of simple_partitioner).
I don't see a workaround other than to modify the source code.
(Added) Of course, even the solutions proposed above still depend on determinism of std::sort(), and in another thread somebody claimed to have seen a parallel implementation (I would still like to see that confirmed or debunked!), so that would have to be disclosed in the claim. A way around that problem is to provide a full implementation even for the recursive base case, or perhaps to have a recursive base case that is smaller than the parallel base case and to implement it in terms of std::stable_sort(), or perhaps even to have a recursive base case of size 1 if benchmark results support that solution. (BTW, I think I presented a proposal for a stable parallel sort on this forum some time ago.)
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
hello.
There is definitely a discrepancy between the current documentation and the current implementation, thanks for catching this! We will discuss in the team what we need to correct or improve. Meanwhile as workaround you can change auto_partitioner to simple_partifioner in the “<tbbdir>\include\tbb\parallel_sort.h” file like Raf just mentioned.
--Vladimir
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thank you guys! I tried "simple_partitioner" and now it is deterministic! I hope I get the rationale right here: the "auto_partitioner" is adapting chunk size based on runtime balance while "simple_partitioner" is using a fixed chunk?
Btw I tested the the code on a larger data ( around 300 millions ), the version with "simple_partitioner" is roughly 5% slower, which is still pretty impressive. I do hope that the future TBB release would offer "deterministic" versions for sorting.
Many thanks,
Xiao-Xian
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Yes, simple_partitioner only looks at grainsize, so the chunking would be deterministic.
I'm curious about that 5% difference. What happens if you restore the use of auto_partioner and try the following instead (only the second change is relevant):
This should make no difference other than to resolve a curious inconsistency with the rest of TBB:
bool is_divisible() const {return grainsize<size;}
This would only improve over the use of simple_partioner if the problem was with parallel overhead:
//! Body class for the parallel base case. /** @ingroup algorithms */ template<typename RandomAccessIterator, typename Compare> struct quick_sort_body { void operator()( quick_sort_range<RandomAccessIterator,Compare>& range ) const { //SerialQuickSort( range.begin, range.size, range.comp ); if (range.is_divisible()) { quick_sort_range<RandomAccessIterator,Compare> right(range, tbb::split()); operator() (range); operator() (right); } else { // recursive base case std::sort( range.begin, range.begin + range.size, range.comp ); } } };

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page