Hi all,
I am pretty new to TBB. I am trying to parallelize more-or-less existing code, which is heavily object-oriented. However, with more than one thread, performance degrades by a factor of five compared to the standard serial code, and also compared to the parallelized code run with a single thread.
Is there a document out there - or can anybody help - that explains what the reasons for such problems can be?
There is no shared data. The code just reads (non-changing) input data, computes something, and collects the results in a concurrent_vector. The input data is copied for each thread via copy constructors.
Any further ideas?
Thanks,
Matthias
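[Editor's note: for concreteness, here is a rough, purely hypothetical sketch of the kind of setup described above. None of these names or types come from the actual code; it only illustrates the pattern discussed in the replies below - a read-only model copied per body, and one shared push_back per item.]
[cpp]// Hypothetical reconstruction only - not the poster's code.
#include <cstddef>
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include "tbb/concurrent_vector.h"

struct Model  { /* read-only input data */ };
struct Result { double value; };

// Stand-in for the real per-item computation.
Result compute(const Model&, std::size_t i) { Result r = { i * 0.5 }; return r; }

struct Worker {
    Model model;                                 // copied per body via the copy constructor
    tbb::concurrent_vector<Result>* results;
    void operator()(const tbb::blocked_range<std::size_t>& r) const {
        for (std::size_t i = r.begin(); i != r.end(); ++i)
            results->push_back(compute(model, i));   // one shared push_back per item
    }
};

void run(std::size_t n, tbb::concurrent_vector<Result>& out) {
    Worker w;
    w.results = &out;
    tbb::parallel_for(tbb::blocked_range<std::size_t>(0, n), w);
}[/cpp]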
7 Replies
Quoting - richwin
Hi all,
I am pretty new to TBB. I am trying to parallelize more-or-less existing code, which is heavily object-oriented. However, with more than one thread, performance degrades by a factor of five compared to the standard serial code, and also compared to the parallelized code run with a single thread.
Is there a document out there - or can anybody help - that explains what the reasons for such problems can be?
There is no shared data. The code just reads (non-changing) input data, computes something, and collects the results in a concurrent_vector. The input data is copied for each thread via copy constructors.
Any further ideas?
Adding to a concurrent_vector can very well be a bottleneck; it's an inherently serial operation and won't scale. Check how many local operations (machine instructions or primitive C/C++ statements) you perform per single concurrent_vector modification. It should be no less than 10,000 or so.
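[Editor's note: to make that suggestion concrete, here is a minimal sketch (not the original code; all names are made up): each chunk computes into a thread-private std::vector and touches the shared concurrent_vector only once per chunk, so the serialized part stays small. It assumes a TBB version in which concurrent_vector::grow_by returns an iterator to the newly appended range (older versions return an index instead).]
[cpp]// Sketch only: batch results locally, then append once per chunk.
#include <cstddef>
#include <vector>
#include <algorithm>
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include "tbb/concurrent_vector.h"

// Stand-in for the real (hopefully expensive) per-item work.
double expensive_item(std::size_t i) { return i * 0.5; }

struct ChunkBody {
    tbb::concurrent_vector<double>* results;
    void operator()(const tbb::blocked_range<std::size_t>& r) const {
        std::vector<double> local;               // thread-private buffer
        local.reserve(r.size());
        for (std::size_t i = r.begin(); i != r.end(); ++i)
            local.push_back(expensive_item(i));
        // One shared-vector modification per chunk instead of one per item.
        std::copy(local.begin(), local.end(), results->grow_by(local.size()));
    }
};

void collect(std::size_t n, tbb::concurrent_vector<double>& out) {
    ChunkBody body;
    body.results = &out;
    tbb::parallel_for(tbb::blocked_range<std::size_t>(0, n), body);
}[/cpp]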
Quoting - Dmitriy Vyukov
Adding to a concurrent_vector can very well be a bottleneck; it's an inherently serial operation and won't scale. Check how many local operations (machine instructions or primitive C/C++ statements) you perform per single concurrent_vector modification. It should be no less than 10,000 or so.
Hi,
I have just removed the concurrent_vector and replaced it with a standard array. No change!
Regards
M. Richwin
Quoting - richwin
Quoting - Dmitriy Vyukov
Adding to a concurrent_vector can very well be a bottleneck; it's an inherently serial operation and won't scale. Check how many local operations (machine instructions or primitive C/C++ statements) you perform per single concurrent_vector modification. It should be no less than 10,000 or so.
Hi,
I have just removed the concurrent_vector and replaced it with a standard array. No change!
Regards
M. Richwin
I've found new / delete to be a bottleneck. Try scalable_malloc / scalable_free.
Mike
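[Editor's note: a minimal sketch of that suggestion (names and sizes are just examples): replace hot-path heap allocations with the TBB scalable allocator's C interface.]
[cpp]// Sketch only: a per-thread temporary buffer allocated with scalable_malloc.
#include <cstddef>
#include "tbb/scalable_allocator.h"

void work_on_temporaries(std::size_t n) {
    double* buf = static_cast<double*>(scalable_malloc(n * sizeof(double)));
    if (!buf) return;                  // scalable_malloc returns NULL on failure
    // ... fill and use buf as thread-private scratch space ...
    scalable_free(buf);
}[/cpp]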
Quoting - mwhenson
I've found new / delete to be a bottleneck. Try scalable_malloc / scalable_free.
Mike
Hmm, I use a lot of temporary STL vectors to hold my non-shared data.
Is concurrent_vector better in this respect? Does it make use of scalable_malloc/scalable_free?
Regards
Matthias
Quoting - richwin
Hmm, I use a lot of temporary STL vectors to hold my non-shared data.
Is concurrent_vector better in this respect? Does it make use of scalable_malloc/scalable_free?
Regards
Matthias
Hi friends,
thanks for the suggestion - the last one did help (I believe... - does concurrent_vector really use scalable_malloc?). I changed all STL vectors to concurrent_vector. And voilà: a speed-up of about 1.2 on my slow hyperthreading machine, and a factor of 5.something on the 8-core server.
Thanks!
TBB rulez!
Matthias
Quoting - richwin
Hi friends,
thanks for the suggestion - the last one did help (I believe... - does concurrent_vector really use scalable_malloc?). I changed all STL vectors to concurrent_vector. And voilà: a speed-up of about 1.2 on my slow hyperthreading machine, and a factor of 5.something on the 8-core server.
Thanks!
TBB rulez!
Matthias
You'll probably do even better if you use stl with scalable_allocator:
[cpp]std::vector< T, tbb::scalable_allocator< T > > tVector;[/cpp]
Mike
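[Editor's note: for completeness, a short usage sketch of that line (the element type and names are just examples): temporary vectors declared this way get their memory from scalable_malloc/scalable_free instead of the default operator new/delete.]
[cpp]#include <vector>
#include "tbb/scalable_allocator.h"

// Example alias for a vector backed by the TBB scalable allocator.
typedef std::vector<double, tbb::scalable_allocator<double> > ResultVector;

void compute_chunk() {
    ResultVector tmp;        // allocations go through scalable_malloc/scalable_free
    tmp.reserve(1024);       // hypothetical size
    // ... fill tmp with intermediate, thread-private results ...
}[/cpp]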
"You'll probably do even better if you use stl with scalable_allocator"
It might be easier to do the following (instead of providing each individual STL collection with an allocator), anywhere in the program (hopefully I didn't make any mistakes reconstructing this):
[cpp]#include <new>                          // std::bad_alloc, std::nothrow_t
#include "tbb/scalable_allocator.h"

// Route every global new/delete (scalar and array forms) through the
// TBB scalable allocator.
void* operator new(std::size_t size) throw(std::bad_alloc) {
    if (void* ptr = scalable_malloc(size)) return ptr;
    else throw std::bad_alloc();
}
void* operator new(std::size_t size, const std::nothrow_t&) throw() {
    return scalable_malloc(size);
}
void operator delete(void* ptr) throw() {
    scalable_free(ptr);
}
void operator delete(void* ptr, const std::nothrow_t&) throw() {
    scalable_free(ptr);
}
void* operator new[](std::size_t size) throw(std::bad_alloc) {
    if (void* ptr = scalable_malloc(size)) return ptr;
    else throw std::bad_alloc();
}
void* operator new[](std::size_t size, const std::nothrow_t&) throw() {
    return scalable_malloc(size);
}
void operator delete[](void* ptr) throw() {
    scalable_free(ptr);
}
void operator delete[](void* ptr, const std::nothrow_t&) throw() {
    scalable_free(ptr);
}[/cpp]
