Intel® oneAPI Threading Building Blocks

Possible Reasons for bad performance

richwin
Beginner
Hi all,

I am pretty new to TBB. I am trying to parallelize mostly existing, fairly object-oriented code. However, with more than one thread, performance degrades by a factor of five compared to the original serial code and also compared to the parallelized code running with a single thread.

Is there a document out there - or can anybody help - explaining what the reasons for such problems can be?

There is no shared data. The code just reads (non-changing) input data, computes something, and collects the results in a concurrent_vector. The input data is copied for each thread via copy constructors.
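Schematically, the structure is something like the following sketch (InputData, Result and compute are placeholder names, not my real classes; only the pattern of copying the data per body and pushing results into a concurrent_vector is the same):

[cpp]#include <cstddef>
#include <vector>
#include "tbb/blocked_range.h"
#include "tbb/concurrent_vector.h"
#include "tbb/parallel_for.h"

// Placeholder types: InputData, Result and compute() stand in for the real
// classes; only the overall structure matches the actual code.
struct InputData { std::vector<double> table; };         // read-only input
struct Result    { double value; };

Result compute(const InputData& d, std::size_t i) {      // stand-in kernel
    Result r = { d.table[i % d.table.size()] * 2.0 };
    return r;
}

struct Body {
    InputData data;                                  // copied per thread/task
    tbb::concurrent_vector<Result>* results;         // shared output container
    void operator()(const tbb::blocked_range<std::size_t>& r) const {
        for (std::size_t i = r.begin(); i != r.end(); ++i)
            results->push_back(compute(data, i));    // one push per element
    }
};

int main() {
    InputData data;
    data.table.assign(1000, 1.0);
    tbb::concurrent_vector<Result> results;
    Body body = { data, &results };
    tbb::parallel_for(tbb::blocked_range<std::size_t>(0, 100000), body);
    return 0;
}[/cpp]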

Any further ideas?
Thanks,
Matthias
Dmitry_Vyukov
Valued Contributor I
Quoting - richwin

Addition to a concurrent_vector can very well be the bottleneck; it is an inherently serial operation and will not scale. Check how many local operations (machine instructions or primitive C/C++ statements) you perform per single concurrent_vector modification. It should be no less than 10,000 or so.
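
For example, one common way to amortize that cost is to collect results in a plain local vector and append them to the shared concurrent_vector once per chunk. A minimal sketch (Result and compute are made-up names, and it assumes a TBB version whose concurrent_vector has the iterator-range grow_by overload):

[cpp]#include <cstddef>
#include <vector>
#include "tbb/blocked_range.h"
#include "tbb/concurrent_vector.h"
#include "tbb/parallel_for.h"

// Made-up result type and kernel, only to show the batching idea.
struct Result { double value; };
Result compute(std::size_t i) { Result r = { double(i) }; return r; }

tbb::concurrent_vector<Result> results;

int main() {
    tbb::parallel_for(
        tbb::blocked_range<std::size_t>(0, 1000000),
        [](const tbb::blocked_range<std::size_t>& r) {
            std::vector<Result> local;                    // private to this chunk
            local.reserve(r.size());
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                local.push_back(compute(i));              // cheap, no sharing
            results.grow_by(local.begin(), local.end());  // one shared append per chunk
        });
    return 0;
}[/cpp]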

richwin
Beginner
Quoting - Dmitriy Vyukov

Hi,

I have just removed the concurrent_vector and replaced it with a standard array. No changes!

Regards
M. Richwin
mwhenson
Beginner
Quoting - richwin
I've found new / delete to be a bottleneck. Try scalable_malloc / scalable_free.
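
For example, a direct replacement looks roughly like this (minimal sketch; the buffer size is arbitrary):

[cpp]#include <cstddef>
#include "tbb/scalable_allocator.h"

int main() {
    const std::size_t n = 1024;                    // arbitrary buffer size
    double* buf = static_cast<double*>(scalable_malloc(n * sizeof(double)));
    if (!buf) return 1;                            // scalable_malloc may return NULL
    for (std::size_t i = 0; i < n; ++i)
        buf[i] = static_cast<double>(i);           // use the buffer
    scalable_free(buf);                            // must pair with scalable_malloc
    return 0;
}[/cpp]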

Mike
richwin
Beginner
Quoting - mwhenson


Hmm, I use a lot of temporary STL vectors to hold my non-shared data.

Is concurrent_vector better in this sense? Does it make use of scalable_malloc/_free?

Regards
Matthias
richwin
Beginner
Quoting - richwin
Hi friends,

thanks for the suggestions - the last one did help (I believe... - does concurrent_vector really use scalable_malloc?). I changed all STL vectors to concurrent_vector, and voilà: a speed-up of about 1.2 on my slow hyperthreading machine, and a factor of 5.something on the 8-core server.

Thanks!
TBB rulez!

Matthias
mwhenson
Beginner
Quoting - richwin
You'll probably do even better if you use the STL containers with scalable_allocator, i.e.

std::vector<T, tbb::scalable_allocator<T> > tVector;
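
Spelled out as a self-contained sketch (Result is just a placeholder element type; the container name is arbitrary):

[cpp]#include <vector>
#include "tbb/scalable_allocator.h"

// Result is a placeholder element type; the rest is ordinary std::vector use.
struct Result { double value; };

typedef std::vector<Result, tbb::scalable_allocator<Result> > ResultVector;

int main() {
    ResultVector results;                // growth now goes through the TBB
    results.reserve(1000);               // scalable allocator instead of plain new
    for (int i = 0; i < 1000; ++i) {
        Result r = { double(i) };
        results.push_back(r);
    }
    return 0;
}[/cpp]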

Mike
RafSchietekat
Black Belt
"You'll probably do even better if you use stl with scalable_allocator"
It might be easier to do the following (instead of providing each individual STL collection with an allocator), anywhere in the program (hopefully I didn't make any mistakes reconstructing this):

[cpp]#include <cstddef>  // std::size_t
#include <new>          // std::bad_alloc, std::nothrow_t
#include "tbb/scalable_allocator.h"

void* operator new(std::size_t size) throw(std::bad_alloc) {
  if (void* ptr = scalable_malloc(size)) return ptr; else throw std::bad_alloc();
}
void* operator new(std::size_t size, const std::nothrow_t&) throw() {
  return scalable_malloc(size);
}
void operator delete(void* ptr) throw() { scalable_free(ptr); }
void operator delete(void* ptr, const std::nothrow_t&) throw() { scalable_free(ptr); }
void* operator new[](std::size_t size) throw(std::bad_alloc) {
  if (void* ptr = scalable_malloc(size)) return ptr; else throw std::bad_alloc();
}
void* operator new[](std::size_t size, const std::nothrow_t&) throw() {
  return scalable_malloc(size);
}
void operator delete[](void* ptr) throw() { scalable_free(ptr); }
void operator delete[](void* ptr, const std::nothrow_t&) throw() { scalable_free(ptr); }[/cpp]
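Once these overrides are linked into the program, the rest of the code should not need any changes; plain STL containers with the default allocator then allocate through scalable_malloc/scalable_free automatically. A trivial sketch:

[cpp]#include <string>
#include <vector>

int main() {
    // Nothing TBB-specific here: with the replaced global operator new/delete
    // linked in, these allocations are served by the scalable allocator.
    std::vector<std::string> names;
    names.push_back("allocated via scalable_malloc");
    return 0;
}[/cpp]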