Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Possible reasons for poor performance

richwin
Beginner
137 Views
Hi all,

I am fairly new to TBB. I am trying to parallelize mostly pre-existing, well-structured object-oriented code. However, with more than one thread, performance degrades by a factor of five compared both to the original serial code and to the parallelized code running on a single thread.

Is there a document out there - or can anybody help - to explain what the reasons can be for such problems?

No data is shared. The code just reads (unchanging) input data, computes something, and collects the results in a concurrent_vector. The input data is copied for each thread via copy constructors.

Any further ideas?
Thanks,
Matthias
7 Replies
Dmitry_Vyukov
Valued Contributor I
Quoting - richwin
Hi all,

I am fairly new to TBB. I am trying to parallelize mostly pre-existing, well-structured object-oriented code. However, with more than one thread, performance degrades by a factor of five compared both to the original serial code and to the parallelized code running on a single thread.

Is there a document out there - or can anybody help - to explain what the reasons can be for such problems?

No data is shared. The code just reads (unchanging) input data, computes something, and collects the results in a concurrent_vector. The input data is copied for each thread via copy constructors.

Any further ideas?



Addition to a concurrent_vector can certainly be a bottleneck: it is an inherently serial operation, so it won't scale. Check how many local operations (machine instructions or primitive C/C++ statements) you perform per concurrent_vector modification. It should be at least 10,000 or so.

richwin
Beginner
Quoting - Dmitriy Vyukov

Addition to a concurrent_vector can certainly be a bottleneck: it is an inherently serial operation, so it won't scale. Check how many local operations (machine instructions or primitive C/C++ statements) you perform per concurrent_vector modification. It should be at least 10,000 or so.


Hi,

I have just removed the concurrent_vector and replaced it with a standard array. No change!

Regards
M. Richwin
mwhenson
Beginner
Quoting - richwin
Quoting - Dmitriy Vyukov

Addition to a concurrent_vector can certainly be a bottleneck: it is an inherently serial operation, so it won't scale. Check how many local operations (machine instructions or primitive C/C++ statements) you perform per concurrent_vector modification. It should be at least 10,000 or so.


Hi,

I have just removed the concurrent_vector and replaced it with a standard array. No change!

Regards
M. Richwin
I've found new / delete to be a bottleneck. Try scalable_malloc / scalable_free.

Mike
richwin
Beginner
Quoting - mwhenson
I've found new / delete to be a bottleneck. Try scalable_malloc / scalable_free.

Mike


Hmm, I use a lot of temporary STL vectors to keep my non-shared data.

Is concurrent_vector better in this sense? Does it make use of scalable_malloc/_free?

Regards
Matthias
richwin
Beginner
Quoting - richwin


Hmm, I use a lot of temporary STL vectors to keep my non-shared data.

Is concurrent_vector better in this sense? Does it make use of scalable_malloc/_free?

Regards
Matthias
Hi friends,

thanks for the suggestions - the last one did help (I believe... - does concurrent_vector really use scalable_malloc?). I changed all STL vectors to concurrent_vector. And voilà: a speed-up of about 1.2 on my slow hyper-threading machine, and a factor of 5-and-something on the 8-core server.

Thanks!
TBB rulez!

Matthias
mwhenson
Beginner
Quoting - richwin
Hi friends,

thanks for the suggestions - the last one did help (I believe... - does concurrent_vector really use scalable_malloc?). I changed all STL vectors to concurrent_vector. And voilà: a speed-up of about 1.2 on my slow hyper-threading machine, and a factor of 5-and-something on the 8-core server.

Thanks!
TBB rulez!

Matthias
You'll probably do even better if you use the STL with scalable_allocator, e.g.

std::vector< T , tbb::scalable_allocator< T > > tVector;

Mike
RafSchietekat
Black Belt
"You'll probably do even better if you use stl with scalable_allocator"
It might be easier to do the following (instead of providing each individual STL collection with an allocator), anywhere in the program (hopefully I didn't make any mistakes reconstructing this):

[cpp]#include <new>
#include "tbb/scalable_allocator.h"

void* operator new(std::size_t size) throw(std::bad_alloc) {
  if (void* ptr = scalable_malloc(size)) return ptr;
  throw std::bad_alloc();
}
void* operator new(std::size_t size, const std::nothrow_t&) throw() {
  return scalable_malloc(size);
}
void operator delete(void* ptr) throw() { scalable_free(ptr); }
void operator delete(void* ptr, const std::nothrow_t&) throw() { scalable_free(ptr); }

void* operator new[](std::size_t size) throw(std::bad_alloc) {
  if (void* ptr = scalable_malloc(size)) return ptr;
  throw std::bad_alloc();
}
void* operator new[](std::size_t size, const std::nothrow_t&) throw() {
  return scalable_malloc(size);
}
void operator delete[](void* ptr) throw() { scalable_free(ptr); }
void operator delete[](void* ptr, const std::nothrow_t&) throw() { scalable_free(ptr); }[/cpp]