Intel® oneAPI Threading Building Blocks

Possible Reasons for bad performance

richwin
Beginner
Hi all,

I am pretty new to TBB. I am trying to parallelize mostly existing, fairly object-oriented code. However, with more than one thread, performance degrades by a factor of five compared to the original serial code and also compared to the parallelized code running with a single thread.

Is there a document out there - or can anybody help - explaining what the reasons for such problems can be?

There is no shared data. The code just reads (non-changing) input data, computes something, and collects the results in a concurrent_vector. The input data is copied for each thread via copy constructors.
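Schematically, the structure is something like the following sketch (InputData, Result and compute are placeholder names, not my real classes; only the pattern of copying the data per body and pushing results into a concurrent_vector is the same):

[cpp]#include <cstddef>
#include <vector>
#include "tbb/blocked_range.h"
#include "tbb/concurrent_vector.h"
#include "tbb/parallel_for.h"

// Placeholder types: InputData, Result and compute() stand in for the real
// classes; only the overall structure matches the actual code.
struct InputData { std::vector<double> table; };         // read-only input
struct Result    { double value; };

Result compute(const InputData& d, std::size_t i) {      // stand-in kernel
    Result r = { d.table[i % d.table.size()] * 2.0 };
    return r;
}

struct Body {
    InputData data;                                  // copied per thread/task
    tbb::concurrent_vector<Result>* results;         // shared output container
    void operator()(const tbb::blocked_range<std::size_t>& r) const {
        for (std::size_t i = r.begin(); i != r.end(); ++i)
            results->push_back(compute(data, i));    // one push per element
    }
};

int main() {
    InputData data;
    data.table.assign(1000, 1.0);
    tbb::concurrent_vector<Result> results;
    Body body = { data, &results };
    tbb::parallel_for(tbb::blocked_range<std::size_t>(0, 100000), body);
    return 0;
}[/cpp]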

Any further ideas?
Thanks,
Matthias
Dmitry_Vyukov
Valued Contributor I
Quoting - richwin

Addition to a concurrent_vector can very well be the bottleneck; it is an inherently serial operation and will not scale. Check how many local operations (machine instructions or primitive C/C++ statements) you perform per single concurrent_vector modification. It should be no less than 10,000 or so.
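
For example, one common way to amortize that cost is to collect results in a plain local vector and append them to the shared concurrent_vector once per chunk. A minimal sketch (Result and compute are made-up names, and it assumes a TBB version whose concurrent_vector has the iterator-range grow_by overload):

[cpp]#include <cstddef>
#include <vector>
#include "tbb/blocked_range.h"
#include "tbb/concurrent_vector.h"
#include "tbb/parallel_for.h"

// Made-up result type and kernel, only to show the batching idea.
struct Result { double value; };
Result compute(std::size_t i) { Result r = { double(i) }; return r; }

tbb::concurrent_vector<Result> results;

int main() {
    tbb::parallel_for(
        tbb::blocked_range<std::size_t>(0, 1000000),
        [](const tbb::blocked_range<std::size_t>& r) {
            std::vector<Result> local;                    // private to this chunk
            local.reserve(r.size());
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                local.push_back(compute(i));              // cheap, no sharing
            results.grow_by(local.begin(), local.end());  // one shared append per chunk
        });
    return 0;
}[/cpp]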

richwin
Beginner
Quoting - Dmitriy Vyukov

Hi,

I have just removed the concurrent_vector and replaced it with a standard array. No changes!

Regards
M. Richwin
mwhenson
Beginner
Quoting - richwin
I've found new / delete to be a bottleneck. Try scalable_malloc / scalable_free.
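
For example, a direct replacement looks roughly like this (minimal sketch; the buffer size is arbitrary):

[cpp]#include <cstddef>
#include "tbb/scalable_allocator.h"

int main() {
    const std::size_t n = 1024;                    // arbitrary buffer size
    double* buf = static_cast<double*>(scalable_malloc(n * sizeof(double)));
    if (!buf) return 1;                            // scalable_malloc may return NULL
    for (std::size_t i = 0; i < n; ++i)
        buf[i] = static_cast<double>(i);           // use the buffer
    scalable_free(buf);                            // must pair with scalable_malloc
    return 0;
}[/cpp]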

Mike
richwin
Beginner
Quoting - mwhenson


Hmm, I use a lot of temporary STL vectors to hold my non-shared data.

Is concurrent_vector better in this sense? Does it make use of scalable_malloc/_free?

Regards
Matthias
richwin
Beginner
Quoting - richwin
Hi friends,

thanks for the suggestions - the last one did help (I believe... - does concurrent_vector really use scalable_malloc?). I changed all STL vectors to concurrent_vector, and voilà: a speed-up of about 1.2 on my slow hyperthreading machine, and a factor of 5.something on the 8-core server.

Thanks!
TBB rulez!

Matthias
mwhenson
Beginner
Quoting - richwin
You'll probably do even better if you use the STL containers with scalable_allocator, i.e.

std::vector<T, tbb::scalable_allocator<T> > tVector;
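
Spelled out as a self-contained sketch (Result is just a placeholder element type; the container name is arbitrary):

[cpp]#include <vector>
#include "tbb/scalable_allocator.h"

// Result is a placeholder element type; the rest is ordinary std::vector use.
struct Result { double value; };

typedef std::vector<Result, tbb::scalable_allocator<Result> > ResultVector;

int main() {
    ResultVector results;                // growth now goes through the TBB
    results.reserve(1000);               // scalable allocator instead of plain new
    for (int i = 0; i < 1000; ++i) {
        Result r = { double(i) };
        results.push_back(r);
    }
    return 0;
}[/cpp]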

Mike
RafSchietekat
Black Belt
"You'll probably do even better if you use stl with scalable_allocator"
It might be easier to do the following (instead of providing each individual STL collection with an allocator), anywhere in the program (hopefully I didn't make any mistakes reconstructing this):

[cpp]#include <cstddef>  // std::size_t
#include <new>          // std::bad_alloc, std::nothrow_t
#include "tbb/scalable_allocator.h"

void* operator new(std::size_t size) throw(std::bad_alloc) {
  if (void* ptr = scalable_malloc(size)) return ptr; else throw std::bad_alloc();
}
void* operator new(std::size_t size, const std::nothrow_t&) throw() {
  return scalable_malloc(size);
}
void operator delete(void* ptr) throw() { scalable_free(ptr); }
void operator delete(void* ptr, const std::nothrow_t&) throw() { scalable_free(ptr); }
void* operator new[](std::size_t size) throw(std::bad_alloc) {
  if (void* ptr = scalable_malloc(size)) return ptr; else throw std::bad_alloc();
}
void* operator new[](std::size_t size, const std::nothrow_t&) throw() {
  return scalable_malloc(size);
}
void operator delete[](void* ptr) throw() { scalable_free(ptr); }
void operator delete[](void* ptr, const std::nothrow_t&) throw() { scalable_free(ptr); }[/cpp]
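Once these overrides are linked into the program, the rest of the code should not need any changes; plain STL containers with the default allocator then allocate through scalable_malloc/scalable_free automatically. A trivial sketch:

[cpp]#include <string>
#include <vector>

int main() {
    // Nothing TBB-specific here: with the replaced global operator new/delete
    // linked in, these allocations are served by the scalable allocator.
    std::vector<std::string> names;
    names.push_back("allocated via scalable_malloc");
    return 0;
}[/cpp]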