Intel® oneAPI Threading Building Blocks

Best way to deal with thread specific workspace

Plagne__Laurent
Beginner

Dear TBB experts !

 

I would like to know the best solution to the following simple problem: I want to perform a parallel_for with a work functor that requires some workspace (e.g. an array that must not be accessed concurrently).

 

1) The simplest solution is to use an automatic vector inside the operator() of my functor:

struct WorkFunctor{
  void operator()(const tbb::blocked_range<int> & br) const {
    std::vector<double> workspace(2048);
    // ... do some work
  }
};
 
Unfortunately, for small granularity, this solution is slow because of the time consumed for allocating/deallocating the workspace array.
 

2) I may improve the situation by using the TBB scalable allocator:

struct WorkFunctor{
  void operator()(const tbb::blocked_range<int> & br) const {
    std::vector<double,tbb::cache_aligned_allocator<double> > workspace(2048);
    // ... do some work
  }
};

3) I improve the performance a bit by using a fixed-size array, but I have to be very careful with the stack size per thread (I have encountered erratic bugs due to this issue):

struct WorkFunctor{
  void operator()(const tbb::blocked_range<int> & br) const {
    double workspace[2048]; // lives on this thread's stack
    // ... do some work
  }
};

4) I wonder whether thread-specific storage (tbb::enumerable_thread_specific) is the appropriate solution. I have found very few examples of its use.
 

Thank you in advance for your help.

RafSchietekat
Valued Contributor III

I would go with option 2 for peace of mind and postpone any further attempts at optimisation until I had identified the true needs of the program. Well, I'd like to think that I would. :-)

Maybe the question you should ask is why you have "small granularity"? You should have at least a few good chunks when using the default auto_partitioner. If you're not doing something to sabotage that, maybe the total number of chunks is not that high after all?

(Added 2013-12-02) To be explicit: option 3 will always be fastest, so if it really improves the situation only "a bit", then that's your clue to look elsewhere right there. And even if you improve the rest of the program sufficiently to open up this gap, I doubt that hanging on to the memory using TLS could improve things at all, let alone make a splash, but I could always be mistaken.

jiri
New Contributor I

I think option 4 is also fine, especially given the way threads are used in TBB. Just make sure you clean up any data that may be present in the thread local storage from the last go:

tbb::enumerable_thread_specific<std::vector<double> > workspaces;

struct WorkFunctor{
  void operator()(const tbb::blocked_range<int> & br) const {
      bool exists;
      std::vector<double> & ws = workspaces.local(exists);
      if (exists)
      {
        clear_data_in_workspace(ws); // reuse: wipe what the last go left behind
      }
      else
      {
        build_new_workspace(ws);     // first use on this thread
      }
      // ... do some work
   }
};

// after the parallel_for
workspaces.clear();

Plagne__Laurent
Beginner

Thank you very much Raf and Jiri !

I will explore the different tracks to evaluate their relative efficiencies. I will then post the results.

Actually, I assume (maybe erroneously) that this need for a thread-specific workspace must be frequently encountered. Hence, I find it a bit strange that I could not find the canonical TBB pattern for this issue. Because of its name, the "thread_specific" construct looks promising, but I have found very few examples of it. In addition, the enumerable and combinable properties are not required for my problem.

Thank you again for your help.

Laurent

Vladimir_P_1234567890

hello,

you can try to use tbb::concurrent_vector instead of std::vector

--Vladimir

RafSchietekat
Valued Contributor III

About #3: We have no indication that the data left over from a previous use can be reused (I'll assume it's overwritten each time), which means that option 4 won't really help: at best the performance will be closer to 3 (best time), but I understand there's only "a bit" of difference between 2 and 3, so Amdahl's Law has some sobering things to say about that...

About #5: concurrent_vector is good for other things, but never for fixed-size collections (even if they were used concurrently).

 

Plagne__Laurent
Beginner

Ouch, there are more options than I thought...

For Vladimir: I do not understand how a concurrent_vector can help in this context.

For Jiri : As Raf guessed, I do not need to clear the workspace (I use it to perform some 3D permutations like X=Y)  hence I think I do not need the "exists" parameter.

For Raf : I have not finished my measurements yet, but I wonder whether #5 should be used after all, because I make several calls to the parallel_for running the WorkFunctor functor. I believe that the local() method of workspaces will build only one std::vector per thread (during the first parallel_for call). Hence the allocation overhead may become negligible:

[cpp]

typedef std::vector<double> WorkSpace;
typedef tbb::enumerable_thread_specific<WorkSpace> WorkSpacePool;

struct WorkFunctor{
   WorkSpacePool * wsPoolPtr_;

   WorkFunctor(WorkSpacePool * wsPoolPtr /*, ... */) : wsPoolPtr_(wsPoolPtr) {}

   void operator()(const tbb::blocked_range<int> & br) const {
      WorkSpace & ws = wsPoolPtr_->local(); // built once per thread, then reused
      // ... do some work with ws
   }
};

void aFunction(int ncall=100){
    WorkSpacePool wsPool;
    for (int call=0 ; call<ncall ; call++){
      WorkFunctor wFunctor(&wsPool /*, ... */);
      tbb::parallel_for(range, wFunctor); // 'range' as defined elsewhere
   }
}

[/cpp]

jiri
New Contributor I

laurent.plagne wrote:

I believe that the local() method of workspaces will build only one std::vector for each thread (during the first parallel_for call). Hence the allocation overhead may become neglictible:

Yes, the local() works this way. And given the way threads are used in TBB, this means that consecutive executions of parallel_for will reuse the existing workspaces. So, if your aim is to limit the number of workspace allocations, this would IMHO be a good solution. Without experiments, it's not possible to say whether it is really worth it. But I think at the very least it should not make things worse compared to other solutions that allocate memory dynamically. You could also consider replacing the standard allocator with TBB's scalable allocator.
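For example (a minimal sketch, reusing the typedef names from Laurent's code above):

[cpp]
#include <vector>
#include "tbb/scalable_allocator.h"
#include "tbb/enumerable_thread_specific.h"

// Same pool as before, but the per-thread vectors now allocate through
// TBB's scalable allocator instead of the default std::allocator.
typedef std::vector<double,tbb::scalable_allocator<double> > WorkSpace;
typedef tbb::enumerable_thread_specific<WorkSpace> WorkSpacePool;
[/cpp]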

RafSchietekat
Valued Contributor III

I still don't see the point of trying to improve on the scalable allocator. What percentage are we talking about?

jimdempseyatthecove
Honored Contributor III

Raf, the WorkSpacePool eliminates calls to any allocator (on subsequent calls). No matter how fast you make a scalable allocator, not calling it is always faster.

I was going to suggest using a NULL-initialized pointer in TLS, allocated once on first use, along these lines:
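(A sketch only, using GCC's __thread syntax; the helper name is mine, a single fixed buffer size is assumed, and the buffer deliberately persists until thread exit:)

[cpp]
#include <cstddef>
#include "tbb/scalable_allocator.h"

// One pointer per thread, NULL until first use; allocated once, then
// reused by every subsequent chunk run on that thread.
static __thread float * tls_workspace = 0; // __declspec(thread) on MSVC

float * get_workspace(std::size_t n) {     // n assumed constant per run
  if (!tls_workspace)
    tls_workspace = static_cast<float*>(scalable_malloc(n * sizeof(float)));
  return tls_workspace; // intentionally never freed before thread exit
}
[/cpp]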

The tbb::enumerable_thread_specific... accomplishes a similar goal.

While this does mean the buffer hangs around...
Well that is the point.

Jim Dempsey

RafSchietekat
Valued Contributor III

I admit that it could be faster, strictly speaking, but we still don't know by how much (necessarily only part of "a bit"). And what's the point of being afraid of the scalable allocator? It is invisibly used anyway by all the tasks spawned to support that parallel_for(), and it is now also used here with deallocation (optimally) always in the same thread as allocation, in code that's not heavily reused, i.e., where complexity is not redeemed in number of applications.

Should there really be a "pattern" saying: in any non-recursive code, which includes knowing that the thread will never steal other similar work (!), whenever you need more memory than will comfortably fit on the stack, be as phobic toward the scalable allocator as you ever were toward other allocators and use TLS for every last bit of performance?

I hear already in the original posting that in this case the optimum (stack-based allocation) is only "a bit" faster, and later that the memory is used intensively (for permutations). That suggests that allocation cost may well be a red herring, and that extra performance should perhaps be sought elsewhere instead, e.g., by simply specifying a grain size (it is also obeyed by the auto_partitioner), or somewhere in the code that is not visible to us.
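(For illustration, a minimal sketch of what specifying a grain size looks like; the value 2 and the helper name are placeholders:)

[cpp]
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"

// A blocked_range is not split further once its size is at most the
// grain size, so each Body invocation (and thus each per-invocation
// workspace allocation in options 1/2) covers several iterations.
template <class Body>
void run_with_grainsize(int n, const Body & body) {
  tbb::parallel_for(tbb::blocked_range<int>(0, n, /*grainsize=*/2), body);
}
[/cpp]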

(2013-12-08 Shortened) And to repeat the most important objection: the code must know that the memory won't be reused by stolen work, which limits composability (no nested TBB algorithms) and requires maintained attention.

jiri
New Contributor I

Raf Schietekat wrote:

And to repeat the most important objection: the code must know that the memory won't be reused by stolen work, which limits composability (no nested TBB algorithms) and requires maintained attention.

You mean when using the TLS containing a vector (with normal or scalable allocator)? Could you elaborate on that?

Plagne__Laurent
Beginner

Hi, thank you again for your help !

I made some time measurements on the following code:

[cpp]

#include "tbb/tick_count.h"
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include "tbb/task_scheduler_init.h"
#include "tbb/cache_aligned_allocator.h"
#include "tbb/enumerable_thread_specific.h"
#include <vector>
#include <iostream>

// Select exactly one workspace variant:
//#define STACK_BASED
#define STD_VECTOR
//#define SCAL_STD_VECTOR

#ifdef STACK_BASED
template <int BLOCK_SIZE>
struct WorkSpace{
  float data_[BLOCK_SIZE*BLOCK_SIZE*BLOCK_SIZE];
};
#endif

#ifdef STD_VECTOR
template <int BLOCK_SIZE>
struct WorkSpace{
  std::vector<float> data_;
  WorkSpace(void):data_(BLOCK_SIZE*BLOCK_SIZE*BLOCK_SIZE){}
};
#endif

#ifdef SCAL_STD_VECTOR
template <int BLOCK_SIZE>
struct WorkSpace{
  std::vector<float,tbb::cache_aligned_allocator<float> > data_;
  WorkSpace(void):data_(BLOCK_SIZE*BLOCK_SIZE*BLOCK_SIZE){}
};
#endif


template <int BLOCK_SIZE>
class MyFunctor{
private:
  std::vector<float> & X_;
  tbb::enumerable_thread_specific<WorkSpace<BLOCK_SIZE> > * wPoolPtr_;
public:
  MyFunctor(std::vector<float> & X,
            tbb::enumerable_thread_specific<WorkSpace<BLOCK_SIZE> > * wPoolPtr)
    : X_(X), wPoolPtr_(wPoolPtr)
  {}

  template <class BR>
  void operator()(const BR & br) const {

    const int bs=BLOCK_SIZE;
    const int bs2=bs*BLOCK_SIZE;
    const int bs3=bs2*BLOCK_SIZE;

#define LOCAL_ALLOC // comment out to use the TLS workspace pool instead
#ifdef LOCAL_ALLOC
    WorkSpace<bs> wsp;                      // automatic: rebuilt on every call
#else
    WorkSpace<bs> & wsp=wPoolPtr_->local(); // thread-local: built once, reused
#endif

    for (int b=br.begin() ; b!=br.end() ; b++){

      const int istart=b*bs3;
      float * XShift=&X_[istart];

      // swap into the block: w_=Xb (3D index permutation)
      for (int k=0 ; k<bs ; k++){
        for (int j=0 ; j<bs ; j++){
          for (int i=0 ; i<bs ; i++){
            wsp.data_[j+bs*k+bs2*i]=XShift[i+bs*j+bs2*k];
          }
        }
      }

      // dummy work
      for (int I=0 ; I<bs3 ; I++){
        XShift[I]+=wsp.data_[I];
      }
    }
  }
};

int main( int argc,  char *argv[] ){

  const int bs=128;
  const int nblocks=4;

  const int tn=-1; // -1: let TBB choose the number of threads
  tbb::task_scheduler_init init(tn);

  std::vector<float> X(nblocks*bs*bs*bs,1.f);

  tbb::enumerable_thread_specific<WorkSpace<bs> > workspacePool;

  {
    MyFunctor<bs> mf(X,&workspacePool);

    tbb::tick_count start=tbb::tick_count::now();
    for (int i=0 ; i<100 ; i++){
      tbb::parallel_for(tbb::blocked_range<int>(0,nblocks),mf,tbb::auto_partitioner());
      //tbb::parallel_for(tbb::blocked_range<int>(0,nblocks),mf,tbb::simple_partitioner());
    }
    tbb::tick_count stop=tbb::tick_count::now();
    double d1=(stop-start).seconds();

    std::cout << "d1="<< d1 <<" s" << std::endl;
  }
}

[/cpp]

*******************

For bs=64, with automatic allocation:

  STACK_BASED     0.42 s
  STD_VECTOR      0.63 s
  SCAL_STD_VECTOR 0.49 s

For bs=64, with TLS:

  STACK_BASED     0.43 s
  STD_VECTOR      0.43 s
  SCAL_STD_VECTOR 0.43 s

For bs=128, with automatic allocation:

  STACK_BASED     seg fault if the stack size per thread is not increased
  STD_VECTOR      6.65 s
  SCAL_STD_VECTOR 5.28 s

For bs=128, with TLS:

  STD_VECTOR      4.2 s
  SCAL_STD_VECTOR 4.2 s

So Raf is right to say that the scalable allocator gives performance close to what we obtain with TLS. When the data gets bigger, the stack-based workspace becomes problematic and the TLS performance gets better than the scalable allocator's. In addition, in my code I have several workspace arrays that I can group into a single WorkSpace class used in the enumerable_thread_specific pool. Actually, I would feel comfortable with the TLS solution, but I am a bit afraid of Raf's warnings...

I should explain that I have two different goals for my design:

* I must deliver reliable kernels for my physics group, and I guess they could afford a 10% loss in performance.

* I write research articles where the CPU peak performance ratio obtained by a kernel has some importance (40% does not look as good as 50% ;) ).

Most of all, I want to learn good patterns from experienced users!

P.S. Sorry for the ugly indentation of the code (I copy-pasted it from emacs...)
RafSchietekat
Valued Contributor III

About #12: If the task, or in this case the parallel_for Body, gets and uses the TLS memory and exits without having recursed to another use of the TLS, all is well. But even though the TLS is invisible from another thread (normal concurrency), the code is still non-reentrant for same-thread recursion: either the form you would expect (but still have to look out for), or the one you don't expect. If the Body, or code it calls (written specifically for the Body, or reused from a library maintained by somebody else, or perhaps by yourself with a different hat and on a different day), internally uses a TBB algorithm, this algorithm may at some point have to wait for some stolen work to finish, and may then steal other work, which may have been spawned by other work stolen from the same parallel_for, which may use the same Body, which uses the same TLS…

So, in some convoluted and unexpected way (it wasn't my first thought, either), the Body recurses to itself, and enumerable_thread_specific by itself does not have a stack to save the previous content and restore it later. You could of course write around that, to provide the stack semantics yourself, or encapsulate that in a class with a different interface that automatically provides stack semantics, but you have to do the work and think about it. And the code has to be maintained to keep it crypto-recursion-free or crypto-recursion-safe.

Is that really worth it to save part of "a bit" of difference? Maybe at this point Laurent could quantify "a bit" (absolute and relative difference between option 2 and absolute best option 3): I think it will be far less than the point at which this might be considered a useful pattern, making its use here more like an anti-pattern instead.

(Added) I see he has done so in the meantime... Apparently the code is for reuse, so it does deserve some extra TLC for maximum performance. But how do the numbers change with a larger grain size, starting with 2 perhaps? And with values for work that is representative of real use?

(Added) I also see that it's a rather larger workspace than I had thought, at 2^21 elements, or 8 MB. I don't know much about the scalable allocator's performance here (it used to be that this was just shipped off to plain old malloc()). Another issue is initialisation: isn't the vector then filled with all-zero elements, and how much time does that take? Perhaps you could also try a plain array (still dynamically allocated) to get around that and see whether it makes a difference?
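(For illustration, a sketch of such a plain array wrapped in a small RAII type; scalable_malloc leaves the memory uninitialised, and the names are mine:)

[cpp]
#include "tbb/scalable_allocator.h"

// A fixed-size workspace that skips the zero fill performed by
// std::vector's sized constructor.
template <int N>
struct RawWorkSpace {
  float * data_;
  RawWorkSpace() : data_(static_cast<float*>(scalable_malloc(N * sizeof(float)))) {}
  // Copying allocates a fresh uninitialised buffer: the contents are
  // scratch anyway, so nothing needs to be preserved.
  RawWorkSpace(const RawWorkSpace &) : data_(static_cast<float*>(scalable_malloc(N * sizeof(float)))) {}
  ~RawWorkSpace() { scalable_free(data_); }
private:
  RawWorkSpace & operator=(const RawWorkSpace &); // not needed
};
[/cpp]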

jiri
New Contributor I

Thanks for the explanation. Clearly, if you let go of the thread, you can get into serious trouble with the TLS. It would be nice to have a Task Local Storage :-).

(Edit) Just to make it clear, I am aware of the fact that "Task Local Storage" does not make sense, because it just means data of the task object... ;-)

Regarding the vector initialization: the values in the vector (using the vector(size_t) constructor) will be value-initialized, which really does set them to zero and (in this case) wastes time. As far as I know, there is no way to avoid some initialization with the standard vector. Changing the vector for something else (e.g., a dynamically allocated array) could improve the performance, especially for the non-TLS version.

RafSchietekat
Valued Contributor III

I've been thinking about possible context-local storage, associated with a task group context subtree. CLS<TLS<vector<float>>> could be the right mix here (note that the other way around wouldn't work), if the performance gap cannot be closed otherwise.

Maybe the initialisation can be avoided by overriding allocator::construct()?
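(Something like this C++03-style sketch; whether the container actually routes element construction through construct(), and through which overload, is exactly the open question:)

[cpp]
#include <memory>
#include <new>
#include <vector>

// An allocator whose construct() default-initialises, which for float
// means: no zeroing. Everything else is inherited from std::allocator.
template <typename T>
class no_init_allocator : public std::allocator<T> {
public:
  template <typename U> struct rebind { typedef no_init_allocator<U> other; };
  no_init_allocator() {}
  template <typename U> no_init_allocator(const no_init_allocator<U> &) {}
  void construct(T * p) { ::new (static_cast<void*>(p)) T; } // default-init
  void construct(T * p, const T & v) { ::new (static_cast<void*>(p)) T(v); }
};

// std::vector<float, no_init_allocator<float> > ws(n); // ideally, no fill
[/cpp]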

jiri
New Contributor I

Would the context local storage also solve the problem created when the task hands over the thread to the scheduler which then runs a different task from the same parallel_for? It seems to me it would only deal with the composability, assuming that the context is not reused to run the nested algorithm.

So far, the only general solution I can think of is something like TLS<list<pair<bool,vector>>>, which would maintain a list of workspaces for each thread together with a bool flag that specifies whether that particular vector is being used. The functor would get the list from TLS and find the first unused vector, or add a new one at the end. To make it exception-safe, the flag should probably not be set directly by assignment in the code, but rather by something similar to the scoped lock used by TBB mutexes, as sketched below. In the ideal case, the list will only contain one workspace for each thread; it would grow if there is a "collision". So far I couldn't find a "hole" in this approach, but it sure is ugly :-).
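A sketch of the idea (names hypothetical; the guard plays the role of TBB's scoped locks):

[cpp]
#include <list>
#include <utility>
#include <vector>
#include "tbb/enumerable_thread_specific.h"

typedef std::pair<bool, std::vector<float> > Slot;        // <in use?, workspace>
typedef tbb::enumerable_thread_specific<std::list<Slot> > SlotPool;

// Claims the first free workspace of the calling thread, growing the
// list on a "collision", and releases it on destruction, so the flag
// is reset even if an exception flies through.
class ScopedWorkspace {
  Slot * slot_;
public:
  explicit ScopedWorkspace(SlotPool & pool) : slot_(0) {
    std::list<Slot> & slots = pool.local();
    for (std::list<Slot>::iterator it = slots.begin(); it != slots.end(); ++it)
      if (!it->first) { slot_ = &*it; break; }
    if (!slot_) {
      slots.push_back(Slot(false, std::vector<float>())); // collision: grow
      slot_ = &slots.back();
    }
    slot_->first = true;
  }
  ~ScopedWorkspace() { slot_->first = false; }
  std::vector<float> & get() { return slot_->second; }
};
[/cpp]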

Overriding allocator::construct is an interesting idea, but it seems that for example the latest MSVC compiler does not call it for the vector elements, but only invokes in-place new on them. Unfortunately, I currently don't have the latest C++ standard to check what it says about the issue.

RafSchietekat
Valued Contributor III

Paragraph by paragraph:

You're right, I got a bit carried away there. (I think there's a use for something like CLS, but it needs some more thought.)

A thread-local pool of workspaces might be as you propose, or something that uses the notion that the workspaces are activated as a stack even if they might be used out of order by stolen work. But it's still complicated, so let's hope a slightly coarser granularity and/or avoiding initialisation will make all this unnecessary?

I was looking at a recent draft, but have no idea about the practicality of such an override.

(2013-12-12 Corrected) "was getting"->"got a bit" (right?)

jimdempseyatthecove
Honored Contributor III

I think it is important to insert an FYI here.

Not all threading paradigms are equivalent with respect to Thread Local Storage (this is not inherently clear).

I know this is a TBB forum, but we programmers tend to port code. Should you port to Intel Cilk Plus, be aware that the software thread that continues after a parallel region (call/construct) is not necessarily the same thread that initiated the parallel region (a stack swap may be involved). The impact is that the TLS memory acquired prior to entry into the parallel region may reside in a different thread context following the parallel region. In TBB, OpenMP, and pthreads, the resumption runs on the initiating thread.

This is one of those things that might bite you some day. You might want to place an assert in there: in a __DEBUG__ build, insert a serial number (e.g. __rdtsc) into the object (e.g. the vector) and a copy into a scope-visible variable just before entering the parallel region, then verify that the local copy equals the TLS copy upon resumption (and insert a comment as to why you do this, as the next misguided programmer might yank it out thinking it unnecessary).
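A sketch of that check (field and function names hypothetical):

[cpp]
#include <cassert>
#ifdef _MSC_VER
#include <intrin.h>     // __rdtsc
#else
#include <x86intrin.h>  // __rdtsc
#endif
#include "tbb/enumerable_thread_specific.h"

struct Workspace {
  unsigned long long serial_;
  // ... the actual buffers ...
};

tbb::enumerable_thread_specific<Workspace> g_workspaces;

void checked_parallel_call() {
  unsigned long long tag = __rdtsc(); // cheap, practically unique tag
  g_workspaces.local().serial_ = tag; // copy stored in the TLS object
  // ... parallel region (call/construct) goes here ...
  // After a Cilk-style stack swap the continuation may run in another
  // thread context, so local() would return a different object and the
  // serial numbers would no longer match:
  assert(g_workspaces.local().serial_ == tag);
}
[/cpp]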

Jim Dempsey

RafSchietekat
Valued Contributor III

Then you'll probably have to comment other constructs as well. A recursive mutex should be fun to watch (once): lock, do something parallel, lock again… and you're deadlocked.

jiri
New Contributor I

Moving stacks (e.g., with fibers) opens up a nice selection of minefields to explore, so when porting to such environment you certainly have to be really careful and TLS is an obvious thing to watch. The recursive mutexes are a good example of one issue that is better concealed.

Would rdtsc really be a good serial number here? It seems to me this could fail to detect stack movement in some (extremely rare) cases...
