Solved: How to control the order of processede elements ?

Bartlomiej · ‎05-28-2009

Hello,

I have one more question. I have the following problem:

I process several elements - possibly in parallel,
new elements are added during the execution,
to be short it's something, where a parallel_do would seem quite appropriate, but...

... the problem is, it is crucial for the efficiency (not for correctness) that some of the elements are processed earlier than another ones.
If I were using a lower-level API, like Pthreads, I'd do it like that:

there is a shared priority queue - either lock-based or lock-free,
each of the threads gets the ``smallest'' element and possibly puts some new elements to the proper place.

Doesn't TBB have any tool to do it, without writing ``from scratch'' ? AFAIK parallel_do does not allow to control the sequence of assigned jobs like that (or is there a way to do it?).

Thank you in advance and best regards.

RafSchietekat · ‎05-29-2009

"But in this case there should be a mechanizm to migrate jobs from more loaded threads to less loded (or even idle ones) - as it is very difficult to forsee how difficult a job will be to finish."
That's handled in my suggestion: register new work in the feeder, so that ideally it will be handled locally, but otherwise, if another thread stole the task (the job migrated, as you call it), it still knows where to get some real data because it knows the specific queue. Except that this will happen one at a time (not the most efficient)... but maybe you can take more than one item from the queue at a time, and allow feeder tasks to find an empty queue if that happened. Should work, I think, unless you can find something simpler, which would be interesting but not that surprising.

View solution in original post

RafSchietekat · ‎05-28-2009

Normally "efficiency" and "shared" don't go together, so you might want to revise that assumption, otherwise I don't see any reason not to use a locked shared std::priority_queue together with parallel_while (not sure why it got deprecated) or pipeline (also easy to use for this purpose, not deprecated, and parallel_while/parallel_do is all about creating new local work which is not relevant here anyway).

If you can revise the assumption, you might want to consider thread-local priority_queue objects: insert new work into the local priority_queue, and maybe use the feeder to make sure nothing gets lost (each time a non-specific token for a specific queue, letting the queue decide the order). I would still lock my accesses to the queues, at least in the beginning and just to be sure, because redundant locking within a thread is quite cheap anyway (right?). Well, it's just a quick initial idea for your inspiration, and there may well be a better way to do this, which I hope you would then describe here.

(Added) Sorry, a bit too quick here, the lock on the local priority_queuewould not be optional (feeder work may get stolen, I presume).

Bartlomiej · ‎05-28-2009

Thanks for the reply.

Yes, I know how to do it that way - I can create several threads using either tbb_thread (or possibly even parallel_for) and order all of the trheads to do a do...while loop, operating on a shared data structure.
My concern is that TBB is so that it such operations could be automated. And, actually, if the order of processed jobs wasn't signifficant, a parallel_do (and possibly also other concepts) would work perfectly there.

That's why I don't like too-high-level APIs ;-) much - if you want to do something that was expected by the library developer, you can do it very simply, indeed. But if you want to do something only a bit different, it might make you to write the whole program in a completely different way or use strange tricks.

I think this might evolve to a suggestion for future TBB API - add a variant of parallel_do, where the ``input stream'' and ``feeders'' can be influenced by the programmer somehow, e. g. the programmer can change if feeders add new elements to the main ``input stream'' or to local ones and how is the ``input stream'' implemented (abstract class ?).

I know the idea might be immature yet, but IMHO it might grow up with some help. ;-)

RafSchietekat · ‎05-28-2009

"write the whole program in a completely different way or use strange tricks"
aka. "programming" :-)

Library or even language design is about capturing and supporting the patterns that do the most good (in TBB's case: best performance relative to programming effort, I suppose), otherwise the product becomes bloated. Do you have the experience to allow you to claim you have identified such a worthy pattern? Meanwhile, I guess it's all about doing the best we can with the tools at our disposal, comparing notes, and building that experience. The rest is pretty much idle speculation, although dreaming can be fun, too.

In this case, you have not even addressed whether the priority_queue must be shared or can be subdivided, and why, so...

Bartlomiej · ‎05-29-2009

"write the whole program in a completely different way or use strange tricks"
aka. "programming" :-)

Heh, indeed. :-)

Do you have the experience to allow you to claim you have identified such a worthy pattern?

Well, that's a good question. :-) I have "some" experience, but does it allow me to do anything ? I dunno. ;-)
Also, after writing the previous post I started to wonder - would it really be simpler to niggle in the underlying mechanizm than to write the thing from scratch. I'm not sure...

Meanwhile, I guess it's all about doing the best we can with the tools at our disposal, comparing notes, and building that experience.

And that's what I hope for. :-)

Quoting - Raf Schietekat

In this case, you have not even addressed whether the priority_queue must be shared or can be subdivided, and why, so...

Yes, so the implementation, where each thread has it's own priority queue of jobs is a solution, too. But in this case there should be a mechanizm to migrate jobs from more loaded threads to less loded (or even idle ones) - as it is very difficult to forsee how difficult a job will be to finish.

Best regards

RafSchietekat · ‎05-29-2009

"But in this case there should be a mechanizm to migrate jobs from more loaded threads to less loded (or even idle ones) - as it is very difficult to forsee how difficult a job will be to finish."
That's handled in my suggestion: register new work in the feeder, so that ideally it will be handled locally, but otherwise, if another thread stole the task (the job migrated, as you call it), it still knows where to get some real data because it knows the specific queue. Except that this will happen one at a time (not the most efficient)... but maybe you can take more than one item from the queue at a time, and allow feeder tasks to find an empty queue if that happened. Should work, I think, unless you can find something simpler, which would be interesting but not that surprising.

Bartlomiej · ‎05-30-2009

Quoting - Raf Schietekat

That's handled in my suggestion: register new work in the feeder, so that ideally it will be handled locally, but...

Oh, *that* was the idea in yor first reply ! Actually, I didn't get it then.
Now, that I understood, it sounds likea very good idea.
The only thing I need now is a good heuristic to choose if put a new job to the private queue of the thread or to the feeder. But that's very problem specific and I have to work it out myself. ;-)
Thanks !
And if I come up with some general ideas that will seem good, I'll be back to tell. ;-)

RafSchietekat · ‎05-30-2009

Maybe I should be more precise...

It seems like you have acknowledged that your program would run best with parallel_do/parallel_while with priority queue-based feeders (or parallel_while equivalent). It's not just API convenience that should determine that, of course: parallel_... gets new work from a shared location only reluctantly, but feels happiest when it can work all by itself, using locally created work, because then there is no inter-cache traffic to slow things down, so that's what the feeder is all about.

Short of reworking that algorithm to literally do that and then chase its revisions, and without sufficient evidence (yet?) to support this as a generally useful pattern, I would suggest to just simulate it. So what I would do is not "choose if put a new job to the private queue of the thread or to the feeder" as you say, but insert work W into a std::priority_queue Q in thread-local storage, and at the same time also insert an item I(Q) into the feeder. You can start this by using a shared priority_queue, whose work registrations you feed into parallel_do/parallel_while. Note that I only records Q, not W, because B::operator() would let Q decide which item to process next. The idea is that, most of the time, Q will be local, and parallel_... will be happy.

That's all from my side.

Bartlomiej · ‎06-05-2009

Quoting - Raf Schietekat

I'm sorry - I was out of web access for a few days.
And I'm sorry, but not sure if I understand you correctly:

So for each of the threads we have a priority queue - but all of them are shared, so they cannot lie in TLS. They are shared, but each of them is dedicated to one of the threads.
The "jobs" added by feeders - I(Q)'s - are some references (yes ?) to the queue in which the actual job description was put.

I'm afraid I misunderstood you again, but it might work, anyway. ;-)

Thanks !

RafSchietekat · ‎06-05-2009

"So for each of the threads we have a priority queue - but all of them are shared, so they cannot lie in TLS. They are shared, but each of them is dedicated to one of the threads."
Well, yes, if that works well for your program. The TLS doesn't really hold a queue, of course, just a pointer to one that's dynamically allocated. And if it's a std::priority_queue it would have to be bundled with its own mutex.

"The "jobs" added by feeders - I(Q)'s - are some references (yes ?) to the queue in which the actual job description was put."
That's what I meant. Be sure to report back how it works out, if you decide to try this!

jimdempseyatthecove · ‎06-05-2009

Since you have control over creating your own iterator, your iterator can be written to pick the next item from a prioritized collection of (for the lack of better words) queues. Depending on the application, one of the easiest ways to do this would be through the use of a series of ring buffers. One at each priority level. The iterator picks starting at the lowest level and migrating up levels when level empty. The fill starts at desired level, and should level be full it migrates up a level. When fill fails at all levels then fall back to a dynamic queue. Ring buffers can be managed with a single CAS and without locks.

Jim Dempsey

Bartlomiej · ‎06-08-2009

Dear Raf,
I objected against using TLS for a while as I wasn't sure if a pointer to an object in TLS would be valid for another thread.
But I was probably wrong as it's the same process and the same address space.
Anyway, I'm not sure what would be best wrt caching - shared memory or TLS...

Quoting - jimdempseyatthecove

... Depending on the application, one of the easiest ways to do this would be through the use of a series of ring buffers. One at each priority level...

Jim Dempsey

Dear Jim,
Thanks. It would work perfectly, but in my problem I don't have a finite a priori known number of priorities.
I consider a branch-and-bound optimization method and the priority is the lower bound for the objective in a region; there is a continuous range of its possible values (well, it's not as the set of floating-point numbers is actually discrete, but let's neglect it ;-) ) and we can only compare priority values of boxes to find a place to put a box.

Best regards

RafSchietekat · ‎06-08-2009

"I objected against using TLS for a while as I wasn't sure if a pointer to an object in TLS would be valid for another thread."
I'm always in favour of caution. But you would not be accessing another thread's TLS: the TLS merely holds a pointer to the thread's queue, which pointer is also passed separately (inside the I(Q) item), and probably held centrally for later cleanup. The queues would be dynamically allocated, and would occasionally be accessed from another thread, hence the locks.

"But I was probably wrong as it's the same process and the same address space."
Unfortunately, memory semantics issues can get in the way of that argument, although locking a mutex, in addition to providing algorithmic integrity, takes care of those memory-semantics issues for you.

"Anyway, I'm not sure what would be best wrt caching - shared memory or TLS..."
The TLS is just to identify the local queue, not to hold it. The queues themselves are in shared memory anyway. My suggestion assumes that, most of the time, threads access their own queue (good for the cache), unless they become unemployed and need to go out stealing (bad social security out in TBB land...).

(Correction 2009-06-10) Missing logical step added.

jimdempseyatthecove · ‎06-09-2009

Quoting - bartlomiej

Dear Jim,
Thanks. It would work perfectly, but in my problem I don't have a finite a priori known number of priorities.
I consider a branch-and-bound optimization method and the priority is the lower bound for the objective in a region; there is a continuous range of its possible values (well, it's not as the set of floating-point numbers is actually discrete, but let's neglect it ;-) ) and we can only compare priority values of boxes to find a place to put a box.

Best regards

Not having intimate knowledge of your data, how and when it is created, lives and flows, then dies, it is somewhat difficult to offer a good solution.

When the work for each item is relatively large and when the list of items changes intermittentlyand when you want to prioritize those items I would suggest you consider an old stand by: the singly linked list. It is trivial to construct an n-way iterator to have n-threads pick off the prioritized list. The reorganization (add, delete, relocate) can occur in the serial portion of your process loop without use of interlocks i.e. prior to, or subsequent to, the process all items in list task. Usualy the reorganization decision is made outside the process loop so having the reorganization performed by one thread may not be as bad as it might first appear.

If you have an unwarranteddistain for linked lists then the suggestion would be to have two vectors of pointers to the objects. One vector containing the current working set (in order of priority) and the second vector on stand-by to be built with the reorganization (add, delete, relocate), plus a state variable to indicate which vector is active and which is stand-by. You want to keep a second vector for stand-by in order to eliminate deallocate and allocate of the vector each time you make a change to the list (and eliminate reallocation during population). Also note, an add of lowest priority can occur without swaping vectors (as well as swap entries can be use to tweak a priority one slot).

Note, insertion and deletion of linked list is much faster than that of repopulating the stand-by vector.

Also note, the standard parallel_for, presumably with granularity of 1 for processing in priority order, will introduce interlocked operation for each step whereas an n-way pick will not. You could use a n-way pick in a parallel_do but not in a parallel_for.

We perform a non-interlocked n-way pick in the QuickThread parallel_list.

Jim Dempsey

Bartlomiej · ‎06-10-2009

Quoting - Raf Schietekat

Yes, you're right. Nevertheless, it just struck me that TBB has no API for TLS no for thread identifiers so it'll still be fun to code it...

Quoting -Jim Dempsey

Yes, of course !But, well I don't want you to dothe work that's my duty;-) , just hoped for some hints (and got several - thanks). As I already havewritten before - yes if I used e.g. Pthreads or Boost I'd do it exactly, like you're saying (I already have serial implementatiom with the priority queue implemented as a linked list). I just hoped to:
- improve the program a bit,
- do it better thanks to features of TBB,
- and (well, I admit it ;-) ) play a bit with these cool features.

Actually, I think the way Raf suggested is very well - posibly I'll check a few variants.

Best regards

Alexey-Kukanov · ‎06-10-2009

Quoting - bartlomiej

Nevertheless, it just struck me that TBB has no API for TLS no for thread identifiers so it'll still be fun to code it...

For thread-local data, you may consider tbb::enumerable_thread_specific class (ETS), which isavailable in most recent open-source development updates. It will soon be released in binary form as well. The documentation for it is being prepared, so for now ask any questions about ETSin the forum.

jimdempseyatthecove · ‎06-10-2009

Scissors, Paper, Rock,

There is no one solution to fit all circumstances. Because one feature is missing from one potential solution does not mean that solution should be rejected. You are free to use TLS in TBB.

Just because TBB does not expressly show you how and why to use TLS does not mean you cannot use TLS (and therefore presents the objectionable point). Same true for thread identifier

volatile long lastThreadNumber = -1;
__declspec( thread) int ThreadNumber = -1;

int get_my_thread_number()
{
int tn = (int)ThreadNumber; // copy from TLS to local
if(tn >= 0) return tn; // test against local copy/return local copy(faster)
// 1st time call, get our thread number
// caution, this impliesif/when thread terminated, numberretires
ThreadNumber = tn = (int)InterlockedIncrement(&lastThreadNumber);
return tn;
}

Same for queues etc...

This said, you must understand that TBB is a Tasking paradigm and not a Threading paradigm.

Task: { begin, compute, exit }
TBB thread: { Task , Task, Task ....}

Typical thread: { begin, while(!Done) { WaitForSingleObject(...), doWork() } }

i.e the typical thread model incorporates blocking/suspension/waking inside the task

Whereas in Tasking model and blocking/suspension/waking between the task

Jim Dempsey

Bartlomiej · ‎06-13-2009

Quoting - Alexey Kukanov

Oh, I'd didn't know about this ETS ! Must have a look at it.
Thanks a lot !

Quoting - jimdempseyatthecove

Scissors, Paper, Rock,

There is no one solution to fit all circumstances. Because one feature is missing from one potential solution does not mean that solution should be rejected. You are free to use TLS in TBB.

Yes, but it would not only bring us out of TBB, but alsomake us use non-portable solutions. For example on GCC that I use, I'd have to use __thread instead of __declspec(thread); other compilers would have yet another syntax for it as there's no C++ standard for TLS.

As for threads vs tasks - yes, I know it "in theory", but possibly I sometimes get a bit confused, as I'm used to threads rather than tasks. But in this case it seems reasonable to have some data related to threads, so... um... inherited by subsequent tasks mapped to the same thread, isn't it so ?

PS. Wow ! The topic is now marked as hot ! I'd ever expect that and don't know if to feel proud or ashamed. ;-)

jimdempseyatthecove · ‎06-15-2009

Quoting - bartlomiej

Yes, but it would not only bring us out of TBB, but alsomake us use non-portable solutions. For example on GCC that I use, I'd have to use __thread instead of __declspec(thread); other compilers would have yet another syntax for it as there's no C++ standard for TLS.

As for threads vs tasks - yes, I know it "in theory", but possibly I sometimes get a bit confused, as I'm used to threads rather than tasks. But in this case it seems reasonable to have some data related to threads, so... um... inherited by subsequent tasks mapped to the same thread, isn't it so ?

PS. Wow ! The topic is now marked as hot ! I'd ever expect that and don't know if to feel proud or ashamed. ;-)

Does not every "practical" C++ compiler support a PreProcessor?

// comment/uncomment following line or place in MAKE or Project
#define USE__declspec
#ifdef USE_declspec
#define _THREAD __declspec(thread)
#else
#define _THREAD __thread
#endif

Another option is to create your own TLS. A generic technique that works on all platforms.

A technique that works well, which I will outline below, I've done it before and know it works, but I will not attest that the outline is correct... :~|

Assuming running TBB (you can adapte this to OpenMP) at beginning of the application initiate a task to be run by all threads of the thread pool. This task, being run early in the application will be run by each thread with its stack at the emptyest level. The task increments (atomic) a static counter to produce a cardnal number in the range of 0 to n-1. This number is use as an index into a static array. The current stack pointer (address of dummy local variable) is written into the static array. i.e. you have an array (diminisioned at .gt. maximum possible threads) of arbitrary stack pointers near emptiest of stack level for each thread. The threads, now knowing their thread number may also do initialization of private data. Then return from get cardnal thread number and init of TLS.

During run of other parts of program, to get TLS (in function) you first must get the thread number. To to this your get thread number function simply compares the address of a local variable to the addresses of the table of near empty stack values. The cell at index which produces nearest address above the address of the local variable is the cardnal thread number. This cardnal number can now be used to indexyour TLS structure (or pointers to your TLS structures).

This is not as fast as using the compiler's TLS technique, so your preprocessor code should be conditionalized to use the generic technique ONLY when the compiler technique is not available.

Jim Dempsey