Solved: Re: Need help understanding thread pool architecture

Steve_Nuchia · ‎06-22-2009

I'm willing to bet this has been answered many times in many forms but I couldn't find anything that helped me, neither in the documentation nor by searching this forum.

The TBB docs are written from the perspective of a single-threaded program entering parallelizable sections (possibly nested) and emerging from them again. There's language about the requirement that each thread entering a TBB parallel construct initialize a task_scheduler_init object, but nothing about what effect that has.

I've got a couple of situations that don't exactly fit the paradigm. Take the more general one: a library that may be called from a multithreaded program and wants to use TBB internally. We may be called from a thread with an existing task scheduler but from outside any TBB task, we may be called from inside a tbb task, and we may be called on a thread that's never heard of TBB before.

Further complicating matters, I'm working in Windows where all threads are not created equal. There's a fairly hideous matrix of things that have per-thread initialization and periodic maintenance obligations.

I know, use the source, Luke. What I'm hoping for here isn't so much an insight into the TBB mechanism as the phrase that whacks my head into alignment with the authors' heads.

Specific issues:

If two independent user threads call into a module that uses TBB internally, will the tasks created by the called entry points be scheduled against each other? If so, is there any direct way to influence how they are scheduled?

If there's any notion of worker thread initialization hooks, I didn't see it. Should there be? Is there an idiom for it?

We're considering implementing a structure where we wrap the tbb::parallel_foo templates with versions that pass their parameters from whatever user thread they were invoked on into a TBB thread pool. The task trees so created are meant to have arbitrarily overlapping lifetimes and no direct interaction with one another. What gotchas, if any, do I need to be looking out for?

thank you,
-swn

Alexey-Kukanov · ‎06-23-2009

Quoting - Steve Nuchia

Specific issues:
...

Some information related to your questions:

- I think I explained a few times in the forum how task_scheduler_init works, and that initializing TBB for a second time in a thread has low overhead. Thus the solution Peter suggested is what we recommend.

- in the next version of TBB, there will be support for automatic initialization. So you will not need to create task_scheduler_init on each call for the sake of threads that did not yet initialize TBB explicitly. Still, I would recommend keeping a global init object that covers the DLL lifetime, to ensure TBB worker threads remain alive.

- if two independent user threads (we call them "masters") use TBB concurrently, they will share the TBB workers. Whichever master publishes its tasks first will get the workers; but once a worker has completed the piece of work it stole earlier, it will look for another piece to steal, and the second master will be considered. The masters will most of the time work on their own tasks; but if the task pool becomes empty while stolen pieces of the job are not yet completed, a master will also go and steal, possibly from another master. There is no direct way to influence stealing.

- for hooks, look at task_scheduler_observer.

- I am not sure what you want to achieve with the above-mentioned wrappers over TBB parallel algorithms. Could you elaborate a little?
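To make the hook idea concrete, here is a minimal sketch in plain standard C++ rather than TBB itself. In TBB the observer's on_scheduler_entry and on_scheduler_exit callbacks play the role that on_entry and on_exit play here. WorkerHooks and run_workers are hypothetical names invented for this illustration, and the pool is just a vector of std::thread.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for observer callbacks; this is NOT TBB's API.
struct WorkerHooks {
    std::function<void()> on_entry;  // e.g. per-thread COM or TLS setup
    std::function<void()> on_exit;   // matching per-thread teardown
};

// Starts `count` workers, runs `body` on each one between the two hooks,
// joins them, and returns how many workers ran to completion.
inline unsigned run_workers(unsigned count, const WorkerHooks& hooks,
                            const std::function<void(unsigned)>& body) {
    assert(count > 0);
    std::atomic<unsigned> finished{0};
    std::vector<std::thread> workers;
    workers.reserve(count);
    for (unsigned i = 0; i < count; ++i) {
        workers.emplace_back([&hooks, &body, &finished, i] {
            if (hooks.on_entry) hooks.on_entry();  // runs on the worker thread
            body(i);
            if (hooks.on_exit) hooks.on_exit();    // runs on the worker thread
            ++finished;
        });
    }
    for (auto& t : workers) t.join();
    return finished.load();
}
```

The point of the sketch is only that the hooks execute on the worker thread itself, which is what per-thread initialization on Windows needs.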

pvonkaenel · ‎06-23-2009

I had similar questions about how to use task_scheduler_init in a DLL in this thread: http://software.intel.com/en-us/forums/showthread.php?t=65576. I ended up creating a task_scheduler_init instance in DllMain() on process attach and terminating it on process detach. Then, in each DLL function that uses TBB, I create a local task_scheduler_init instance (which is destroyed automatically at the end of the call) in case a background thread is the caller. This should be a very cheap call.
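Why the local instance is cheap can be pictured with a toy reference-counting model. This is a sketch under the assumption that the scheduler behaves like a counted resource; it is not TBB source, the model is process-wide rather than per-thread, and PoolGuard and library_call are hypothetical names.

```cpp
#include <cassert>
#include <mutex>

// Toy model (an assumption, not TBB code): the pool starts with the first
// live guard and stops when the last live guard is destroyed.
class PoolGuard {
public:
    PoolGuard() {
        std::lock_guard<std::mutex> lock(guard_mutex());
        if (live()++ == 0) ++starts();   // expensive path: create workers
    }
    ~PoolGuard() {
        std::lock_guard<std::mutex> lock(guard_mutex());
        assert(live() > 0);
        if (--live() == 0) ++stops();    // expensive path: tear workers down
    }
    PoolGuard(const PoolGuard&) = delete;
    PoolGuard& operator=(const PoolGuard&) = delete;

    static int start_count() {
        std::lock_guard<std::mutex> lock(guard_mutex());
        return starts();
    }
    static int stop_count() {
        std::lock_guard<std::mutex> lock(guard_mutex());
        return stops();
    }

private:
    static std::mutex& guard_mutex() { static std::mutex m; return m; }
    static int& live()   { static int n = 0; return n; }
    static int& starts() { static int n = 0; return n; }
    static int& stops()  { static int n = 0; return n; }
};

// Hypothetical library entry point: the local guard only bumps a counter
// when a longer-lived guard already exists.
inline int library_call(int x) {
    PoolGuard local;
    return x * 2;
}
```

With a long-lived guard in place, repeated calls to library_call never hit the expensive start or stop path; without one, every call does.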

I have no idea how to control the scheduling of tasks dispatched from different threads that may be running concurrently. Considering a 4-core machine, the first task_scheduler_init will create 3 worker threads. If the main thread and a background thread each dispatch a block of tasks, then they will fight for the 3 worker threads, probably based on who dispatched first, but the main/bg threads will still have their own independent thread priorities. So I guess you have fractional control based on the dispatcher thread's priority.

Peter

Steve_Nuchia · ‎06-23-2009

Thank you, that's a big help. Now I'm reading up on all the restrictions on what you can do in DllMain and it's pretty terrifying. Can you point me to an example or pattern that "threads" the needle? (ha ha).

Steve_Nuchia · ‎06-23-2009

Quoting - Alexey Kukanov (Intel)

- I am not sure what you want to achieve with the above-mentioned wrappers over TBB parallel algorithms. Could you elaborate a little?

Very helpful post, summarizing what I'd gleaned elsewhere and filling in some gaps. Thank you!

In Windows, as is probably true in most GUI frameworks, all threads are not created equal. What I'm trying to achieve is, generically, segregation of work that a "master" can or must do from work that can or should be done by workers.

Specifically: the master must continue to "pump messages" or the world stops working, if the master happens to be the main thread of the application. Also, the RPC mechanisms underlying COM and its successors work only if you've gone through the proper initialization rituals on the thread making the call.

Having the master act as foreman, sharing the tasks with the workers creates a lot of constraint and requirement conflicts. Keeping them separate is one approach to resolving those conflicts. Others are (using the "hook" concept) ensuring that all workers are qualified to use all the APIs and dynamically detecting whether we're on the master or an ordinary worker thread and somehow "doing the right thing" inside (every!) task's operator() function.
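The "dynamically detecting" option can be sketched with a thread-local flag. This is plain standard C++ with hypothetical names (on_pool_worker, Route, route_restricted_call); it assumes the flag is set once on each worker, for example from a worker-start hook, and says nothing about how TBB itself distinguishes masters from workers.

```cpp
#include <cassert>
#include <thread>

// Hypothetical per-thread flag: set to true once on each pool worker.
// Every other thread, including the master, sees the default (false).
inline bool& on_pool_worker() {
    thread_local bool flag = false;
    return flag;
}

enum class Route { RunInline, ForwardToMaster };

// Decides how to run a call that is only legal on a fully initialized thread.
inline Route route_restricted_call() {
    return on_pool_worker() ? Route::ForwardToMaster : Route::RunInline;
}

// Demonstration helper: marks a fresh thread as a worker and reports the
// routing decision made on that thread.
inline Route route_from_new_worker() {
    Route result = Route::RunInline;
    std::thread worker([&result] {
        on_pool_worker() = true;
        result = route_restricted_call();
    });
    worker.join();
    return result;
}
```

The cost of this approach is the one noted above: every task body has to consult the flag before touching a restricted API.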

Isn't legacy programming fun?

Steve_Nuchia · ‎06-23-2009

Also, I'm still looking for a pattern that will allow code resident in a DLL to maintain a thread pool over its lifetime and safely clean up when the DLL is unloaded, regardless of which mechanism(s) are used by the host process to load and unload the library. According to Microsoft's own documentation this is intractable in general so I guess my expectations are inherently limited here.

pvonkaenel · ‎06-23-2009

Quoting - Steve Nuchia

Thank you, that's a big help. Now I'm reading up on all the restrictions on what you can do in DllMain and it's pretty terrifying. Can you point me to an example or pattern that "threads" the needle? (ha ha).

I'm using a DllMain that looks like the following. Note that you can probably skip the ippStaticInit() call unless you're statically linking with the IPP library.

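[cpp]
#include <windows.h>
#include "ipp.h"                      // only needed for ippStaticInit()
#include "tbb/task_scheduler_init.h"

// Deferred: construct the object now, start the scheduler later in DllMain.
tbb::task_scheduler_init g_tbbinit(tbb::task_scheduler_init::deferred);

BOOL APIENTRY DllMain( HMODULE /*hModule*/,
                       DWORD  ul_reason_for_call,
                       LPVOID /*lpReserved*/ )
{
    switch (ul_reason_for_call) {
        case DLL_PROCESS_ATTACH:
            ippStaticInit();
            g_tbbinit.initialize();   // keeps the TBB workers alive for the DLL lifetime
            break;
        case DLL_THREAD_ATTACH:
        case DLL_THREAD_DETACH:
            break;
        case DLL_PROCESS_DETACH:
            g_tbbinit.terminate();    // release the scheduler when the DLL goes away
            break;
    }
    return TRUE;
}
[/cpp]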

Alexey-Kukanov · ‎06-23-2009

Freeing the main application thread to do message pumping etc., and delegating all the heavy work to separate thread(s) that could in turn utilize TBB algorithms or whatever else - this makes perfect sense to me. If you just meant that, I have no further questions :)

Quoting - Steve Nuchia

Also, I'm still looking for a pattern that will allow code resident in a DLL to maintain a thread pool over its lifetime and safely clean up when the DLL is unloaded, regardless of which mechanism(s) are used by the host process to load and unload the library. According to Microsoft's own documentation this is intractable in general so I guess my expectations are inherently limited here.

Right. And, as Peter's experience with dynamic loading and unloading of a TBB-dependent DLL suggests, we have some problems with correct thread shutdown in this scenario. I have heard an opinion (supported by reference to an MS KB article, which I unfortunately lost) that the safest way to do such cleanup on Windows is to signal worker threads that they should complete the work, release all resources etc., and park themselves in e.g. an infinite loop; and after they signal back their completion, just kill them. This is not yet implemented in TBB, though we might eventually get there if nothing else works.
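The handshake described above (ask workers to finish, let them release resources and park, wait for every acknowledgement) can be sketched in portable C++. This is an illustration only: ParkingPool is a hypothetical class, and because standard C++ has no way to kill a thread, the final "kill" step is replaced by releasing the parked threads and joining them.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

class ParkingPool {
public:
    explicit ParkingPool(unsigned count) {
        assert(count > 0);
        for (unsigned i = 0; i < count; ++i)
            workers_.emplace_back([this] { worker_loop(); });
    }

    // Step 1: ask every worker to finish. Step 2: block until each one has
    // acknowledged and parked. Returns the number of acknowledgements.
    std::size_t request_stop_and_wait() {
        std::unique_lock<std::mutex> lock(m_);
        stop_ = true;
        cv_.notify_all();
        cv_.wait(lock, [this] { return parked_ == workers_.size(); });
        return parked_;
    }

    // Portable stand-in for the final step: release parked threads and join.
    ~ParkingPool() {
        {
            std::lock_guard<std::mutex> lock(m_);
            stop_ = true;
            release_ = true;
        }
        cv_.notify_all();
        for (auto& t : workers_) t.join();
    }

private:
    void worker_loop() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return stop_; });     // "working" until told to stop
        // A real worker would release its per-thread resources here.
        ++parked_;                                    // signal back: safe to remove me
        cv_.notify_all();
        cv_.wait(lock, [this] { return release_; });  // parked, holding nothing
    }

    std::mutex m_;
    std::condition_variable cv_;
    bool stop_ = false;
    bool release_ = false;
    std::size_t parked_ = 0;
    std::vector<std::thread> workers_;
};
```

Once request_stop_and_wait returns, every worker is known to be idle and holding nothing, which is the state in which forcibly removing it would be least harmful.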

Steve_Nuchia · ‎06-23-2009

Quoting - Alexey Kukanov (Intel)

I have heard an opinion (supported by reference to an MS KB article, which I unfortunately lost) that the safest way to do such cleanup on Windows is to signal worker threads that they should complete the work, release all resources etc., and park themselves in e.g. an infinite loop; and after they signal back their completion, just kill them. This is not yet implemented in TBB, though we might eventually get there if nothing else works.

The document that lays that out can be downloaded from http://www.microsoft.com/whdc/driver/kernel/DLL_bestprac.mspx
The relevant section is on page 7. Well, it's pretty much all relevant in the piecemeal Microsoft documentation tradition, but page seven is the part you've lost track of.

pvonkaenel · ‎06-23-2009

Quoting - Steve Nuchia

Also, I'm still looking for a pattern that will allow code resident in a DLL to maintain a thread pool over its lifetime and safely clean up when the DLL is unloaded, regardless of which mechanism(s) are used by the host process to load and unload the library. According to Microsoft's own documentation this is intractable in general so I guess my expectations are inherently limited here.

If you want to dynamically LoadLibrary()/FreeLibrary() on the DLL that uses TBB, I bumped into a deadlock case which I was able to fix by modifying the Arena::terminate_workers() method in tasks.cpp (commercial-aligned open source version of TBB). Look for the call to WaitForSingleObject(), replace INFINITE with some timeout (I use 300 ms), and check for the timeout case.

[cpp]
            // Wait a bounded time for the worker thread to exit.
            DWORD status = WaitForSingleObject( w.thread_handle, 300 );
            if( status==WAIT_FAILED ) {
                fprintf(stderr,"Arena::terminate_workers: WaitForSingleObject failed\n");
                exit(1);
            } else if ( WAIT_TIMEOUT == status ) {
                // The worker did not exit in time: terminate it forcibly.
                TerminateThread(w.thread_handle, -1);
            }
[/cpp]
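The same bounded-wait idea, shown in portable C++ for readers without the TBB source at hand. This is only an analogy to the patch above: a std::future stands in for the thread handle, and ExitStatus and wait_for_exit are hypothetical names. What to do on a timeout (the patch terminates the thread) is left to the caller.

```cpp
#include <cassert>
#include <chrono>
#include <future>

enum class ExitStatus { Exited, TimedOut };

// Waits up to `timeout` for a worker's exit signal instead of waiting forever.
// Any status other than "ready" is reported as a timeout.
inline ExitStatus wait_for_exit(const std::future<void>& exited,
                                std::chrono::milliseconds timeout) {
    assert(exited.valid());
    return exited.wait_for(timeout) == std::future_status::ready
               ? ExitStatus::Exited
               : ExitStatus::TimedOut;
}
```

A caller would pair this with a std::promise<void> that the worker fulfils as its last action, and pick its own fallback for the TimedOut case.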

Steve_Nuchia · ‎06-23-2009

Quoting - pvonkaenel

If you want to dynamically LoadLibrary()/FreeLibrary() on the DLL that uses TBB, I bumped into a deadlock case

Thank you, that's very helpful too. So far I'm using the precompiled binaries but if I have to I'll build from source and incorporate your suggested workaround.

It's not that I "want to" dynamically load/unload anything. I'm shipping (among other things) a library that may be called from other libraries that may be loaded dynamically. It's out of my hands.

Where I ran into the deadlock was with a call to the registerserver entry point leading to destruction of an initialized TBB pool from DllMain. I could work around that particular case but it seems to be the tip of an iceberg.

pvonkaenel · ‎06-24-2009

The deadlock you saw may be the same as the one I bumped into. Do you see it when the final task_scheduler_init is destructed, or when you explicitly call task_scheduler_init::terminate()? If so, then that may be what I saw.

I was also a little nervous about switching from the commercial release to the open source one, but was reassured that the commercial-aligned open source release was the same as the pre-built binaries. So far I have not seen any problems other than the fixable deadlock case.

Peter

jimdempseyatthecove · ‎06-24-2009

Steve,

I haven't attempted this, but the outline of a potential solution would be:

In the DLL, in the place where you would instantiate a TBB thread pool, spawn a new thread to create the TBB thread pool. This new thread (given context information) can know the original calling thread to the DLL, and know if/when that thread terminates (normally or abnormally). The original call to the DLL returns to the caller and the caller continues independent of the TBB thread pool created on its behalf for future use. Now on subsequent DLL calls by the app using the DLL, should the call require usage of TBB, the TBB context is available for use. When the user call to the DLL returns, the TBB context (threads within that context) for that caller is placed in a sleep mode with the control thread waiting on WaitForSingleObject. When the next TBB function call comes in to the DLL, it issues a SetEvent to wake up the thread controlling the TBB pool threads (the calling thread could be put to work by the TBB control thread should you want to expend the programming effort).

Understand that there will be multiple TBB pools (one for each process using the TBB feature of your DLL).

The net effect of this additional thread is to become a Daemon between the Process and the TBB pool for the process within your DLL.
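A rough portable sketch of such a control thread follows. It is an illustration under several assumptions: a std::condition_variable stands in for the Win32 event (WaitForSingleObject/SetEvent), jobs are plain callables returning int, and ControlThread is a hypothetical class that does not touch TBB at all. In a real DLL the job body is where the TBB algorithms would run.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <future>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical control ("daemon") thread: sleeps between calls, wakes when a
// user thread submits work, and runs that work on its own thread.
class ControlThread {
public:
    ControlThread() : thread_([this] { run(); }) {}

    ~ControlThread() {
        {
            std::lock_guard<std::mutex> lock(m_);
            quit_ = true;
        }
        wake_.notify_one();
        thread_.join();
    }

    // Called from any user thread: queue the job, wake the control thread,
    // and hand back a future for the result.
    std::future<int> submit(std::function<int()> job) {
        std::packaged_task<int()> task(std::move(job));
        std::future<int> result = task.get_future();
        {
            std::lock_guard<std::mutex> lock(m_);
            jobs_.push(std::move(task));
        }
        wake_.notify_one();   // plays the role of SetEvent
        return result;
    }

private:
    void run() {
        std::unique_lock<std::mutex> lock(m_);
        for (;;) {
            // Plays the role of WaitForSingleObject on the wake-up event.
            wake_.wait(lock, [this] { return quit_ || !jobs_.empty(); });
            if (jobs_.empty()) return;   // quit requested and nothing pending
            std::packaged_task<int()> task = std::move(jobs_.front());
            jobs_.pop();
            lock.unlock();
            task();                      // run the job outside the lock
            lock.lock();
        }
    }

    std::mutex m_;
    std::condition_variable wake_;
    std::queue<std::packaged_task<int()>> jobs_;
    bool quit_ = false;
    std::thread thread_;  // declared last so the members above exist first
};
```

The destructor drains any queued jobs before the thread exits, so callers holding futures are not left with broken promises.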

Jim Dempsey

Steve_Nuchia · ‎06-24-2009

Jim,

that's the direction I'm heading. It doesn't really solve the termination problem, it just offsets it by one thread. But the "proxy master" pattern I'm working up as we speak addresses the housekeeping (message pumping) problem nicely and, for well-disciplined client processes, can make the termination problem go away.
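The housekeeping half of that pattern might look like the sketch below: the user thread keeps running its own loop until the proxy's result is ready. pump_until_ready and the pump_once callback are hypothetical names; on Windows pump_once would be one pass of the message loop. The sketch assumes the future comes from a promise or packaged task, since a deferred future would never become ready here.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <future>

// Keeps the calling thread's housekeeping going until `result` is ready.
// `pump_once` stands in for one pass of the caller's message loop.
template <typename T>
T pump_until_ready(std::future<T>& result,
                   const std::function<void()>& pump_once) {
    assert(result.valid());
    while (result.wait_for(std::chrono::milliseconds(1)) !=
           std::future_status::ready) {
        pump_once();
    }
    return result.get();
}
```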

jimdempseyatthecove · ‎06-24-2009

Steve,

I gave my suggestion a rethink while driving into my morning coffee meeting. The suggestion I gave might not work as well as I first thought.

The termination problem could be handled by a single independent process launched once by the DLL used to monitor for ab-end of applications using the DLL. (a small complication)

The problem, as I hypothesize, is that an application TBB thread pool might require a/some static structures. If/When TBB is used totally within a DLL .and. if you have this/those static structures within the DLL then you will have problems with concurrent processes sharing the DLL (since there will be one instance of those static structures within the DLL). The solution in this case (assuming you want multiple processes to concurrently share your DLL) might be to require a small static stub in the application (process) which links to your DLL. This stub is a process-resident TBB context to be managed by the Daemon discussed in the earlier post. This may make programming of the TBB calls within the DLL a bit contrived (callerContext->parallel_for(...) or something like that). You would have to give this some thought as to how to re-use/adapt the current templates for this purpose. Potentially a template shell could be used.

This is one reason why a static library might be preferred over DLL. (or hybrid of static + DLL where the static portion contains the TBB scheduler and TBB dispatching calls within the DLL are made using callbacks into the application/process).

Jim Dempsey

Steve_Nuchia · ‎06-24-2009

Quoting - jimdempseyatthecove

The termination problem could be handled by a single independent process launched once by the DLL used to monitor for ab-end of applications using the DLL. (a small complication)

OK, it's pretty clear at this point we have divergent vocabulary, if not divergent assumptions. My situation is a Windows-only shrink-wrapped software product with both a primary executable and user access to DLLs.

In the Windows world, unless you go to exceptional lengths, each process that uses a DLL does so in a way that is independent of all the others. The system tries to avoid using different relocations of the code for different processes so it can share the pages under the hood but the instance of the code in each process is unaware of the instances that may be running in other processes.

Monitoring for ab-end (again, vocabulary from a different world) would be more than a small complication. It would require writing what amounts to an automated debugger capable of hooking into any 32- or 64-bit process, determining what each thread in that process is meant to be doing, and detecting when it is deadlocked. It's not actually as hard as it sounds; one could probably script an existing debugging engine to first check whether a DLL named tbb.dll was loaded into the process's address space (and maybe whether it has the right entry points exposed to be a version of the tbb.dll we care about), then walk the stacks of all the threads and see if any of them have frames with IP values in that DLL, then ... well, it wouldn't be easy.

Complicating matters is the fact that I'm from your world, or at least a world in the same system. I'm by no means an expert on Windows systems programming. I rely on my colleagues here with a lot more time-in-grade than I have with these issues. Collectively we have a fair amount of experience with polymorphic multithreading (where each thread has a distinct role in the app) under Windows. I've got experience with threading for performance, though my background is mainly in hard realtime systems.

To be completely specific, the termination problem as I now understand it is this: If a library embodied in a DLL obtains a reference to the TBB thread pool and holds it into process shutdown, that reference gets released during the shutdown of the intermediate DLL. That operation happens under a system mutex known as the loader lock. If the reference release causes the thread pool to shut down, termination of TBB's worker threads causes the system to attempt to deliver notifications to all still-loaded DLLs. That notification attempts to obtain the loader lock on the worker thread, deadlocking the process.

This doesn't happen if you simply instantiate a task_scheduler_init object as a local in main(): it is destructed before main returns to the trampoline code so the thread pool is shut down before the process teardown sequence begins.

And you wondered why windows apps are so slow :-)

The only option I can see that will make this work as designed when called from a body of code that may be dynamically loaded is to require the caller to explicitly manage the lifetime of any task_scheduler_init objects used by the intermediate library (the code that uses tbb directly). That can't be forced for arbitrary DLLs but it matches the design assumptions behind (in-process) COM server DLLs nicely.
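One shape such a caller-managed lifetime could take is an explicit initialize/shutdown pair exported by the intermediate library, so nothing is created or destroyed from DllMain. The sketch below is hypothetical throughout: mylib, Pool, initialize, shutdown and sum are invented names, and Pool merely stands in for whatever scheduler object the library would hold.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <numeric>
#include <vector>

namespace mylib {  // hypothetical library

struct Pool {};    // stands in for the scheduler object the library holds

inline std::mutex& state_mutex() { static std::mutex m; return m; }
inline std::unique_ptr<Pool>& pool() { static std::unique_ptr<Pool> p; return p; }

// The caller invokes this once, outside DllMain, before using the library.
inline bool initialize() {
    std::lock_guard<std::mutex> lock(state_mutex());
    if (pool()) return false;        // already initialized
    pool().reset(new Pool);
    return true;
}

// The caller invokes this before unloading, while no loader lock is held.
inline bool shutdown() {
    std::lock_guard<std::mutex> lock(state_mutex());
    if (!pool()) return false;       // nothing to shut down
    pool().reset();
    return true;
}

// Example entry point: refuses to run unless the caller initialized first.
inline bool sum(const std::vector<int>& values, long& out) {
    std::lock_guard<std::mutex> lock(state_mutex());
    if (!pool()) return false;
    out = std::accumulate(values.begin(), values.end(), 0L);
    return true;
}

}  // namespace mylib
```

Holding the mutex for the whole of sum keeps the sketch short but serializes callers; a real library would track in-flight calls instead.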

The assumption underlying your post, that the thread pool is shared by multiple processes, would be ideal from a performance point of view but very difficult to implement. The address space in which a thread is running is not something you can just switch. In an operating system where you could divorce the scheduling aspects of a thread from the address space binding you could do that. I'm imagining something along the lines of the "fibers" concept but one level up in the implementation.

But doing so would create some security issues; you wouldn't want to do that except in an embedded kind of environment. I'm thinking primarily of liveness issues and covert channels but I haven't tried to exhaust the possible problems. It's hypothetical anyway for me: Windows isn't such a system.

The other possibility is to go back to the suggestion by pvonkaenel and use a timeout to break the deadlock when it happens. I find that distasteful but if it becomes the last problem I'm facing it's certainly better than requiring the user to terminate the process forcefully.

pvonkaenel · ‎06-24-2009

Quoting - Steve Nuchia

The other possibility is to go back to the suggestion by pvonkaenel and use a timeout to break the deadlock when it happens. I find that distasteful but if it becomes the last problem I'm facing it's certainly better than requiring the user to terminate the process forcefully.

I'm glad I'm not the only one who finds it distasteful, however, I dislike WaitForSingleObject(INFINITE) even more - there will be a case where infinite happens. I'm afraid the real solution is stated in one of the TBB source code comments just after my patch block:

FIXME: each scheduler should plug-and-drain its own mailbox when it terminates.

I think if that happens, then the WaitForSingleObject is no longer necessary, and the problem goes away. It looks like the development (less stable) version of TBB already has that implemented, but due to the large number of changes between the commercial-aligned and development versions, I opted to stick with the commercial-aligned one and add my own force-terminate patch.

I'd be much happier with a different termination fix if you have one.

Peter

jimdempseyatthecove · ‎06-24-2009

The assumption underlying my post is that each process using the DLL has (or intends to have) an independent TBB thread pool.

The bug-a-boo (technical term) is (I believe) TBB has a static context. If the DLL contains this static context, then there can be only one instance of this static context (barring some quirky VM page manipulation) and therefore one thread pool and then therefore one app using the DLL. (not good, not what you want).

The trick then is how do you make a DLL that resides at one VM address within all VMs that share the DLL and have different static TBB structures. (Note, a DLL can be PIC and need not reside at a fixed VM offset)

One way is to identify the location of these structures in the DLL and clone them to a new page of the application overlaying this/these addresses. The O/S might frown on you attempting to do this but you might be able to do this with a device driver.

A second way, which I think is better, is to require all applications using your DLL to have a data block located at a fixed virtual address. Within this data block are the static TBB structures. Now the DLL points to fixed addresses within your application (not in the DLL). This data block need not be linked into the application. A VirtualAlloc, in an initialization call from the app to the DLL, could do this. The only requirement of the app then is to not have the load image (or intervening allocations) extend into this reserved space, and for the VirtualAlloc to coordinate things with the heap manager such that it does not think those addresses are unused/available.

Since all the TBB context information is in the app (either on stack or in funky data block) then the app can crash without taking out the DLL. Now your focus of problem area is narrowed to the app (assuming your DLL is bug free).

Jim Dempsey

RafSchietekat · ‎06-25-2009

"The bug-a-boo (technical term) is (I believe) TBB has a static context. If the DLL contains this static context, then there can be only one instance of this static context (barring some quirky VM page manipulation) and therefore one thread pool and then therefore one app using the DLL. (not good, not what you want)."
This seems to be at odds with Steve Nuchia's (and my) understanding?

Alexey-Kukanov · ‎06-25-2009

Quoting - jimdempseyatthecove

The bug-a-boo (technical term) is (I believe) TBB has a static context. If the DLL contains this static context, then there can be only one instance of this static context (barring some quirky VM page manipulation) and therefore one thread pool and then therefore one app using the DLL. (not good, not what you want).

The trick then is how do you make a DLL that resides at one VM address within all VMs that share the DLL and have different static TBB structures. (Note, a DLL can be PIC and need not reside at a fixed VM offset)

As far as I understand, the read-only sections (e.g. code) in the DLL may be loaded into real memory just once and mapped into an arbitrary number of processes. But writeable sections are mapped separately into each process that uses a DLL. Thus any static context in a DLL is never shared between different processes (applications) using that DLL.