Is it possible to combine Intel TBB and fibres (lightweight, cooperatively scheduler threads) that use e.g., ucontext.h (getcontext, setcontext and swapcontext) ?
My situation: a running task A makes a call that is unable to return immediately. So I use the same thread to process further tasks, for example task B that also makes the same call as task A. Yet another task (task C) gets started. At that point, the call made by A gets to a state where it could continue, but not the call made by B. But since the situation on the stack is "A,B,C", I cannot let A continue until B and C finish.
If I put A, B and C into separate fibres, I could start A on a different thread or at least start A right after C finishes, bypassing B. However, I suspect something will go horribly wrong inside the TBB task scheduler if I do it like this. Has anyone tested a thing like this before? Or can someone tell even without trying that it will indeed all explode?
Generally it is bad practice to mix threading models (unless the threading models are specifically coded to cooperate). TBB is "generally" configured for compute-only tasks. Please note that should your application have a set of tasks that continuously run, but are mostly blocked for I/O (or message), then you can oversubscribe the TBB thread pool by those numbers of mostly blocked tasks. This should attain what you want. An alternative is for the "mostly blocked for I/O" tasks is to use non-TBB threads.
I am aware of the fact that what I am trying to achieve is not the way TBB was meant to be used. I should probably create a scheduler of my own to deal with it, but creating one that is at least close to efficiency of the one in TBB would take a lot of time. If I can't find a way to use TBB, I will do it - I can spend the time but I would obviously prefer not to.
I don't want to create an oversized TBB thread pool at the start, since I don't really know the number of threads needed and it probably wouldn't be very efficient. But your post gave me a different idea. I have full control of those "blocking" calls (they are not really blocking calls, they just have to wait for some external event), so when A makes the call, I could create a new thread, add it to the TBB thread pool and use it to run B. The original thread would be suspended by something like a semaphore. When B blocks, I can create another thread and run C. Once A is able to continue, I can resume A's thread using the semaphore. If I wait for C's thread to finish, there won't even be any oversubscription.
It is not ideal, since I will have to create and destroy a significant number of threads, but at least this solution is compatible with TBB (even though the idea is not really consistent with the design of TBB). I just hope that the large number of inactive threads won't degrade the performance of the task stealing too much by creating a lot of "false targets", i.e., threads that have a TBB scheduler but never have tasks that can be stolen.
How about boost asio for the blocking stuff and TBB for the compute stuff? No need to go about juggling threads yourself, much less create and destroy them all the time.
Perhaps it could be useful to explore the use of fibers inside TBB, though. When tasks are blocked waiting for other tasks, they may now get buried underneath stolen tasks (until a certain stack depth is reached and the thread is blocked entirely). Perhaps this could be avoided by switching to another fiber when stealing?
The problem is that I am building a library that runs user's code (something similar to parallel_for_each). The user's code (the functor) is the one that makes the blocking call. For my code, I already have a working solution (similar to Boost.ASIO, but with TBB integration) for blocking operations, but it requires the task to be broken into two pieces. And I don't want to force this onto the library's user.
Here is a suggestion that I made some time back (don't have the link). Nobody on this forum seemed to understand why you would want to do what I suggest (it may be my fault for not presenting a clear case). I think your problem will result in you understanding the technique. Sketch:
Start TBB thread pool with nO threads overscribed.
Immediately schedule nO tasks (independent of the initiating (main) thread). At the start of this task the thread marks in TLS or some other means a flag that indicates it is one of the nO tasks. These tasks wait for an event before completing. Essentially you park the oversubscribed threads until you need them. Note, these threads belong to the TBB thread pool and this eliminates a need to create a TBB thread and destroy a TBB thread.
Your "(something similar to parallel_for_each)" picks off items and schedules a task. Instead of directly scheduling the functor you schedule a shell function that calls the functor. When the functor returns to the shell function, the shell function checks to see if it is an nO thread. If it is, it waits for the event you setup to pend nO task.
The only requirements that you impose uppon your users is a) either call your surrogate functions to read/write, or b) insert immediately in front of a suspending function call (read/write/...) a call to your event signal for waiting nO threads. (your surrogate functions perform this event signal)
This scheme isn't perfect as you will have some periods of oversubscription between the time the I/O/message completes in the non-nO thread and the time to completion of the additional task run by the awoken nO thread.
The benefit of the scheme is it is relatively easy to implement.
That sounds similar to what I suggested in my second post, except the TBB threads are not create and destroyed, but instead kept in a special thread sub-pool. I think these could be combined and the sub-pool (the nO threads) could be created as in my original post and then kept in the pool as you suggest.
I actually use something similar in my blocking IO solution, but the threads I use are not managed by TBB, which has a few advantages and some major disadvantages.
Still, I'm a bit worried about having many extra threads in the thread pool. I think I'll have to make a few experiments to see if it has any measurable impact on the task scheduler's performance. But if it is good enough, I'll probably use this solution and possibly even migrate my IO stuff to TBB threads as well. It works quite well, but there is one nasty hack that I'm not at all happy about - executing a task manually, not by a TBB thread.
The good thing about the suggestion is it should be relatively easy to setup a test configuration (assuming the remainder of the application is working to the point of producing representative loads).
One thought came to me as a negative. A drawback is the stack requirements of the nO threads will be that of the other TBB threads. This may limit the amount of over-subscription when on 32-bit system when application requires large stack/TBB thread.
The advantage of the system (assumption on my part) is this should avoid the possibility of infinite/high level recursion (because the thread does no recurse it actually blocks on event/condition variable). The "tasking" overhead is the cost of "Set Event"/"signal condition variable" + "WaitForSingleObject"/"wait on condition variable". This amount of overhead may be a good tradeoff for the protection against unruly recursion levels (to the point of crash).
Unless a better idea comes up soon, I will certainly try it and do a few benchmarks. And I work only in 64-bits environment, so the stack size hopefully won't be a problem.
What is it in the task A that the thread will be waiting for?
From your description it sounds as if it is a memory variable to be set (or I/O device flag) to indicate data ready/buffer ready. While flag not set you intend to task steal. A problem introduced by this is a potential for run away recursion.
a) all of
b) some of
the task behave in this manner? (wait for flag...)
If some of, then consider placing an atomic one-shot flag in the task that behaves like a "#pragma omp single". IOW first thread to reach there gets to execute the task, others bypass. Then in the tasks that wait, scan the task list for enqueued tasks of class that won't wait and hasn't reached the "single". And call the task directly (attempting to enter the single section). This may be lower overhead. It will require your users to identify tasks with potential for stall verses those that are compute-only. I think this will have the least overhead and will not require oversubscription. It does require that the the tasks with potential to stall to not actually stall but rather sit in a spin-wait like state (where they can now compete for the "single"-like compute tasks.
Task A is for a lot of complicated stuff to happen, but an important fact is that there is a task A1, that runs once the A can be executed, so I can use any kind of synchronization to un-suspend A. I also have full controll of the functions that the task A can use to block itself, but I want to minimize the impact my library has on the conde of task A itself.
Unfortunatelly, most task that do real work (i.e., run for long time and do major computation) can block this way. I will have some statistics as to how likely a certain type of task is to block (that is required for other aspects of the library), but this may be almost (or exactly) 100%.
What about UMS (User Mode Scheduling) ?
"User-mode scheduling (UMS) is a lightweight mechanism that applications can use to schedule their own threads. An application can switch between UMS threads in user mode without involving the system scheduler and regain control of the processor if a UMS thread blocks in the kernel. "
This might be a flag in task_scheduler_init.
If the majority of the tasks block (written by users of the library you supply and out of your control), then TBB is likely not the choice for you (unless you can accept a model built on oversubscription). Oversubscription can be mitigated somewhat by using the nO technique describe earlier (or simply let the threads fend for themselves via the O/S scheduler).
UMS looks nice, but I need to support Linux systems as well and from what I could find it seems that such thing is not available. But the fibres would probably be enough for me - the problem is that I don't think they would mix well with TBB.
But I'd really prefer not having to abandon TBB, since I think even when not working the optimal way (e.g., with oversubscription), the TBB scheduler should work better than something that I can implement in reasonable time. If this turns out not to be true, I can in fact spend the time, but I would still much prefer not to. Also, having the library built on top of TBB has other advantages, like allowing the user to use TBB parallel algorithms or even create tasks manually.
After some experimentation (going as low as the TBB source codes) it seems to me that I won't be able to use my proposed solution (dynamically adding extra worker threads to the TBB thread pool when needed), because those new threads end up in a different task arena, so they cannot steal tasks from the blocked threads. Seems like I will either have to abandon TBB or settle with a fixed size thread pool.
I've also come across some code that clearly shows that fibres won't work. Inside the task scheduler, there is a small function called can_steal(), which checks the current position of the stack to see if it is getting full. Having multiple stacks per thread would break this.
The question here brings to mind the C++ "await" proposal that has been discussed. Microsoft has done some interesting things here in conjunction with the PPL.