I'm adding a plugin to an existing app which will create a thread for every core. The plugin is multi-threaded and can be called by any of the app threads at any time. The plugin is sifting through a large dataset using multiple threads and, based on what is found, performing a series of tasks. I am told that performance will benefit if these tasks can be performed in parallel, and I suspect that it is best for a thread to concentrate on a specific task type to avoid thrashing memory too much. So, assuming my approach is OK (if it isn't, please let me know), my question is this:
Am I better off stashing the tasks in task-specific queues, then using parallel_for or some other mechanism to process them after the dataset has been sifted and all tasks identified?
Or, alternatively, should I maintain a thread pool for each task type and submit each task to the appropriate pool as it is discovered?
Or, something else :>)
The machines in question have 8 x86 cores and there are 5 task types. Each task must fetch files from the OS and perform some calculations. Obviously mileage will vary based on task type, core performance, OS, etc., but I'm hoping there are some rules of thumb which might help without getting too deeply into premature optimization.
Try not to think in terms of threads, but in terms of tasks and dependencies, or, better yet, existing algorithms; then let TBB figure out how to schedule them on whatever parallelism is available at any time.
Typically you should keep the data local to a core, not the type of calculation: tbb::parallel_pipeline will try to do that for you, essentially moving the stages across the data instead of moving the data through the stages.
Try not to fetch from a file inside a task, because TBB won't know when the task is blocked waiting for data instead of crunching it. You might have a user thread to read the data and feed it into a concurrent_queue that will be the source for the initial stage in the pipeline. Those issues probably trump any considerations of keeping data local to a core between input and processing, as long as the CPU-bound processing is still allowed to favour data locality.
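As a rough illustration of that split, here is a minimal sketch using only the C++ standard library rather than TBB itself. BlockingQueue is a hypothetical stand-in for a TBB concurrent queue, load_block() is a made-up placeholder for the blocking file read, and sum_blocks() is an invented driver; none of these names come from TBB. The point is only that the one thread allowed to block is the reader, while the workers do nothing but compute.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <numeric>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a concurrent queue (not a TBB class).
template <typename T>
class BlockingQueue {
public:
    void push(T item) {
        { std::lock_guard<std::mutex> lock(mutex_); items_.push(std::move(item)); }
        ready_.notify_one();
    }
    // Signals that no further items will be pushed.
    void close() {
        { std::lock_guard<std::mutex> lock(mutex_); closed_ = true; }
        ready_.notify_all();
    }
    // Blocks until an item is available; returns false once closed and drained.
    bool pop(T& out) {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !items_.empty() || closed_; });
        if (items_.empty()) return false;
        out = std::move(items_.front());
        items_.pop();
        return true;
    }
private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<T> items_;
    bool closed_ = false;
};

using Block = std::vector<int>;

// Placeholder for a blocking file read: yields four copies of id.
Block load_block(int id) { return Block(4, id); }

// One reader thread does all the "I/O"; the workers only crunch data.
long long sum_blocks(int block_count, unsigned worker_count) {
    assert(worker_count > 0);
    BlockingQueue<Block> queue;
    std::mutex total_mutex;
    long long total = 0;

    std::thread reader([&] {
        for (int id = 1; id <= block_count; ++id) queue.push(load_block(id));
        queue.close();
    });

    std::vector<std::thread> workers;
    for (unsigned w = 0; w < worker_count; ++w) {
        workers.emplace_back([&] {
            long long local = 0;
            Block block;
            while (queue.pop(block))
                local += std::accumulate(block.begin(), block.end(), 0LL);
            std::lock_guard<std::mutex> lock(total_mutex);
            total += local;
        });
    }

    reader.join();
    for (auto& t : workers) t.join();
    return total;
}
```

In a real TBB program the worker loop would presumably be replaced by pipeline stages or another TBB algorithm, with the queue serving as the source for the first stage; the plain worker threads here are only to keep the sketch self-contained.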
These are mostly general principles; it's difficult to be more specific when the problem is already interpreted with a specific bias.
The functionality in TBB is directed mainly toward efficient use of processing elements; it does not currently take a position on I/O. Choose whatever mechanism you prefer (portable or otherwise), but try to avoid blocking inside a task, and don't rely on any degree of concurrency that you don't provide yourself in the form of user threads.
Is there a good way for the thread which is pushing the data into the concurrent queue to wait until the queue contents have been fully processed? The thread in question is under the control of the external application and the processing is done within TBB. The thread does not have to do other work while waiting.
There is no direct support for such a use case in the queue. So the options are either to wait for consumers to notify that the queue is empty (passive waiting; this is what Raf suggested) or to periodically poll the queue state in a loop (active waiting, probably with sleeps/yields inside the loop to consume fewer CPU cycles). Of course, I would consider passive waiting first.
But is calling empty() thread-safe? The documentation is not very clear about that. With concurrent_bounded_queue() it can be deduced not to be thread safe because empty() returns "size()<=0" and size() can have strange values based on who is doing what to the queue at the time, including spuriously indicating emptiness of the queue, I suppose.
So, without more information about what the queue guarantees, I would still go directly to the consumer for polling as well as for passive waiting.
Maybe I need to look further, but as far as I can tell there is no synchronisation between push() and pop() that would not allow push() to return before a pending pop() has a chance to act, thereby temporarily allowing empty() to return true while the item is still inside the queue. So, while empty() can indeed be called concurrently with other operations on the queue, it can only tell whether it knows that the queue will soon be drained, not whether it already is.
(Added) Although the original question about items being "fully processed" would necessarily need information from the consumer(s) after their local processing has finished, not after they have merely popped the items, so this may not even be about the queue itself, after all.
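To make that gap concrete, here is a small deterministic sketch in standard C++. The mutex-guarded std::queue is only a stand-in for the concurrent queue, and observe_gap() is a hypothetical name. The consumer pops the only item and then pauses before finishing its work, so the observer sees an empty queue while the item is still unprocessed.

```cpp
#include <future>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Returns {queue_looked_empty, item_was_processed} as observed after the
// consumer has popped the only item but before it has finished working on it.
std::pair<bool, bool> observe_gap() {
    std::mutex mutex;
    std::queue<int> queue;      // stand-in for the concurrent queue
    bool processed = false;     // guarded by mutex
    queue.push(42);

    std::promise<void> popped, may_finish;
    std::future<void> popped_signal = popped.get_future();
    std::future<void> finish_signal = may_finish.get_future();

    std::thread consumer([&] {
        { std::lock_guard<std::mutex> lock(mutex); queue.pop(); }
        popped.set_value();     // the item has left the queue
        finish_signal.wait();   // simulate processing still in progress
        std::lock_guard<std::mutex> lock(mutex);
        processed = true;       // only now is the item fully processed
    });

    popped_signal.wait();
    bool empty_seen = false;
    bool processed_seen = false;
    {
        std::lock_guard<std::mutex> lock(mutex);
        empty_seen = queue.empty();
        processed_seen = processed;
    }
    may_finish.set_value();
    consumer.join();
    return {empty_seen, processed_seen};
}
```

Here the observer sees the queue as empty and the item as not yet processed, which is why an emptiness check alone cannot answer the "fully processed" question.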
The method returns whether the queue was logically empty for an observer at the moment it did the check. If I correctly understand your example, Raf, the not-yet-returned try_pop() has already "reserved" the item in the queue so that another try_pop() would fail unless another thread pushes something; i.e. for observers, the queue is empty even though there might be an element not yet fully popped out of it.
And your last comment is correct indeed. I missed that the desired condition is "all items are fully processed", in which case polling the queue makes no sense.
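One way to get that information from the consumers is a shared count of outstanding items that the waiting thread sleeps on. The sketch below uses only the standard library; CompletionTracker and process_and_wait() are hypothetical names, not part of TBB, and the atomic index stands in for popping from the queue. It assumes every item is registered before the producer starts waiting and that there is at least one consumer whenever there are items.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

class CompletionTracker {
public:
    // Call before the item becomes visible to consumers.
    void item_submitted() {
        std::lock_guard<std::mutex> lock(mutex_);
        ++pending_;
    }
    // Call after processing has finished, not merely after popping the item.
    void item_finished() {
        std::lock_guard<std::mutex> lock(mutex_);
        assert(pending_ > 0);
        if (--pending_ == 0) all_done_.notify_all();
    }
    // Passive wait: sleeps until no submitted item is outstanding.
    void wait_until_done() {
        std::unique_lock<std::mutex> lock(mutex_);
        all_done_.wait(lock, [this] { return pending_ == 0; });
    }
private:
    std::mutex mutex_;
    std::condition_variable all_done_;
    int pending_ = 0;
};

// Returns the number of processed items the producer sees after waiting.
int process_and_wait(int item_count, int consumer_count) {
    CompletionTracker tracker;
    std::atomic<int> next_item{0};   // stand-in for popping from the queue
    std::atomic<int> processed{0};

    // Register everything first so the count cannot reach zero too early.
    for (int i = 0; i < item_count; ++i) tracker.item_submitted();

    std::vector<std::thread> consumers;
    for (int c = 0; c < consumer_count; ++c) {
        consumers.emplace_back([&] {
            while (next_item.fetch_add(1) < item_count) {
                processed.fetch_add(1);    // placeholder for the real work
                tracker.item_finished();
            }
        });
    }

    tracker.wait_until_done();
    int seen = processed.load();     // read before joining the consumers
    for (auto& t : consumers) t.join();
    return seen;
}
```

If items are submitted while consumers are already running, the count can briefly touch zero between submissions, so the producer should start waiting only after its last submission.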
This meaning with pending pop() (not try_pop()) operations is not clear from the documentation and potentially "surprising". Even if there may only be limited use for knowing exactly when all items have passed from the queue to consumers, perhaps the programmer should still be told explicitly that the queue will not provide that information (probably because additional counters for limited applicability would not be helpful for performance in general?) and that it would have to be obtained directly from the consumers instead. (In this case it is so clear that "all items are fully processed" can only be answered by the consumers that maybe the question really was "all items have at least commenced processing", after all.)
>>Can you tell me if there is a good solution for asynchronous i/o in TBB?
As Raf and others replied, TBB is not structured to include asynchronous I/O as an integral part of its tasking. You can add non-TBB threads that push/pull data from concurrent objects (queues), but with this type of integration it is difficult to attain optimal tuning.
If you are not too deep into implementing your TBB code, you might take a look at QuickThread (www.quickthreadprogramming.com). It has two thread classes: compute and I/O. The I/O task class is intended for tasks that perform I/O or other blocking operations (e.g. waiting for a condition variable or event). For your application the choice would be to use the QuickThread parallel_pipeline with an I/O class pipe at the beginning, compute class task(s) in the middle, and optionally an I/O class task at the end (assuming I/O is required for output). This pipeline is highly tuned to transition between the thread classes. Also, if you port your app to a NUMA-capable platform, the QuickThread parallel_pipeline is optimized for this type of environment as well. I've recently changed the licensing to permit free evaluation and direct downloading from the website. If you need any assistance in putting together a mock-up of your application as a parallel_pipeline, please send me an email (firstname.lastname@example.org).