I have an application that starts 3 worker threads. Each worker thread creates a tbb::task_group and submits tasks to it at 500 ms intervals, all at roughly the same time. I log the startup delay of each task (the time when the task actually starts executing minus the time when it was submitted to the task group). The typical execution time of a task is 400 ms in thread 1, 400 ms in thread 2, and 50 ms in thread 3.
I ran the application on a 4-core i5-4460 + Win10 x64 machine and observed negligible startup delay (0-2 ms) for all tasks in thread 1 and thread 2, but the startup delay for tasks in thread 3 varied from negligible to 300-400 ms. Overall CPU usage was around 70%-75%.
If I stop thread 1 or thread 2, the startup delay for all tasks in thread 3 becomes negligible. If I restart thread 1 or thread 2, the high startup delay shows up in thread 3 again, while the startup delay for tasks in thread 1 and thread 2 remains negligible.
If I replace tbb::task_group with a simple thread pool like https://github.com/progschj/ThreadPool , the startup delay for all tasks in all worker threads becomes negligible.
It seems to be a bad idea to use tbb::task_group as a generic thread pool for asynchronous task execution, because such a high startup delay is unacceptable in many situations. Or can this be prevented by tuning the task scheduler?
Maksim D. (Intel) wrote:
Could you please share some more information about your case? Could you also show the source code?
Sorry, I'm not allowed to show the source code, but I can provide information about my application:
It's a machine vision application using 3 cameras. Each worker thread is associated with one camera and continuously processes images fetched from that camera. Thread 1 and thread 2 have similar image processing workflows, and thread 3 has its own, much simpler workflow.
The workflow of thread 1 and thread 2 is as follows: launch 2 asynchronous sub-tasks, wait 300-400 ms until both sub-tasks have finished, then do some simple post-processing. Each sub-task is executed in a thread pool with only 1 worker thread (the thread pools are created at application start and are reused to process multiple sub-tasks). I can't simply use tbb::parallel_invoke to execute these 2 sub-tasks in arbitrary threads, because both of them use the GPU via Caffe+cuDNN and, for stability reasons, all code accessing a GPU-enabled caffe::Net object must run in the same thread that created it (the caffe::Net objects are created and initialized at application start, just after the thread pools are created).
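To make the "thread pool with only 1 worker thread" arrangement concrete, here is a minimal sketch of such a pool using only the standard library. It is an illustration under assumptions, not the code from the application: the class name `SingleThreadExecutor` and its `submit` method are hypothetical. The property it demonstrates is that every submitted job runs on the same dedicated thread, which is what thread-affine objects such as the caffe::Net instances described above need.

```cpp
#include <condition_variable>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical single-worker pool: all jobs run on one dedicated thread,
// in submission order.
class SingleThreadExecutor {
public:
    SingleThreadExecutor() : worker_([this] { run(); }) {}

    ~SingleThreadExecutor() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        cv_.notify_one();
        worker_.join();  // remaining queued jobs are drained before exit
    }

    SingleThreadExecutor(const SingleThreadExecutor&) = delete;
    SingleThreadExecutor& operator=(const SingleThreadExecutor&) = delete;

    // Queues `f` and returns a future for its result.
    template <class F>
    auto submit(F f) -> std::future<decltype(f())> {
        using R = decltype(f());
        auto job = std::make_shared<std::packaged_task<R()>>(std::move(f));
        std::future<R> result = job->get_future();
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push([job] { (*job)(); });
        }
        cv_.notify_one();
        return result;
    }

private:
    void run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return stopping_ || !jobs_.empty(); });
                if (jobs_.empty()) return;  // stopping and nothing left to do
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();  // executed outside the lock
        }
    }

    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> jobs_;
    bool stopping_ = false;
    std::thread worker_;  // declared last so the members above exist first
};
```

With two such executors, a camera thread would call `submit` once on each, keep the two futures, and call `get()` on both before post-processing. Creating the thread-affine object through a `submit` call as well would place construction and all later use on the same thread.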
Sorry for the slow response. I am not sure that I understand how Intel TBB worker threads are used. Are the three camera threads created manually, or are they Intel TBB worker threads? How do the camera threads wait for sub-task completion (is it a condition variable or something similar)? Do you have parallelism inside the sub-tasks (e.g. a parallel loop)?