Alex_S_1 · ‎03-31-2014

Hi, I'm considering using TBB for its graph functionality. I need to strictly bind every graph node to a specific CPU core/thread (for example, there could be 10 graph nodes and 3 cores/threads). It seems to me that TBB does not provide this level of control over core/thread affinity for graph nodes. Am I wrong? Thanks a lot.

RafSchietekat · ‎03-31-2014

Why do you "need" that?

Alex_S_1 · ‎03-31-2014

Every graph node is a component that runs on a different hardware node (GPU(i), iGPU etc.). So, there is a "binding" between a specific graph node and the specific CPU thread that manages a specific hardware device.

Alex_S_1 · ‎03-31-2014

Also, there is some heavy-processing legacy code that runs on the CPU. The processing (both CPU- and GPU-based) is done in parallel with other activities of the host system, which is a (semi-)real-time system. To ensure performance, the CPU cores will be divided between the different CPU "users" (the host system and the client processing module).

RafSchietekat · ‎03-31-2014

Seems like you should go old school on those cores, by running your own threads.

I won't say that TBB should never provide such a "feature" for specialised uses (if it could easily be done), but in general handing over data between threads/cores is not the appropriate strategy to maximise throughput, so...

jiri · ‎03-31-2014

TBB is not designed for this kind of thread management. What could also get you into trouble is this "CPU thread that manages specific hardware device". This often involves making calls that block for a long time, which is really bad for TBB.

TBB is quite good at cooperating with other parallel libraries, so you could still consider using it for some parts, but based on the little information that you provided about your system, I don't think it is the case.

RafSchietekat · ‎04-01-2014

jiri wrote:

TBB is not designed for this kind of thread management. What could also get you into trouble is this "CPU thread that manages specific hardware device". This often involves making calls that block for a long time, which is really bad for TBB.

That's not necessarily true: look at pipeline's thread_bound_filter. Its thread is separate from the threads normally active on behalf of TBB, so it may cause some oversubscription overhead (which is why it is not a preferred solution), but does not cause blocking (which would be far worse, associated with anything from undersubscription to deadlock). It just hasn't been implemented for a flow graph (yet).

jiri · ‎04-02-2014

I'm aware of the thread_bound_filter, I just didn't think it is the solution in this case. And based on what we know about the problem, I strongly suspect TBB won't be a good fit - lots of legacy CPU code, GPUs, and a real time system. Of course, I can't be sure without knowing a lot more about the system.

Alex_S_1 · ‎04-03-2014

Thanks a lot guys. You reconfirmed my thoughts that TBB does not seem to provide thread/core affinity for graph nodes (at least currently)…

I may add more explanation on the problem. In the general case, the system has one (or two) GPU cards. For simplicity though (and also because it does not make sense to move intermediate results back and forth between two GPUs), let's assume the system runs one GPU. The input data is a 1kx1k float buffer, which keeps both the CPU and the GPU fully utilized and does not allow effective task parallelization (at least on the GPU, due to memory limitations). The CPU runs legacy code that is not optimized for multi-core architectures (and there is no plan to do such multi-core optimization, but rather to move the old CPU code to the GPU). So, this brings me to two "compute" threads: a CPU compute thread and a GPU-manager thread. On the GPU, processing tasks are pushed asynchronously, so the GPU-manager thread is idle most of the time. Synchronization is done only at split/sync points in the graph. In a typical case, there could be 3 split/sync points for a total of 10 blocks. Obviously, it is not a "good" thing to move data between the CPU and GPU devices, but 1) the overhead is low relative to the total flow time in my specific case, and 2) there is no other way around it (we have this legacy code that will later be moved to the GPU). Hope this sheds more light on my problem domain…

In any case, what I'm looking for is a "closed" solution for graph functionality, with the addition of binding my graph nodes to a specific thread (and also core, as I'm not the only user of our controlled system). TBB covers the more general case while relieving the user of the burden of managing threads, and probably does it well. But I need to "bind" a graph node to (at least) a GPU device…

jimdempseyatthecove · ‎04-04-2014

>>
Every graph node is a component that runs on a different hardware node (GPU(i), iGPU etc.). So, there is a "binding" between a specific graph node and the specific CPU thread that manages a specific hardware device.
...
Also, there is some heavy-processing legacy code that runs on the CPU. The processing (both CPU- and GPU-based) is done in parallel with other activities of the host system, which is a (semi-)real-time system.
<<

This suggestion will be somewhat of a kludge. Considering the investment in time you may have in TBB, you may find it "elegant".

1) Create a non-TBB thread to interact with each GPU (one thread per GPU).
2) Each of these threads will loop { wait on a condition variable/event, pass the work request to the GPU, wait for the GPU to finish, dispose of results or data, signal a completion event }.
3) Affinity-pin those threads to the appropriate CPU/core.
4) Set their priority high; these threads will spend most of their time waiting on the condition/event or for the GPU to finish.
5) You will have to tune the size of the TBB thread pool to oversubscribe by roughly the number of GPU nodes being processed concurrently.
6) The TBB node(s) responsible for launching a GPU task packages the request for the GPU, signals the non-TBB thread's condition variable/event, then waits for the completion event. Note that this TBB thread will be in a sleep state during the wait. This makes its wait time available to the oversubscribed TBB threads.
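The steps above could be sketched roughly as follows. This is only an illustration in standard C++ threads, not TBB code and not anything posted in this thread: the names `DeviceWorker`, `submit` and `pin_to_core` are made up, the `job` callable stands in for "push work to the GPU and wait for it to finish", the returned `std::future` plays the role of the completion event, and pinning uses the Linux-specific `pthread_setaffinity_np` (other platforms would need their own call). Raising the thread priority (step 4) and sizing the TBB pool (step 5) are not shown.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <future>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#ifdef __linux__
#include <pthread.h>
#include <sched.h>
#endif

// Hypothetical helper: one dedicated (non-TBB) thread per device.
class DeviceWorker {
public:
    // core < 0 means "do not pin the thread".
    explicit DeviceWorker(int core = -1) : thread_([this] { run(); }) {
        pin_to_core(core);
    }

    ~DeviceWorker() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stop_ = true;
        }
        wake_.notify_one();
        thread_.join();  // the worker drains queued jobs before exiting
    }

    // Called from a graph node body: queue the job, wake the worker, and
    // return a future the caller can wait on (the "completion event").
    std::future<void> submit(std::function<void()> job) {
        assert(job && "job must be callable");
        std::packaged_task<void()> task(std::move(job));
        std::future<void> done = task.get_future();
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(task));
        }
        wake_.notify_one();
        return done;
    }

private:
    void run() {
        for (;;) {
            std::packaged_task<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_.wait(lock, [this] { return stop_ || !queue_.empty(); });
                if (queue_.empty()) return;  // stop requested, nothing left
                task = std::move(queue_.front());
                queue_.pop();
            }
            task();  // stand-in for: launch on the device, wait for it
        }
    }

    void pin_to_core(int core) {
#ifdef __linux__
        if (core < 0) return;
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        // Best effort: the return value is ignored in this sketch.
        pthread_setaffinity_np(thread_.native_handle(), sizeof(set), &set);
#else
        (void)core;  // assumption: no portable affinity call is available
#endif
    }

    std::mutex mutex_;
    std::condition_variable wake_;
    std::queue<std::packaged_task<void()>> queue_;
    bool stop_ = false;
    std::thread thread_;  // declared last so the other members exist first
};
```

A node body would then call something like `worker.submit(job).wait()`, which blocks the calling TBB thread in a sleeping wait as described in step 6.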

You may find some improvement in throughput by having the non-TBB threads perform a small spinwait when no work is found.
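One way to picture the short spin-wait is a signal that polls a flag a bounded number of times before falling back to a blocking wait. The sketch below is only an assumption about how that might look, for a single waiting thread: the class name `SpinThenBlockSignal` and the default spin count are invented, and the right spin length would have to be measured on the real system.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical single-consumer signal: spin briefly, then sleep.
class SpinThenBlockSignal {
public:
    // Producer side: mark work as available and wake a sleeping waiter.
    void post() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            ready_.store(true, std::memory_order_release);
        }
        cv_.notify_one();
    }

    // Consumer side: returns true if the signal was caught while spinning,
    // false if the thread had to fall back to the blocking wait.
    bool wait(int spin_iterations = 1000) {
        assert(spin_iterations >= 0);
        for (int i = 0; i < spin_iterations; ++i) {
            if (ready_.exchange(false, std::memory_order_acquire)) return true;
            std::this_thread::yield();  // give other threads a chance
        }
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return ready_.load(std::memory_order_acquire); });
        ready_.store(false, std::memory_order_relaxed);
        return false;
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::atomic<bool> ready_{false};
};
```

The trade-off is the usual one: spinning avoids the wake-up latency of a sleeping thread when work arrives quickly, at the cost of burning cycles on a core that other threads might want.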

You may also find it beneficial to incorporate a FIFO queue or queues (targeted GPU, untargeted GPU, high or low priority GPU, ...)
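As a rough illustration of the queue idea, here is a minimal two-level FIFO in which high-priority requests are always served before low-priority ones, and requests of the same priority keep their arrival order. The names `TwoLevelFifo` and `Priority` are hypothetical. A targeted/untargeted split per GPU could be built the same way with one queue per device plus a shared one, but that is not shown here.

```cpp
#include <cassert>
#include <mutex>
#include <queue>
#include <string>
#include <utility>

enum class Priority { High, Low };

// Hypothetical thread-safe queue pair: high-priority requests go first.
template <typename Request>
class TwoLevelFifo {
public:
    void push(Request request, Priority priority) {
        std::lock_guard<std::mutex> lock(mutex_);
        (priority == Priority::High ? high_ : low_).push(std::move(request));
    }

    // Non-blocking pop: returns false when both queues are empty.
    bool try_pop(Request& out) {
        std::lock_guard<std::mutex> lock(mutex_);
        std::queue<Request>& source = !high_.empty() ? high_ : low_;
        if (source.empty()) return false;
        out = std::move(source.front());
        source.pop();
        return true;
    }

private:
    std::mutex mutex_;
    std::queue<Request> high_;
    std::queue<Request> low_;
};
```

Note that strict priority like this can starve low-priority requests if high-priority work never stops arriving, which may or may not matter for a given system.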

Jim Dempsey

Graph nodes and core affinity