Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

try_put() not thread-safe?!

Hawkes__Rycharde
New Contributor I

I'm posting this as a separate, surprising discovery: guarding try_put() as follows makes it work again in multithreaded mode.

[cpp]
{
    tbb::spin_rw_mutex::scoped_lock lock(TryPutMutex);
    res = get<0>(out).try_put(pair);
}
[/cpp]

Igor
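For context, the guard expanded into a self-contained form looks something like this (the multifunction_node setting, message type, and port index here are assumptions; only the two guarded lines come from the code above):

[cpp]
#include "tbb/flow_graph.h"
#include "tbb/spin_rw_mutex.h"
#include <tuple>
#include <utility>

// One mutex shared by every thread that may call try_put on the same port.
static tbb::spin_rw_mutex TryPutMutex;

typedef std::pair<int, int> pair_t;
typedef tbb::flow::multifunction_node<pair_t, std::tuple<pair_t> > mf_node_t;

struct GuardedBody {
    void operator()(const pair_t& pair, mf_node_t::output_ports_type& out) {
        bool res;
        {
            // scoped_lock takes the writer lock by default, so concurrent
            // try_put calls on this port are fully serialized.
            tbb::spin_rw_mutex::scoped_lock lock(TryPutMutex);
            res = std::get<0>(out).try_put(pair);
        }
        (void)res; // the real code presumably uses the result
    }
};

int main() {
    tbb::flow::graph g;
    mf_node_t node(g, tbb::flow::unlimited, GuardedBody());
    node.try_put(pair_t(1, 2));
    g.wait_for_all();
    return 0;
}
[/cpp]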

Christophe_H_Intel

Hello, Igor,

Thank you for the report; it corroborates something I have been researching. 

Could you try applying a patch to your copy of TBB and see if it fixes the problem? I am attaching the diff file. The file modified is include/tbb/internal/_flow_graph_impl.h.

Please let me know if this fixes your problem.

Best Regards,
Chris

Hawkes__Rycharde
New Contributor I

Yes, that fixed it on my Windows machine, debug build. But there must be some other problem as well, since it still crashes in the Release build on Linux.

Christophe_H_Intel

Hello, Igor,

Thank you for the update. I was able to reproduce the bug here, and I think I have a fix. Could you apply the following patches to your copy of TBB? (The first file is the same code I previously sent you; the second patch, to include/tbb/flow_graph.h, fixes the second bug.)

Please let me know if this works, and thank you very much for your time.

Regards,
Chris

msomeone
Beginner

Hello,

This patch is not working for me (applied to the latest 4.2 source); I don't know about Igor (in our app try_put is also called from multiple threads). The flow graph randomly hangs after several hundred or thousand re-runs (= calling try_put on the 'root node' and then wait_for_all), tested on Windows and Linux. The issue could not be reproduced on the 4.1 release. We've even tested with empty (multi-)function node bodies - no luck, it still hangs :(
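The re-run pattern itself looks roughly like this (a minimal sketch with a placeholder node, not our real graph):

[cpp]
#include "tbb/flow_graph.h"

int main() {
    tbb::flow::graph g;

    // Empty serial body, as in our stripped-down test.
    tbb::flow::function_node<int, int> root(g, 1, [](int v) { return v; });

    // The hang shows up randomly after hundreds or thousands of
    // iterations; in the real app try_put is also called from
    // several threads at once.
    for (int i = 0; i < 100000; ++i) {
        root.try_put(i);
        g.wait_for_all();
    }
    return 0;
}
[/cpp]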

Any suggestions, please?

Christophe_H_Intel

Hello, msomeone,

Are you also using the limiter_node?  Is there something you can let me look at to try and debug the issue?

There is another bug in limiter_node in 4.2: if a continue_msg is put to the decrement port while the count of the node is zero, the count is "decremented" anyway and wraps around to a large positive number. I am attaching a patch file to fix this, based on the 4.2 OSS source release.
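For reference, the trigger looks like this (a minimal sketch against the 4.2 OSS flow graph API):

[cpp]
#include "tbb/flow_graph.h"

int main() {
    tbb::flow::graph g;
    tbb::flow::limiter_node<int> limiter(g, /*threshold=*/1);

    // Nothing has passed through the node yet, so its internal count
    // is zero. Without the patch, this decrement wraps the count
    // around to a large positive number, and the node effectively
    // stops limiting.
    limiter.decrement.try_put(tbb::flow::continue_msg());

    g.wait_for_all();
    return 0;
}
[/cpp]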

There is at least one other corner case in limiter_node which we are currently addressing (a situation where a message is fetched from a predecessor but no successor is available, so the node drops the message). We haven't seen it in users' code yet. Both of these bugs were in the last 4.1 TBB too, so possibly they are not what you have found.

The only other change I could find between the last 4.1 and 4.2 that might cause a hang that hadn't happened before is the change in task handling from enqueueing to spawning. Could you rebuild with -DTBB_DEPRECATED_FLOW_ENQUEUE and see if that has an effect?
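For example (a g++ command line is assumed here; adjust for your build system):

[cpp]
g++ -DTBB_DEPRECATED_FLOW_ENQUEUE -std=c++11 app.cpp -ltbb -o app
[/cpp]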

Thank you for the post, and I hope we can help you.

Regards,
Chris

msomeone
Beginner

Accidentally double-posted. The message is in the next post:

msomeone
Beginner

Hello, Chris,

Thanks for your comments and the patch, but we are not using limiter_node. Unfortunately, we don't have a simple app that reproduces the issue. When our app was compiled with -DTBB_DEPRECATED_FLOW_ENQUEUE, the hanging issue disappeared.

Right now we suspect that the problem is with our ill-formed graph, and a few questions arise about correct flow graph usage. Let me first explain how the flow graph is used inside the app. In the app, the flow graph is the mechanism that drives the dependent tasks of a rendering engine. All data dependencies and execution ordering between computation tasks (CPU only) and rendering tasks (CPU + GPU) are hand-wired in a graphical editor, which generates a source file with the node and edge construction; we then implement the node bodies where needed.

Of course, rendering tasks cannot be executed inside TBB worker threads, because multithreaded OpenGL usage is prohibited (without wrappers, GL context switches, and call deferring). So the actual rendering tasks are executed in a separate dedicated render thread, by passing messages through a concurrent_queue from the wrapping render task (the body of a function node executed in TBB worker threads) to the dedicated thread. Each actual render task (inside the render thread) has pointer(s) to queue_nodes, which have no predecessors but are connected to dependent nodes (tasks or join/broadcast). When an actual render task has executed in the dedicated rendering thread, it outputs its results to the queue_nodes and execution of the dependent nodes proceeds as expected. We now suspect that those "hanging in the air" queue_nodes with no predecessors may result in an ill-formed graph that hangs 4.2.

The final node in the graph (the EndFrame task) passes a 'restart' message to the dedicated render thread, which leads the render thread to:

1) swap buffers (GL)

2) call wait_for_all on the graph to ensure it has finished (this never resulted in the rendering thread executing "computation tasks", because the last one had already finished)

3) put a continue_msg to the first node, BeginFrame (do {} while ( ..->try_put(..) ))

4) wait for a message saying that a render task can be executed

The old implementation of the described system used barriers to ensure that the wrapping render-task body (executed in a function node) would not finish until the actual render task had finished executing in the dedicated rendering thread, and then copied the output(s) from it, so no "hanging in the air" queue_nodes were used. But that implementation resulted in TBB worker threads doing less useful work and poorer engine performance, so the implementation described above was developed.

Below, images of the simplest possible graph drawn in the editor and its TBB implementation are included for clarity.
What is hand-drawn in the editor (also attached as minimal_src.png):


All lines (connections) between nodes represent make_edge operations. All nodes have a concurrency limit of 1 and all default policies.
BeginFrame, EndFrame, BurnCPU_1, and BurnCPU_2 are bodies executed in function nodes and implemented by the developer.
GPUClearScreen is a simple wrapper body, which actually passes a message to the render thread with information about the rendering task required to execute.
tbb_continue_msg, join_node, queue_node, and broadcast_node correspond to tbb::continue_msg, tbb::join_node, tbb::queue_node, and tbb::broadcast_node respectively.
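To make the wiring concrete, the generated code looks something like this (a rough sketch; the exact topology and the body functors are placeholders, not our real generated code):

[cpp]
#include "tbb/flow_graph.h"

using namespace tbb::flow;

// Placeholder for the developer-implemented bodies.
struct NamedBody {
    continue_msg operator()(const continue_msg&) const { return continue_msg(); }
};

int main() {
    graph g;

    // All function nodes are serial (concurrency limit 1), default policies.
    function_node<continue_msg, continue_msg> begin_frame(g, 1, NamedBody());
    function_node<continue_msg, continue_msg> burn_cpu_1 (g, 1, NamedBody());
    function_node<continue_msg, continue_msg> burn_cpu_2 (g, 1, NamedBody());
    // Wrapper: in the real app this pushes a render request to the render thread.
    function_node<continue_msg, continue_msg> gpu_clear  (g, 1, NamedBody());

    // No predecessor: the render thread hand-pushes GPU results here.
    queue_node<continue_msg> gpu_result(g);

    broadcast_node<continue_msg> fan_out(g);
    join_node<tuple<continue_msg, continue_msg> > frame_join(g);
    function_node<tuple<continue_msg, continue_msg>, continue_msg>
        end_frame(g, 1, [](const tuple<continue_msg, continue_msg>&) {
            return continue_msg();
        });

    make_edge(begin_frame, fan_out);
    make_edge(fan_out, burn_cpu_1);
    make_edge(fan_out, gpu_clear);
    make_edge(burn_cpu_1, burn_cpu_2);
    make_edge(burn_cpu_2, input_port<0>(frame_join));
    make_edge(gpu_result, input_port<1>(frame_join));
    make_edge(frame_join, end_frame);

    begin_frame.try_put(continue_msg()); // kick off one frame
    gpu_result.try_put(continue_msg());  // simulate the render thread's hand-push
    g.wait_for_all();
    return 0;
}
[/cpp]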

What actually gets generated from the above graph (also attached as minimal_dst.png):

Above is the real picture of what is going on for the hand-drawn graph. The grey rectangle covers code executed in TBB worker threads; the violet rectangle covers code executed in the dedicated rendering thread. The rendering thread is simple: it spins infinitely, trying to pop (try_pop) from a concurrent_queue (spinning was cheaper than conditional waiting in our case) and either execute a rendering task or swap buffers.
As can be seen from the picture, GPUClearScreen has no successors; it is a simple function node whose body pushes a message to the renderer's concurrent_queue. The actual rendering code is executed in the rendering thread when it receives a message from the concurrent_queue, and the real output of GPUActualClearScreen is put to a queue_node.
It is important to note again that the queue_node has no predecessors connected to it, and results are hand-pushed to it (actually done by generated code).
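Schematically, the render-thread side looks something like this (the message type and names here are made up for illustration):

[cpp]
#include "tbb/concurrent_queue.h"
#include "tbb/flow_graph.h"
#include <atomic>

// Hypothetical render request: which GPU task to run, and the
// predecessor-less queue_node that should receive its result.
struct RenderMsg {
    int task_id;
    tbb::flow::queue_node<tbb::flow::continue_msg>* result;
};

tbb::concurrent_queue<RenderMsg> render_queue;
std::atomic<bool> running(true);

// Dedicated render thread: spin-poll the queue, execute the GL work,
// then hand-push the result into the flow graph.
void render_thread_main() {
    RenderMsg msg;
    while (running) {
        if (render_queue.try_pop(msg)) {
            // ... issue the actual OpenGL calls for msg.task_id ...
            msg.result->try_put(tbb::flow::continue_msg());
        }
    }
}
[/cpp]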

-------------------

Sorry for the quite long message and the cumbersome explanations.

Are queue_nodes without predecessors the cause of the graph hanging, or could it be something else? Any ideas how the hanging-queue_nodes problem could be solved with minimal stalling?

I suspect that it will be hard to understand without any code, so if needed I'll start working on as minimalistic a version of the engine as possible, one that is easy to build and debug.

Regards, Max.

Hawkes__Rycharde
New Contributor I

Hello Chris,
Sorry for the delay. The two patches applied to tbb42_20130725oss AND an important fix to my priority queue implementation (which had a global variable and hence was not thread-safe!) completely solved my problems. Having a decrement of limiter_node's zero count leave the count at zero is also a welcome fix. Which version of TBB is going to include all these patches? I'll have to read Max's messages carefully and follow this topic to see if there are other problems with try_put (I hope not).
Thanks
Igor 

Christophe_H_Intel

Hello, Max,

At first glance the graph you constructed should work. Do you have a version without the separate rendering thread (the orange F(x) node in your pictures) that succeeds without the -DTBB_DEPRECATED_FLOW_ENQUEUE switch?

The main difference in behavior between spawning and enqueueing is that spawned tasks are executed in a LIFO manner (the last task created is the first task executed). Other threads that become idle and are part of the work group will attempt to steal tasks from other threads, looking at the "bottom" of the stack of tasks for available work. This works very well for parallel constructs that use "divide and conquer" (a parallel_for will have the largest sub-region of work at the bottom of its stack, so stealing will take a large chunk of work that the other thread will then subdivide and start working on).

That feature of task scheduling by spawning is not as important for flow::graph, but the other result of spawning vs. enqueueing is locality of reference.  This is also implemented in pipeline, and helps reduce cross-core memory traffic.

Enqueueing, on the other hand, schedules tasks in a FIFO manner. The downside of enqueueing is that the queue is global, so it is a heavier process to put tasks in and take them out. Spawning works almost entirely on a local stack, so it requires less locking.

Enough of my rambling.  Given that I cannot see a reason why enqueueing works while spawning doesn’t, I can only give you generic advice.  If you can make a version of the graph that does not invoke a task on the GPU, I’d try that and see if it works with spawning.  Then if it doesn’t, I’d try cutting the case down (take out the BurnCPU_2 node and the join node, attaching the queue directly to EndFrame) to see if it can be made to work.

How do you pass the dummy input to BeginFrame?  Is it via try_put?

The queueing join_node has queues on each of its inputs, so the way the graph is structured now the queue_node is not necessary, but it doesn’t hurt anything.

My other question is: how many threads are you using in the graph? (If you have a task_scheduler_init call, what is the count; otherwise, how many cores does the system have?) The reason for changing enqueue to spawn was to fix some corner cases where only one thread was used.
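(For reference, an explicit count would be set like this:)

[cpp]
#include "tbb/task_scheduler_init.h"

// Explicit worker count; without such an object, TBB uses
// tbb::task_scheduler_init::default_num_threads(), roughly the
// number of hardware threads.
tbb::task_scheduler_init init(4);
[/cpp]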

I hope we can help you get this working.

Best Regards,
Chris

jimdempseyatthecove
Honored Contributor III

Max (msomeone),

Why did you choose flow-graph instead of parallel_pipeline?

Jim Dempsey

RafSchietekat
Valued Contributor III

A question to Christopher:

I like the attempt to use traditional TBB scheduling even for a flow graph, but how does that affect fairness and composability? Unless operated from separate arenas, multiple pipelines can starve each other, so I can imagine that a similar thing goes for flow graphs, unless the design explicitly avoids that (how?).

Maybe that could explain the current problem, maybe it's something else, but I'm curious nonetheless.
