It's a rather long thread, so please bear with me.
I'm trying to implement a DVB-S2 transmitter on my six-core (12 hyper-threads) Dell PC running Ubuntu 10.10. I'm using Intel's C++ Studio XE tool set for the development. The realm of multicore, parallel programming is very new to me.
The transmitter is basically composed of a Mepeg generator tbb::pipeline filter, a DVB-S2 compliant tbb::pipeline filter, and another filter that sends the output symbols to the Ethernet interface. I'm using a USRP2 box (directly connected to the PC) to convert the samples sent over Ethernet to their analog values so I can visualize them on a scope. When I run the application, the output is choppy, i.e. it is not a continuous stream of samples but arrives in bursts.
If I look at the threading execution graph using VTune, I see there's only one thread (core?) running at 100%. All the other ones are mostly waiting or run for a short duration. They are also very synchronous: they run together (up to 10 threads sometimes) every 4 msec on average, which is the period of the bursts of samples I see on the scope.
My questions are: How can I keep the worker threads busier? Would changing from pipeline to graph help? Or do I have to go deeper and optimize/change some portion of the code so it can be executed across more cores and therefore be more efficient?
Thank you for your time, Leonard
P.S.: I wanted to add a picture of the VTune output, but couldn't using the Insert/edit Image button. Is there any other way to do this?
Yes, I meant 4 msec. I did not try using fewer threads, but I could. What I did try, though, was to remove some heavy processing from the DVB-S2 filter, and the burst interval shortened to 1 msec.
I just use pipeline, and all filters are serial. Processing order is very important, especially for some of the encoding and filter algorithms used. I tried making one of the filters parallel and I wasn't getting the correct results, so I've stuck with serial_in_order.
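To illustrate why a stage with carried-over state has to see its items in order, here is a minimal sketch. The `DiffEncoder` type is a hypothetical stand-in (a simple differential encoder), not one of the actual DVB-S2 blocks; the point is only that its output for a frame depends on whichever frames were processed before it.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stateful stage: a differential encoder whose last output bit
// carries over from one frame to the next.
struct DiffEncoder {
    std::uint8_t prev = 0;  // state shared across frames

    std::vector<std::uint8_t> encode(const std::vector<std::uint8_t>& bits) {
        std::vector<std::uint8_t> out;
        out.reserve(bits.size());
        for (std::uint8_t b : bits) {
            assert(b <= 1);  // input is expected to be one bit per element
            prev = static_cast<std::uint8_t>((prev ^ b) & 1u);
            out.push_back(prev);
        }
        return out;
    }
};
```

Encoding frame A and then frame B leaves the encoder in a different state before B than encoding B first, so B's output differs between the two orders. A filter like this cannot simply be switched to parallel; only stages whose items are independent can.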
Something I did not mention previously is that there's a concurrent queue between the MPEG generator filter and the DVB-S2 filter, and I make the DVB-S2 filter wait for something in the queue before processing.
Yeah, I think I briefly saw the graph thread you are referring to. I'll check it out.
Waiting inside a task is never recommended, although I'm not sure what harm it would actually do in your situation, if any. But then I also don't know details about this queue. Note that there's already a queue into a serial stage, so you could build incremental state in a filter if you want, although the intermediate data items still have to be allowed to escape to the end of the pipeline, unfortunately.
The level of activity with a 3-serial-stage pipeline looks peculiar: there's only work for 3 threads, so maybe those other threads merely get teased with a possible bounty but instead of being able to steal a task they only spin briefly before going back to sleep. You might as well decrease the number of threads to 3 then, because there just isn't a lot of concurrency here.
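To put rough numbers on this, here is a back-of-the-envelope sketch in plain C++ (no TBB). It assumes every stage is serial, so each stage handles one item at a time and the slowest stage sets the pace. The function names are made up for illustration, and any stage times fed to it are guesses, not measurements from this application.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Upper bound on items per second for a pipeline whose stages are all serial:
// no stage can start its next item before finishing the current one, so the
// slowest stage limits the whole pipeline.
double serial_pipeline_throughput(const std::vector<double>& stage_seconds) {
    assert(!stage_seconds.empty());
    const double slowest =
        *std::max_element(stage_seconds.begin(), stage_seconds.end());
    return 1.0 / slowest;
}

// Average number of threads doing useful work at that rate:
// total work per item divided by the time the slowest stage needs per item.
double average_busy_threads(const std::vector<double>& stage_seconds) {
    assert(!stage_seconds.empty());
    const double total =
        std::accumulate(stage_seconds.begin(), stage_seconds.end(), 0.0);
    const double slowest =
        *std::max_element(stage_seconds.begin(), stage_seconds.end());
    return total / slowest;
}
```

With hypothetical stage times of 0.2 ms, 4 ms and 0.2 ms, this gives about 250 items per second and about 1.1 busy threads on average, which would look much like one thread pinned at 100% while the others mostly wait.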
Isn't the image processing amenable to parallel processing, though? I'd first go look for opportunities there (a parallel_for perhaps?), not in parallelising the pipeline, because probably you've got one very expensive stage and two cheap ones that don't really matter, so if you could tackle that expensive stage and break it down you would create a lot more parallel slack to be exploited.
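As a sketch of what breaking down an expensive stage could look like, here is a block FIR filter split across threads. It uses plain `std::thread` so that it stands alone; inside a TBB filter the same index-range split would more naturally go through `tbb::parallel_for`. The functions `fir_range`, `fir_serial` and `fir_parallel` are hypothetical names, and I'm assuming the heavy stage contains something FIR-like where each output sample depends only on the input samples, not on earlier outputs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// y[n] = sum_k taps[k] * x[n-k] for n in [begin, end).
// Samples before x[0] are treated as zero.
void fir_range(const std::vector<float>& x, const std::vector<float>& taps,
               std::vector<float>& y, std::size_t begin, std::size_t end) {
    for (std::size_t n = begin; n < end; ++n) {
        float acc = 0.0f;
        for (std::size_t k = 0; k < taps.size() && k <= n; ++k)
            acc += taps[k] * x[n - k];
        y[n] = acc;
    }
}

// Reference version: one thread computes the whole block.
std::vector<float> fir_serial(const std::vector<float>& x,
                              const std::vector<float>& taps) {
    std::vector<float> y(x.size());
    fir_range(x, taps, y, 0, x.size());
    return y;
}

// Same computation, with the output index range split into contiguous chunks.
// Each thread only reads x and taps and writes its own slice of y, so no
// locking is needed.
std::vector<float> fir_parallel(const std::vector<float>& x,
                                const std::vector<float>& taps,
                                unsigned num_threads) {
    assert(num_threads > 0);
    std::vector<float> y(x.size());
    std::vector<std::thread> workers;
    const std::size_t chunk = (x.size() + num_threads - 1) / num_threads;
    for (std::size_t begin = 0; begin < x.size(); begin += chunk) {
        const std::size_t end = std::min(begin + chunk, x.size());
        workers.emplace_back(
            [&x, &taps, &y, begin, end] { fir_range(x, taps, y, begin, end); });
    }
    for (std::thread& w : workers) w.join();
    return y;
}
```

The pipeline stage itself stays serial and in order; only the work inside one item is spread over several cores. Whether that pays off depends on the block size, since very small blocks would be dominated by thread start-up cost.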
Just some random thoughts, in the hope they make sense anyway (please correct me if they don't).
(Added) What exactly is mepeg, by the way? And does the generator generate real video or just test data, voiding my suggestion above to parallelise it instead of the pipeline?
The MPEG generator object just generates MPEG packets of random data. No video is actually being sent.
I think your suggestion makes a lot of sense, regardless of whether video is being processed or not. I think it will come down, as you suggest, to breaking down the most expensive stage in order to get more parallelism. Do you think using tbb::graph would help in this context?
It seems that graph uses enqueued tasks rather than spawned ones like the other algorithms at the moment (see elsewhere), breaking normal assumptions about recursive parallelism, so that may at least cause other threads to "steal" those tasks more readily than the ones they should be stealing instead or inject affinity-based tasks that would take higher priority, but I couldn't immediately tell you whether or not it is guaranteed to work.
These FIR filters will be part of an application that already uses tbb::pipeline, but since the incoming data for these FIR filters comes from other algorithms executed in these pipeline filters, could we still use a graph to encapsulate the FIR filters as illustrated below?
If possible, I'd recommend that you try replacing the whole pipeline with a graph. Calling a parallel_for from a graph node should work well. That would give a more holistic design, I think.
If the above is not an option, feeding the graph with the data from the last pipeline stage should work as well. When a worker thread is idle, it will first take enqueued tasks associated with the graph, and only then try to steal spawned tasks associated with the pipeline. And that seems quite appropriate for this case.
Between the time I posted the question and your response, I had time to try a few things. I did try to replace the whole pipeline with a graph, and it made things worse in terms of processing. The PC had a hard time keeping up, and I did not add any processing load (i.e. new DSP algorithms) to the DVB-S2 transmitter app. I just changed the threading framework from pipeline to graph. I might have done something wrong, but I'm kind of running out of time, so I went back to pipeline.
So if I want to interface a graph with the last pipeline stage, how would you proceed? I tried that route, but could not find a proper way to have a pipeline and a graph working together. If you can share some ideas on how it could be done, it would be greatly appreciated.
Can you schematically describe the last stage - the kind of graph that you see there, and how it processes the data? Some code showing the graph structure might be helpful if you don't mind sharing it. But in general, the pipeline should just put data to the starting node of the graph I think.
Also I would be interested to learn what went wrong when you tried to replace the whole pipeline with a graph. Maybe we can improve something as a result.