TBB and blocking don't mix very well. Consider waiting for input for a limited time only (zero or non-zero) and restarting the pipeline when new data is detected, provided you can tolerate the extra latency (average and maximum) of draining the pipeline. Depending on the statistical characteristics of the input, the input time limit may need some tuning for best results. Even if you aren't concerned about energy consumption, a stalled pipeline should be stopped so that it doesn't starve other parts of the program to varying degrees.
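To make the "wait for a limited time, then restart" idea concrete, here is a minimal sketch in standard C++ only. `TimedInputQueue` and `collect_batch` are hypothetical names invented for illustration, not TBB APIs, and the sketch assumes the input arrives as string packets from some producer thread. The idea is that the input side never blocks indefinitely: it gathers packets until the input stays quiet for the time limit, and each gathered batch would then be handed to one pipeline run that drains and stops.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical input queue: a producer pushes packets, the consumer
// waits for them with a time limit instead of blocking forever.
class TimedInputQueue {
public:
    void push(std::string packet) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            packets_.push_back(std::move(packet));
        }
        ready_.notify_one();
    }

    // Waits at most `limit` for a packet. Returns false on timeout.
    // A zero limit turns this into a simple poll.
    bool pop_for(std::chrono::milliseconds limit, std::string& out) {
        std::unique_lock<std::mutex> lock(mutex_);
        if (!ready_.wait_for(lock, limit, [this] { return !packets_.empty(); }))
            return false;
        out = std::move(packets_.front());
        packets_.pop_front();
        return true;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::deque<std::string> packets_;
};

// Gathers packets until the input stays quiet for `limit` or the batch is
// full. The caller would run the pipeline over the returned batch and let it
// drain, then call this again to restart once new data shows up.
std::vector<std::string> collect_batch(TimedInputQueue& queue,
                                       std::chrono::milliseconds limit,
                                       std::size_t max_batch) {
    assert(max_batch > 0);
    std::vector<std::string> batch;
    std::string packet;
    while (batch.size() < max_batch && queue.pop_for(limit, packet))
        batch.push_back(std::move(packet));
    return batch;
}
```

The `limit` argument is the tuning knob mentioned above: a short limit stops a stalled pipeline sooner but restarts it more often, while a long limit does the opposite.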
[cpp]
SerialRunUpcaseWordsTest = 5.440969 seconds
ParallelRunUpcaseWordsTest = 1.415301 seconds
Serial/Parallel = 3.844389
[/cpp]
BTW, the I/O buffer sizes are 10,000 bytes (same as in the TBB example). A larger buffer (e.g. a multiple of the sector size) might have yielded better performance. I have not run this against the equivalent TBB sample yet.
"As for the CPU utilization: the percentage would not necessarily reduce with increased number of cores; I would expect the execution time to decrease instead. If you talk about idle-spinning time only, then I agree CPU utilization should decrease; but the behavior I described above might be responsible for additional idle spinning."
Alexey, thanks for the insight into MKL. We will try to put some control on MKL's configuration and see what happens to the threads. However, I am not following your comment on system utilization. I originally had 2 cores being consumed for a total of 60%, generating a result on an input data packet in about 150 ms. I moved to 8 cores, and now I have 8 cores being consumed at 60% (of the total system) and still generating results in about 150 ms. If you are implying that a core's utilization should not change much, I would generally agree (depending on how much MKL is decomposing the work it is being given). But each core being consumed at 60% with the same final result as on a 2-core machine does not add up. I'm using 4x the processing of the Core 2 Duo and achieving the same result?
BTW, the algorithmic portion of this app is a port of existing code that used MKL on a 3-machine cluster of dual Intel Itaniums (SGI Altix). When it runs, it uses about 20% of the total CPU resources available (at least in one operational mode).
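For reference, the "4x the processing" figure can be checked with a few lines of arithmetic. This is only a sketch: `busy_core_seconds` is a hypothetical helper, and it assumes the reported utilization is a steady fraction of the whole system for the full 150 ms that one packet takes.

```cpp
#include <cassert>

// CPU time charged to one packet, in core-seconds, assuming `utilization`
// is the fraction of the whole system (all cores) kept busy while the
// packet is being processed.
double busy_core_seconds(int cores, double utilization, double latency_seconds) {
    assert(cores > 0 && utilization >= 0.0 && utilization <= 1.0);
    return cores * utilization * latency_seconds;
}

// Ratio of CPU time per packet between the two machines described above:
// 2 cores at 60% versus 8 cores at 60%, both at about 150 ms per packet.
double cost_ratio_8_vs_2_cores() {
    const double two_core = busy_core_seconds(2, 0.60, 0.150);
    const double eight_core = busy_core_seconds(8, 0.60, 0.150);
    return eight_core / two_core;
}
```

With these numbers the 2-core machine spends about 0.18 core-seconds per packet and the 8-core machine about 0.72, a factor of 4 for the same latency. That figure by itself does not show whether the extra time is useful work or idle spinning.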