Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

CPU affinity of a pipeline

Vishal_Sharma
Beginner
3,093 Views

Hello,

I have implemented an application that uses the Intel TBB pipeline pattern for parallel processing on an Intel Xeon CPU E5420 @ 2.50GHz running RHEL 6.

The application basically consists of 8 pipelines. Each pipeline has one token (making it one thread per pipeline). Each pipeline receives data from an endpoint and processes it to completion. I ran this application and collected general exploration analysis data using VTune Amplifier. The profiler reported a high CPI in the finish_task_switch function of the vmlinux module, which suggests that the kernel is spending considerable time on context switches, adversely affecting the performance of the application.

What I would like to understand is: why is the kernel performing so much context switching? Will each pipeline be scheduled on the same CPU? Is there a way to assign CPU affinity to each pipeline? How can I reduce this performance-impacting behavior? Please provide some optimization tips.
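For reference, a minimal sketch of the shape of each pipeline, using the classic tbb::filter constructor with the is_serial flag (the Packet type, the message count, and the per-stage work are placeholder stand-ins for the real endpoint code; the actual application runs 8 such pipelines, one per endpoint):

#include "tbb/pipeline.h"
#include "tbb/task_scheduler_init.h"

struct Packet { int payload; };              // hypothetical unit of endpoint data

class InputFilter : public tbb::filter {
    int remaining_;
public:
    InputFilter() : tbb::filter(/*is_serial=*/true), remaining_(100) {}
    void* operator()(void*)
    {
        if (remaining_-- <= 0) return NULL;  // NULL ends the pipeline run
        return new Packet();                 // one token carries one packet
    }
};

class ProcessFilter : public tbb::filter {
public:
    ProcessFilter() : tbb::filter(/*is_serial=*/false) {}  // parallel middle stage
    void* operator()(void* item)
    {
        static_cast<Packet*>(item)->payload = 42;  // placeholder "processing"
        return item;
    }
};

class OutputFilter : public tbb::filter {
public:
    OutputFilter() : tbb::filter(/*is_serial=*/true) {}
    void* operator()(void* item)
    {
        delete static_cast<Packet*>(item);   // consume and free the packet
        return NULL;
    }
};

int main()
{
    tbb::task_scheduler_init init;           // default number of worker threads
    InputFilter in; ProcessFilter proc; OutputFilter out;
    tbb::pipeline p;
    p.add_filter(in); p.add_filter(proc); p.add_filter(out);
    p.run(/*max_number_of_live_tokens=*/1);  // one token in flight, as described
    p.clear();
    return 0;
}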

Thank you,

Vishal

0 Kudos
36 Replies
Vishal_Sharma
Beginner
1,099 Views
Hi Raf,

As mentioned in one of my earlier replies, I was out sick and could not test the approach suggested to me earlier. Second, Intel TBB is a new concept for me. I am eager to try out suggestions made by experts such as yourself and to learn in the process. I have described my implementation in one of my replies today. If you think that the program does not make use of what TBB has to offer, please help me understand how I can use TBB correctly.
I am ready to provide any information that may help you make better suggestions.
Best Regards,
Vishal
0 Kudos
RafSchietekat
Valued Contributor III
1,099 Views
From my previous posting:
"The second experiment, after increasing the number of tokens, would be to keep that number at one and make the filters parallel, or apply the correct setting for each filter individually and try various numbers of tokens."

I don't have high hopes, but at least the program will look more like a real TBB program, making this more generally relevant.

I just looked up the specifications for your CPU, and I'm seeing a processor from 2007 with 4 cores and 4 threads, so that would be without hyperthreading. With 8 threads to begin with, you're already potentially oversubscribing the machine. I have to admit that I can't confidently predict what can be expected here, but context switching should not be a big surprise in the presence of oversubscription.

(Added) Even if the input and output filters should be serial_in_order while the middle filter can be parallel, also try making them all parallel with just 1 token.
0 Kudos
Vishal_Sharma
Beginner
1,099 Views
To begin with, I had the following settings:
- Input filter: Serial (is_serial = true),
- Process filter: Parallel (is_serial = false), and
- Output filter: Serial (is_serial = true)
- Number of pipelines = 8 (1 per destination),
- Number of tokens = 1 (per pipeline)
I built and ran the application on an Intel Xeon CPU X5680 @ 3.33GHz, which is a faster and newer processor (24 hardware threads). This should eliminate the potential oversubscription issue. The VTune performance analyzer reported high CPI and instruction stalls in the finish_task_switch function of vmlinux, the same as when running on the E5420 @ 2.50GHz CPU.
Next, I changed the settings as follows:
- Input filter: Serial (is_serial = true),
- Process filter: Serial (is_serial = true), and
- Output filter: Serial (is_serial = true)
- Number of pipelines = 8 (1 per destination),
- Number of tokens = 3 (per pipeline)
I built and ran the application on an Intel Xeon CPU X5680 @ 3.33GHz. The VTune performance analyzer reported high CPI and instruction stalls in TBB scheduler internals and the thread_return function of vmlinux, again the same as when running on the E5420 @ 2.50GHz CPU.
Next, I am working on the following settings:
- Input filter: Parallel (is_serial = false),
- Process filter: Parallel (is_serial = false), and
- Output filter: Parallel (is_serial = false)
- Number of pipelines = 8 (1 per destination),
- Number of tokens = 1 (per pipeline)
I am building and running the application on an Intel Xeon CPU X5680 @ 3.33GHz.
Regards,
Vishal
0 Kudos
RafSchietekat
Valued Contributor III
1,099 Views
So does the all-parallel situation run more efficiently or not? The idea is that it would run all on the same thread without stealing, and that might provide a clue, especially in comparison with a program that eliminates the pipeline and just repeatedly calls the filters in order.

Something else you could do is use task_scheduler_init with just 1 thread, to avoid the generation of any TBB worker threads that might go around stealing work.
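A minimal sketch of that experiment (assuming the pipeline is built elsewhere; the constructor argument is the total number of threads, so 1 means just the calling thread):

#include "tbb/task_scheduler_init.h"
#include "tbb/pipeline.h"

void runSingleThreaded(tbb::pipeline& p)
{
    // One thread total: the pipeline runs entirely on the calling thread,
    // and no TBB worker threads exist to steal work.
    tbb::task_scheduler_init init(1);
    p.run(/*max_number_of_live_tokens=*/1);
}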

If that doesn't show anything interesting, I would have to defer to somebody with more experience running TBB from multiple user threads (I prefer a setup without any significant blocking inside TBB code), sorry. Perhaps somebody from Intel has an idea?
0 Kudos
jimdempseyatthecove
Honored Contributor III
1,099 Views
Vishal,

Could you make a test program using the parallel_invoke outline above? This should be relatively easy for you to do.

Note, I understand that your actual application may not be able to use parallel_invoke due to variable numbers of pipelines and/or irregular start/finish times. The parallel_invoke method posted earlier is a quick way of making an assessment as to which route to take.

For an unknown number of pipelines (at programming time), but where all the pipeline requirements are determined at initialization time, AND where all of these will be launched and run to completion (as if a single task), THEN consider using the technique from the parallel Fibonacci method (with a slight change):

#include "tbb/parallel_invoke.h"
#include "tbb/task_scheduler_init.h"
using namespace tbb;

// Recursively split the inclusive range [n, m]; each leaf runs one
// pipeline to completion as a TBB task.
void doPipeline(int n, int m)
{
    if (n == m)
    {
        Flows Flow;        // Flows: your pipeline wrapper class
        Flow.SetupFlow(n); // *** remove boost thread creation, remove init
        Flow.StartFlow();
        return;
    }
    int step = (m - n) / 2; // split the range roughly in half
    parallel_invoke(
        [&]() { doPipeline(n, n + step); },
        [&]() { doPipeline(n + step + 1, m); });
}
...
int main(...)
{
    task_scheduler_init init(nHWthreads + pendingIoThreads);
    doPipeline(0, nFiles - 1); // one pipeline per file, indices 0..nFiles-1
}

Jim Dempsey
0 Kudos
Vishal_Sharma
Beginner
1,099 Views
Hello Jim and Raf,
It may be a while before I get to this as I have been pulled into looking at a critical issue. Please bare with me.

Thank you,
Vishal
0 Kudos
jimdempseyatthecove
Honored Contributor III
1,099 Views
Vishal,

In an earlier post I mentioned diminishing returns when increasing the number of parallel pipelines, as it relates to I/O times, specifically seek times. Not having your code makes it hard to assess your situation.

As a test example I used my threading toolkit, QuickThread, which has a parallel_pipeline; however, I am experimenting with something new at this time called Parallel Manifold. With Parallel Manifold you can construct the equivalent of a parallel_pipeline: file read stage, parallel process stage, file write stage. Parallel Manifolds are much more flexible than parallel_pipelines. Parallel Manifolds are a data-flow and resource-flow paradigm using the QuickThread task scheduler (similar to TBB in some respects).

The test program is a variant of what was in the first TBB book for parallel_pipeline: read a large sequential file, uppercasing each word, writing an output file. The upcase pipe is run in parallel.

My system is a Core i7 2600K Sandy Bridge, 4 cores with HT, running Windows 7 Pro x64.

For the test, I wanted to reduce disk seek latencies as much as possible. I placed the input file (1.3GB) on D:, a Seagate ST3160811AS, and wrote the output file(s) to C:, a Seagate ST3750630AS.

The file I/O uses simple fopen, fread, fwrite; IOW, no asynchronous I/O to Windows. I/O buffers were set at a respectable 128KB.

Baseline serial version of the process:

49.8 seconds, 26.7MB/s Read, 53.37MB/s Read+Write

"parallel pipeline" using Parallel Manifold paradigm

10 threads: 8 compute class threads, 2 I/O class threads
(QuickThread has two classes of threads)

10.7 seconds, 124.0MB/s Read, 241.6MB/s Read+Write, 4.65x that of serial

Note, I/O is one of the limiting factors.

Now for the part of this discussion that focuses on your problem.

Modify the above code to run eight instances of the above process (in one process, using one thread team).

All eight "parallel pipelines" used the same input file but different output files.
These pipelines (Manifolds) were not phase-synchronized, so the shared input file will not necessarily be read from the same position, nor re-use the same I/O buffer. Therefore reads might be received faster than if on separate files; however, read seek latencies and disk cache eviction will not be eliminated.

24 threads were used: 8 compute class threads + 16 I/O class threads
(each pipeline has a read and a write thread available, the same as the single pipeline earlier)

All 8 files in parallel

170.8 seconds, 62.2MB/s Read, 124.5MB/s Read+Write, 2.4x that of serial

** Note this is 8x the data
Looking at one of the pipelines:

170.8 seconds, 7.8MB/s Read, 15.5MB/s Read+Write, 0.3x that of serial

** (1/8 of the aggregate)

The important figures are:

241.6MB/s for single pipeline is reduced to
124.5MB/s for eight concurrent pipelines

What this means is that (on this system) the ratio of I/O latency to compute time does not favor using additional parallel pipelines. *** This is not a general statement. If you have a fast RAID system, I/O latency will be (should be) lower and will change the ratio (as to whether for good or bad, your mileage will vary).

Jim Dempsey
0 Kudos
RafSchietekat
Valued Contributor III
1,099 Views
"Please bare with me."
I categorically refuse!

P.S.: But I will "bear" with you. :-)
0 Kudos
RafSchietekat
Valued Contributor III
1,099 Views
When considering alternatives, I would probably do all the I/O on a separate thread (for lack of QuickThread's I/O tasks), building data items into a concurrent_queue, and use parallel_while/do to execute work with its feeder taking work from the queue. In the separate thread, select/poll would detect what input is ready, and parallel_for would process the current batch to prevent starvation of any input.
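Roughly like this, as a sketch (the message type, the counts, and the count-based end-of-input handling are all hypothetical; the real ioThread would select/poll the endpoints and push whatever input is ready):

#include "tbb/concurrent_queue.h"
#include "tbb/parallel_do.h"
#include "tbb/tbb_thread.h"
#include "tbb/atomic.h"
#include <cstdio>

struct Msg { int id; };                      // hypothetical message type

const int kTotal = 1000;                     // hypothetical message count
tbb::concurrent_bounded_queue<Msg> queue;
tbb::atomic<int> consumed;

// Stand-in for the dedicated I/O thread.
void ioThread()
{
    for (int i = 0; i < kTotal; ++i) {
        Msg m = { i };
        queue.push(m);                       // blocks when the queue is at capacity
    }
}

struct ProcessBody {
    void operator()(Msg m, tbb::parallel_do_feeder<Msg>& feeder) const
    {
        std::printf("processing %d\n", m.id); // placeholder for real work
        ++consumed;
        Msg next;
        if (queue.try_pop(next))
            feeder.add(next);                // keep feeding queued work
    }
};

int main()
{
    queue.set_capacity(64);                  // throttle: bounds memory if I/O outruns processing
    consumed = 0;
    tbb::tbb_thread io(ioThread);
    // parallel_do returns whenever the feeder runs dry, so keep restarting it;
    // a real program would use an explicit end-of-input signal instead of a count.
    while (consumed < kTotal) {
        Msg seed;
        queue.pop(seed);                     // wait for the next message
        tbb::parallel_do(&seed, &seed + 1, ProcessBody());
    }
    io.join();
    return 0;
}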

But I am still interested in the root cause of the perceived problem, because guaranteed concurrency between separate arenas is meant to be a feature.
0 Kudos
jimdempseyatthecove
Honored Contributor III
1,099 Views
>>I would probably do all the I/O on a separate thread...building data items into a concurrent_queue, and use parallel_while/do to execute work

You would potentially want multiple threads, one for each (potential) I/O.
The TBB concurrent_queue does not have a task initiator for the first-fill situation. Sure, you can do the select/poll, but this is equivalent to a spin-wait (with task stealing). One potential problem you have is when the I/O exceeds the processing capacity; in this case you may consume memory with unprocessed data. You can correct this by throttling the I/O thread (e.g. a count, or starvation for buffers), but then you must insert domain communication (e.g. WaitForSingleObject or condition variable(s)). This is unclean and open to programming error.

The Parallel Manifolds could be thought of as similar to concurrent_queue in being FIFO MPMC, but they also offer ordered output and various combinations of MP, SP, MC, SC; more importantly, they provide a task trigger mechanism, e.g. on first fill a manifold will initiate the consumer task(s). The manifolds are more complex than the simple description above, as you can have multiple input ports and/or output ports, each fed by one or more threads, with the trigger firing the consumer task only when all ports are satisfied and there is an available consumer. You can interlink the manifolds as you please (as long as it makes sense with your data/resource flow).

Jim Dempsey
0 Kudos
jimdempseyatthecove
Honored Contributor III
1,099 Views
>>P.S.: But I will "bear" with you. :-)

Good catch ;)

Perhaps this should be "beer" with me.

Jim
0 Kudos
RafSchietekat
Valued Contributor III
1,099 Views
"You would potentially want multiple threads, one for each (potential) I/O."
If it doesn't scale well, why go there at all?

"The TBB concurrent_queue does not have a task initiator on first fill situation."
Indeed, that would be nice to have, I've been thinking about this myself on several occasions.

"Sure you can do the select/poll, but this is equivalent to a spinwait (with task stealing)."
How so?

"One potential problem you have is when the I/O exceeds the processing capacity. In this case you may consume memory with unprocessed data. You can correct this by throttling the I/O thread (e.g. count, or starvation for buffers), but then you must insert domain communication (e.g. WaitForSingleEvent or condition variable(s)). This is unclean and open to programming error."
True, #29 was rather rash, even without the objection that parallel_while/do would jumble the data items between input and output (a single pipeline seems better, even if it imposes a global order on top of the desired per-channel order, probably adversely affecting latency, and it would definitely need complete messages as input to avoid outright starvation). But I hope I would have come to my senses early enough to try the relatively new TBB flow graph with libevent instead (although I secretly wish that Vishal doesn't do so before the current problem is diagnosed and perhaps solved...).
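For what it's worth, a minimal sketch of that flow-graph direction (the message type and per-node work are hypothetical; an event loop such as libevent would call try_put from its read callbacks, and a sequencer_node could restore per-channel order if required):

#include "tbb/flow_graph.h"

struct Msg { int channel; int payload; };    // hypothetical message type

int main()
{
    tbb::flow::graph g;

    // Parallel processing node: any number of messages in flight at once.
    tbb::flow::function_node<Msg, Msg> process(
        g, tbb::flow::unlimited,
        [](Msg m) -> Msg { m.payload *= 2; return m; });  // placeholder work

    // Serial output node: exclusive access to the output endpoint.
    tbb::flow::function_node<Msg, tbb::flow::continue_msg> output(
        g, tbb::flow::serial,
        [](const Msg&) { return tbb::flow::continue_msg(); });  // write elided

    tbb::flow::make_edge(process, output);

    // Here the event loop's read callback would inject messages:
    for (int i = 0; i < 8; ++i) {
        Msg m = { i, i };
        process.try_put(m);
    }
    g.wait_for_all();
    return 0;
}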
0 Kudos
jimdempseyatthecove
Honored Contributor III
1,099 Views
>>"You would potentially want multiple threads, one for each (potential) I/O."
If it doesn't scale well, why go there at all?

When a process incorporates I/O, scaling is not based on cores alone. Rather, it becomes an interrelationship amongst:

threads (not cores)
I/O subsystem
O/S driver capability

Vishal's problem description has 16 files (at least from my understanding of these forum threads).
If you do not oversubscribe the threads, then the time interval between (say) the file read statement and the return from the read (I/O completion) is stall time for the issuing thread. IOW, for each pending I/O you have one less compute thread available for doing work in your TBB thread pool (assuming TBB threads are performing I/O). To resolve this (within TBB), you oversubscribe threads under the assumption/requirement that you will have pending I/O operations. The degree of oversubscription depends on the I/O subsystem and on I/O latency compared against compute requirements. In situations where the (sum of) I/O latencies is a very small fraction of overall run time, you would not oversubscribe. As this ratio (I/O latency : compute) increases, you may find it beneficial to add threads, as sketched below.
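In TBB terms, a minimal sketch of that oversubscription (the pending-I/O count is hypothetical and would be tuned as described):

#include "tbb/task_scheduler_init.h"

int main()
{
    // Hardware threads plus roughly one extra software thread per I/O
    // operation expected to be pending at any given time.
    int nHWthreads = tbb::task_scheduler_init::default_num_threads();
    int pendingIoThreads = 8;    // hypothetical: tune per I/O subsystem
    tbb::task_scheduler_init init(nHWthreads + pendingIoThreads);
    // ... build and run the pipelines as before ...
    return 0;
}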

Observing the Task Manager (or its equivalent on Linux) can serve as an indicator for tuning the degree of oversubscription. In QuickThread this is a non-issue due to the dual classes of threads; as a result, the compute class can be maintained at the full complement of available hardware threads.

FWIW, when running the 8 concurrent pipelines, the I/O queue depth chart of the Resource Monitor showed a jagged 7-9 queued writes on the output disk. Only 8 could be pending from my app, so this must be an O/S buffered lazy-write issue. The input disk showed ~0 queued reads. This indicates that the entire input file had been cached, either by the disk read-ahead or by the system file cache (the system has more RAM than the file size). Had this been run without the additional threads, then either compute starves or I/O starves.

The published specification for this output drive was 120MB/s sustained writes. My single pipeline (Parallel Manifold) attained 124MB/s; the faster-than-published speed may be attributable to a sweet spot on the disk. The 8-pipeline aggregate write speed was 26.7MB/s; the drop in performance is due to seek latencies introduced by writing to 8 different output files (i.e. 8 separate output streams with significant seek distances, as opposed to streaming to a single file with track-to-track seeking). Had multiple output drives been available, targeting multiple output drives might have been in order.

Jim Dempsey
0 Kudos
RafSchietekat
Valued Contributor III
1,099 Views
Sorry for the confusion, I meant specifically that the program architecture does not scale well to many channels if they each require a software thread. And I don't think the original problem was about files on disk, not explicitly anyway.
0 Kudos
jimdempseyatthecove
Honored Contributor III
1,099 Views
>>Sorry for the confusion, I meant specifically that the program architecture does not scale well to many channels if they each require a software thread. And I don't think the original problem was about files on disk, not explicitly anyway.

Raf, if you dig through my comments you will find they indicate that using more pipelines than your I/O subsystem is capable of _effectively_ using is counterproductive. The 8-way pipeline using two disks shows a drop in aggregate throughput (to about half).

Additionally, the optimal tuning for I/O threads would be such that the disk queue has no more entries than necessary to keep it from falling to 0 (during peak periods of low-latency transactions).

To illustrate the point, I tuned for 8 compute threads, 1 I/O thread, and two parallel pipelines (using Parallel Manifolds), each pipeline writing to a separate disk...

1 I/O thread: 124.77MB/s (write), virtually no difference from using 4 threads (for the dual pipeline)
2 I/O threads: 124.48MB/s

Reconfigured to use compute class threads for I/O (0 I/O threads), as one would for a TBB pipeline:

0 I/O threads: 65.71MB/s

Oversubscribed the compute class by 1 thread, 0 I/O class threads:

+1 compute, 0 I/O: 66.03MB/s

As you can observe, oversubscribing a single compute class (as with TBB) yields a slight performance gain (from 65.71MB/s to 66.03MB/s), whereas using an additional thread separated from the compute class (as in QuickThread) yields a significant performance gain (65.71MB/s to 124.77MB/s).

I fully expect that using a TBB parallel pipeline in conjunction with non-TBB threads for performing the I/O would yield a similar improvement over using TBB threads for I/O.
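A rough sketch of that combination (the Buffer type, queue capacity, and NULL-sentinel shutdown are hypothetical; the writer runs on a plain tbb_thread outside the TBB worker pool):

#include "tbb/concurrent_queue.h"
#include "tbb/pipeline.h"
#include "tbb/tbb_thread.h"

struct Buffer { /* data to be written */ };   // hypothetical buffer type

tbb::concurrent_bounded_queue<Buffer*> writeQueue;

// Non-TBB writer thread: the only place that blocks on the disk.
void writerThread()
{
    for (;;) {
        Buffer* b;
        writeQueue.pop(b);                    // blocks until a buffer arrives
        if (b == NULL) break;                 // NULL sentinel: pipeline is done
        // fwrite(...) would go here; TBB workers never stall on it
        delete b;
    }
}

// Output filter: instead of calling fwrite itself (stalling a TBB worker),
// it just enqueues the buffer and returns immediately.
class EnqueueFilter : public tbb::filter {
public:
    EnqueueFilter() : tbb::filter(/*is_serial=*/true) {}
    void* operator()(void* item)
    {
        writeQueue.push(static_cast<Buffer*>(item));
        return NULL;
    }
};

int main()
{
    writeQueue.set_capacity(16);              // throttle if the disk falls behind
    tbb::tbb_thread writer(writerThread);
    // ... build the pipeline with EnqueueFilter as the last stage and run it ...
    writeQueue.push(static_cast<Buffer*>(NULL)); // signal end of output
    writer.join();
    return 0;
}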

A little extra work can pay big dividends.

Jim Dempsey
0 Kudos
RafSchietekat
Valued Contributor III
1,099 Views
OK, one last try... I proposed one I/O thread (which doesn't seem such a bad idea according to #35), you countered with "one thread per I/O" (which I interpreted as O(8) threads, basically the situation that this forum thread is really about), I objected to that (which I think you interpreted as objecting to 1 thread per disk device), and I hope we can now agree that really we more or less agreed all along. :-)

Optimal tuning for physical devices seems like a difficult problem, perhaps beyond the scope of this forum. Still, your mileage may differ with different file systems: have you seen the same degradation with concurrent pipelines using, say, ZFS, even on the same number of disks (just from hearsay, though)?
0 Kudos