Hello,
I have implemented an application that uses the Intel TBB pipeline pattern for parallel processing on an Intel Xeon CPU E5420 @ 2.50GHz running RHEL 6.
The application consists of 8 pipelines. Each pipeline has one token (making it one thread per pipeline). Each pipeline receives data from an endpoint and processes it to completion. I ran this application and collected General Exploration analysis data using VTune Amplifier. The profiler reported a high CPI in the finish_task_switch function of the vmlinux module, which suggests that the kernel is spending a lot of time performing context switches, adversely affecting the performance of the application.
What I would like to understand is: why is the kernel performing so much context switching? Will each pipeline be scheduled on the same CPU? Is there a way to assign CPU affinity to each pipeline? How can I reduce this performance-impacting behavior? Please provide some optimization tips.
Thank you,
Vishal
"The second experiment, after increasing the number of tokens, would be to keep that number at one and make the filters parallel, or apply the correct setting for each filter individually and try various numbers of tokens."
I don't have high hopes, but at least the program will look more like a real TBB program, making this more generally relevant.
I just looked up the specifications for your CPU, and I'm seeing a processor from 2007, with 4 cores with 4 threads, so that would be without hyperthreading. With 8 threads to begin with, you're already potentially oversubscribing the machine. I have to admit that I can't confidently predict what can be expected here, but context switching should not be a big surprise in the presence of oversubscription.
(Added) Even if the input and output filters should normally be serial_in_order while the middle filter can be parallel, also try making them all parallel with just 1 token.
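A minimal sketch of what that experiment might look like with the tbb::parallel_pipeline API; receive_packet, process_packet and send_packet are hypothetical stand-ins for the endpoint I/O and processing in the actual application:

```cpp
#include <tbb/pipeline.h>

struct Packet;

// Hypothetical helpers standing in for the real endpoint I/O.
Packet* receive_packet();            // returns nullptr at end of stream
Packet* process_packet(Packet* p);
void    send_packet(Packet* p);

void run_flow()
{
    tbb::parallel_pipeline(
        /*max_number_of_live_tokens=*/1,    // start with 1 token, then vary
        tbb::make_filter<void, Packet*>(
            tbb::filter::parallel,          // try parallel even for input
            [](tbb::flow_control& fc) -> Packet* {
                Packet* p = receive_packet();
                if (!p) fc.stop();
                return p;
            })
        & tbb::make_filter<Packet*, Packet*>(
            tbb::filter::parallel,
            [](Packet* p) { return process_packet(p); })
        & tbb::make_filter<Packet*, void>(
            tbb::filter::parallel,          // try parallel for output too
            [](Packet* p) { send_packet(p); }));
}
```

Switching the first and last filters back to tbb::filter::serial_in_order is then a one-word change per filter, which makes it easy to compare the variants Raf suggests.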
Something else you could do is use task_scheduler_init with just 1 thread, to avoid the generation of any TBB worker threads that might go around stealing work.
If that doesn't show anything interesting, I would have to defer to somebody with more experience running TBB from multiple user threads (I prefer a setup without any significant blocking inside TBB code), sorry. Perhaps somebody from Intel has an idea?
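In code, that experiment is just the following (a sketch, assuming each of the 8 user threads runs a pipeline and the pre-TBB-4.x tbb::task_scheduler_init API):

```cpp
#include <tbb/task_scheduler_init.h>

int main()
{
    // Restrict TBB to a single worker thread so no extra workers are
    // created that might go around stealing work across the 8 pipelines.
    tbb::task_scheduler_init init(1);

    // ... construct and run the pipelines here ...
    return 0;
}
```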
Could you make a test program using the parallel_invoke outline above? This should be relatively easy for you to do.
Note, I understand that your actual application may not be able to use parallel_invoke due to variable numbers of pipelines and/or irregular start/finish times. The parallel_invoke method posted earlier is a quick way of making an assessment as to the route to take.
For an unknown number of pipelines (at programming time), but where all the pipeline requirements are determined at initialization time, AND where all of these will be launched and run to completion (as if a single task), THEN consider using the technique as done in the parallel Fibonacci method (with a slight change):
void doPipeline(int n, int m)   // runs pipelines n..m inclusive
{
    if (n == m)
    {
        Flows Flow;
        Flow.SetupFlow(n); // *** remove boost thread creation, remove init
        Flow.StartFlow();
        return;
    }
    int split = n + (m - n) / 2;   // bisect the inclusive range
    parallel_invoke(
        [&](){ doPipeline(n, split); },
        [&](){ doPipeline(split + 1, m); });
}
...
int main(...)
{
    task_scheduler_init init(nHWthreads + pendingIoThreads);
    doPipeline(0, nFiles - 1);     // inclusive range covering nFiles pipelines
}
Jim Dempsey
In an earlier post I mentioned diminishing returns related to increasing the number of parallel pipelines as it affects I/O times, specifically seek times. Not having your code makes it hard to assess your situation.
As a test example I used my threading toolkit, QuickThread, which has a parallel_pipeline; however, I am experimenting with something new at this time called Parallel Manifold. With Parallel Manifold you can construct the equivalent of a parallel_pipeline: file read stage, parallel process stage, file write stage. Parallel Manifolds are much more flexible than parallel_pipelines. Parallel Manifolds are a data-flow and resource-flow paradigm using the QuickThread task scheduler (similar to TBB in some respects).
The test program is a variant of what was in the first TBB book for parallel_pipeline: read a large sequential file, uppercase each word, and write an output file. The upcase pipe is run in parallel.
My system is a Core i7 2600K Sandy Bridge, 4 cores with HT, running Windows 7 Pro x64.
For the test, I wanted to reduce disk seek latencies as much as possible. I placed the input file (1.3GB) on D: (Seagate ST3160811AS) and wrote the output file(s) to C: (Seagate ST3750630AS).
The file I/O uses simple fopen, fread, fwrite; IOW no asynchronous I/O to Windows. I/O buffers were set at a respectable 128KB.
Base line Serial version of the process:
49.8 seconds, 26.7MB/s Read, 53.37MB/s Read+Write
"parallel pipeline" using Parallel Manifold paradigm
10 threads: 8 compute class threads, 2 I/O class threads
(QuickThread has two classes of threads)
10.7 seconds, 124.0MB/s Read, 241.6MB/s Read+Write, 4.65x that of serial
Note, I/O is one of the limiting factors.
Now for the part of this discussion that focuses on your problem.
Modify the above code to run eight instances of the above process (in one process, using one thread team).
All eight "parallel pipelines" used the same input file but different output files.
These pipelines (Manifolds) were not phase-synchronized, so the shared input file will not necessarily be read from the same position, nor will the same I/O buffer be re-used. Therefore reads might be received faster than if on separate files; however, read seek latencies and disk cache eviction will not be eliminated.
24 threads were used: 8 Compute Class threads, + 16 I/O class threads
(each pipeline has a Read and Write thread available, same as one pipeline earlier)
All 8 files in parallel
170.8 seconds, 62.2MB/s Read, 124.5MB/s Read+Write, 2.4x that of serial
** Note this is 8x the data
Looking at one of the pipelines
170.8 seconds, 7.8MB/s Read, 15.5MB/s Read+Write, 0.3x that of serial
** (1/8 the aggregate)
The important figures are:
241.6MB/s for single pipeline is reduced to
124.5MB/s for eight concurrent pipelines
What this means is that (on this system) the ratio of I/O latency to compute time does not favor using additional parallel pipelines. *** This is not a general statement. If you have a fast RAID system, I/O latency will be (should be) lower, which changes the ratio (as to whether that is good or bad, your mileage will vary).
Jim Dempsey
I categorically refuse!
P.S.: But I will "bear" with you. :-)
But I am still interested in the root cause of the perceived problem, because guaranteed concurrency between separate arenas is meant to be a feature.
You would potentially want multiple threads, one for each (potential) I/O.
The TBB concurrent_queue does not have a task initiator for the first-fill situation. Sure, you can do the select/poll, but this is equivalent to a spinwait (with task stealing). One potential problem you have is when the I/O exceeds the processing capacity. In this case you may consume memory with unprocessed data. You can correct this by throttling the I/O thread (e.g. by count, or by starvation for buffers), but then you must insert domain communication (e.g. WaitForSingleObject or condition variable(s)). This is unclean and open to programming error.
The Parallel Manifolds could be thought of as similar to concurrent_queue in respect of being FIFO MPMC, but they also offer ordered output and various combinations of MP, SP, MC, SC; more importantly, they provide a task trigger mechanism, e.g. on first fill they will initiate the consumer task(s). The manifolds are more complex than the simple description above, as you can have multiple input ports and/or output ports, each fed by one or more threads, with the trigger firing the consumer task only when all ports are satisfied and there is an available consumer. You can interlink the manifolds as you please (as long as it makes sense with your data/resource flow).
Jim Dempsey
Good catch ;)
Perhaps this should be "beer" with me.
Jim
If it doesn't scale well, why go there at all?
"The TBB concurrent_queue does not have a task initiator on first fill situation."
Indeed, that would be nice to have, I've been thinking about this myself on several occasions.
"Sure you can do the select/poll, but this is equivalent to a spinwait (with task stealing)."
How so?
"One potential problem you have is when the I/O exceeds the processing capacity. In this case you may consume memory with unprocessed data. You can correct this by throttling the I/O thread (e.g. count, or starvation for buffers), but then you must insert domain communication (e.g. WaitForSingleEvent or condition variable(s)). This is unclean and open to programming error."
True, #29 was rather rash, even without the objection that parallel_while/do would jumble the data items between input and output (a single pipeline seems better, even if it imposes a global order above the desired per-channel order, probably adversely affecting latency, and it would definitely need complete messages as input to avoid outright starvation). But I hope I would have come to my senses early enough to try the relatively new TBB flow graph with libevent instead (although I secretly wish that Vishal doesn't do so before the current problem is diagnosed and perhaps solved...).
If it doesn't scale well, why go there at all?
When a process incorporates I/O, then scaling is not based upon cores alone. Rather, it becomes an interrelationship amongst:
threads (not cores)
I/O subsystem
O/S driver capability
Vishal's problem description has 16 files (at least from my understanding of these forum threads).
If you do not oversubscribe the threads, then the time interval between (say) the file read statement and the return from the read (I/O completion) is stall time for the issuing thread. IOW, for each pending I/O you have one less compute thread available for doing work in your TBB thread pool (assuming TBB threads are performing I/O). To resolve this (within TBB), you oversubscribe threads under the assumption/requirement that you will have pending I/O operations. The degree of oversubscription would depend upon the I/O subsystem and the I/O latency compared against compute requirements. In situations where the (sum of) I/O latencies is a very small fraction of overall run time, you would not oversubscribe. As this ratio (I/O latency : compute) increases, you may find it beneficial to add additional threads.
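Jim's sizing rule can be sketched as a small helper (the function name and parameter are hypothetical; the idea is just hardware threads plus I/O operations expected to be blocked at any instant):

```cpp
#include <thread>

// Oversubscription rule of thumb: one software thread per hardware
// thread, plus one per I/O operation expected to be blocked at any
// given time (blocked threads contribute no compute while waiting).
unsigned pool_size(unsigned expected_pending_io)
{
    unsigned hw = std::thread::hardware_concurrency();
    if (hw == 0) hw = 1;   // hardware_concurrency() may report 0 if unknown
    return hw + expected_pending_io;
}
```

This is the same shape as the earlier `task_scheduler_init init(nHWthreads + pendingIoThreads);` line, with the caveat Jim gives: if I/O latency is a small fraction of run time, pass 0 and do not oversubscribe at all.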
Observing the Task Manager (or equivalent on Linux) can be used as an indicator for tuning the degree of over-subscription. In QuickThread this is a non-issue due to the dual classes of threads, and as a result the compute class can be maintained at the full compliment of available hardware threads.
FWIW When running the 8 concurrent pipelines, the I/O queue depth chart of the Resource Manager showed a jagy 7-9 queued writes on the output disk.Only 8 could be pending from my app, so this must be an O/S buffered lazy writeissue.The input disk showed ~0 queued reads. This indicates that the entire input file had been cached either by the disk read ahead or by the system file cache (system has more RAM than file size). Had this been run without the additional threads, then either computestarves or I/O starves.
The published specifications for this output drive was 120MB/s sustained writes. My single pipeline (Parallel Manifold) attained 124MB/s. The faster than published speed may be attributable tosweet spot on disk.The 8 pipeline performance aggregatewrite speed was 26.7MB/s, the drop in performance is due to seek latencies introduced by writing to 8 different output files (i.e. 8 separate output streams with significant seek distances as opposed to streaming to single file with track-to-track seeking). Had multiple output drives been available, then targeting multiple output drives may have been in order.
Jim Dempsey
Raf, if you dig through my comments you will find they indicate that using more pipelines than your I/O subsystem is capable of _effectively_ using is counterproductive. The 8-way pipeline test using two disks shows a drop in aggregate throughput (to about half).
Additionally, the optimal tuning for I/O threads would be such that the disk queue has no more entries than necessary to keep it from falling to 0 (during peak periods of low-latency transactions).
To illustrate the point, I will tune for 8 compute threads, 1 I/O thread, two parallel pipelines (using Parallel Manifolds), each pipeline writing to separate disk....
1 I/O thread: 124.77MB/s (write), virtually no difference from using 4 threads (for dual pipeline)
2 I/O threads: 124.48
Reconfigure to use compute class threads for I/O (0 I/O threads), as would for TBB pipeline:
0 I/O threads: 65.71MB/s
Oversubscribe compute class by 1 thread, 0 I/O class threads
+1 compute, 0 I/O: 66.03MB/s
As you can observe, oversubscribing a single compute class (as with TBB) yields a slight performance gain (from 65.71MB/s to 66.03MB/s), whereas using an additional thread separated from the compute class (as in QuickThread) yields a significant performance gain (65.71MB/s to 124.77MB/s).
I fully expect that using the TBB parallel pipeline in conjunction with non-TBB threads for performing the I/O would yield a similar improvement over using TBB threads for I/O.
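One portable way to sketch that separation is to prefetch the next block on a dedicated async thread while the current block is being processed, so compute threads never stall on the blocking read; read_block here is a stand-in for real file I/O (fread in the actual application):

```cpp
#include <cstddef>
#include <future>
#include <vector>

// Stand-in for a blocking read; in the real application this would be
// an fread() of one buffer from the input file.
std::vector<char> read_block(int index)
{
    return std::vector<char>(1024, static_cast<char>('a' + index % 26));
}

// Overlap I/O and compute: while block n is being processed, block n+1
// is already being read on a separate (non-worker) thread.
std::size_t process_file(int nblocks)
{
    std::size_t processed = 0;
    std::future<std::vector<char>> next =
        std::async(std::launch::async, read_block, 0);
    for (int i = 0; i < nblocks; ++i) {
        std::vector<char> block = next.get();   // wait for the pending read
        if (i + 1 < nblocks)                    // prefetch the next block
            next = std::async(std::launch::async, read_block, i + 1);
        processed += block.size();              // "compute" on this block
    }
    return processed;
}
```

In a TBB pipeline the same idea would put the std::async (or a dedicated std::thread) behind the input filter, keeping the TBB workers purely on compute, which is effectively what QuickThread's I/O class of threads provides built in.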
A little extra work can pay big dividends.
Jim Dempsey
Optimal tuning for physical devices seems like a difficult problem, perhaps beyond the scope of this forum. Still, your mileage may differ with different file systems: have you seen the same degradation with concurrent pipelines using, say, ZFS, even on the same number of disks (just from hearsay, though)?