Hello,
I have implemented an application that uses the Intel TBB pipeline pattern for parallel processing on an Intel Xeon CPU E5420 @ 2.50GHz running RHEL 6.
The application basically consists of 8 pipelines. Each pipeline has one token (making it one thread per pipeline). Each pipeline receives data from an endpoint and processes it to completion. I ran this application and collected General Exploration analysis data using VTune Amplifier. The profiler reported a high CPI in the finish_task_switch function of the vmlinux module, which suggests that the kernel is spending significant time performing context switches, adversely affecting the performance of the application.
What I would like to understand is why the kernel is performing so much context switching. Will each pipeline be scheduled on the same CPU? Is there a way to assign CPU affinity to each pipeline? How can I reduce this performance-impacting behavior? Please provide some optimization tips.
Thank you,
Vishal
Do you mean there is one end point for all 8 pipelines
or
8 end points, one for each of the 8 pipelines?
An example of the former (1 end point for all 8 pipelines) is reading a sequential file into a buffer, then processing the buffer(s) 8 ways. In this case you should have 1 pipeline, with at least 8 tokens.
If you have the latter (separate end point per pipeline), then I suggest you eliminate the pipeline and run it as 8 separate tasks.
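For the latter case, a minimal sketch of "8 separate tasks" using tbb::task_group; ProcessEndpoint is a hypothetical stand-in for the per-endpoint receive-and-process loop, not anything from the original application:

#include "tbb/task_group.h"
#include <cstdio>

// Placeholder for the per-endpoint work (receive data, process to completion).
void ProcessEndpoint(int i) { std::printf("endpoint %d done\n", i); }

int main() {
    tbb::task_group g;
    for (int i = 0; i < 8; ++i)
        g.run([=]{ ProcessEndpoint(i); }); // one independent task per endpoint
    g.wait();                              // block until all 8 complete
    return 0;
}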
Jim Dempsey
Are Input and Output performing I/O?
If so, the fread and fwrite (or other R/W API) will introduce a thread delay which you would like to recover for productive use.
Each pipeline's tokens are separate from the other pipelines' tokens. Use at least 3 tokens per pipeline (5 would be better; any more may exhibit diminishing returns). Three provides for each pipeline to have a working thread in the Process pipe while the input and output threads are blocked waiting for I/O completion.
Oversubscribe the thread pool in the init(..). The amount of oversubscription would depend upon the number of Reads and Writes concurrently blocked waiting for I/O completion. This is more of a tuning process than a fixed formula.
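A minimal sketch of that oversubscription; the extra count of 4 is an assumed starting point to tune, not a formula:

#include "tbb/task_scheduler_init.h"

int main() {
    // The default pool size equals the hardware thread count; add a few
    // extra software threads to cover those blocked in Read/Write.
    int nHW = tbb::task_scheduler_init::default_num_threads();
    tbb::task_scheduler_init init(nHW + 4); // 4 is a tuning guess, not a rule
    // ... construct and run the pipelines here ...
    return 0;
}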
BTW, in QuickThread the Input and Output pipes would be scheduled on I/O class threads thus avoiding oversubscription of the compute class threads. Also, the QuickThread parallel_pipeline can be setup to be NUMA as well as CPU Affinity sensitive (see www.quickthreadprogramming.com).
Jim Dempsey
No, potentially you are increasing the task count to 3*8=24 (to be performed by the available threads).
Note, IIF (if and only if) tasks (such as your input pipe and output pipe) experience thread stalls due to I/O (as they will with Read and Write), then (then and only then) consider oversubscribing the thread count by the number of threads that could be (typically are) stalled at any point in time. You may have to experiment with the over-subscription count to find the sweet spot.
>>2. The input has to be processed in the order it is received.
Filters have three modes:
a) parallel (any order)
b) serial_out_of_order (one at a time, no particular order)
c) serial_in_order (one at a time, in sequence)
Make each of your pipes (filter stages) serial_in_order.
This way, each stage can run concurrently in different threads, with the restriction of processing tokens in order received.
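A minimal, hedged sketch of mode (c) applied to one pipeline, assuming the tbb::parallel_pipeline/make_filter interface; the stage bodies are trivial placeholders, not the application's actual filters. Swapping the enum for tbb::filter::parallel or tbb::filter::serial_out_of_order gives modes (a) or (b):

#include "tbb/pipeline.h"
#include "tbb/task_scheduler_init.h"
#include <cstdio>

int main() {
    tbb::task_scheduler_init init;
    int item = 0;
    tbb::parallel_pipeline(3 /* tokens */,
        tbb::make_filter<void, int>(tbb::filter::serial_in_order,   // Input
            [&](tbb::flow_control& fc) -> int {
                if (item >= 8) { fc.stop(); return 0; }
                return item++;
            }) &
        tbb::make_filter<int, int>(tbb::filter::serial_in_order,    // Process
            [](int v) { return v * v; }) &
        tbb::make_filter<int, void>(tbb::filter::serial_in_order,   // Output
            [](int v) { std::printf("%d\n", v); }));
    return 0;
}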
Tn = Thread n
Pn = Pipeline n
P0: T0 (Input, stalled at Read file R0), T1 (processing), T2 (output, stalled at Write file W0)
P1: T3 (Input, stalled at Read file R1), T4 (processing), T5 (output, stalled at Write file W1)
...
P7: T21 (Input, stalled at Read file R7), T22 (processing), T23 (output, stalled at Write file W7)
Notes:
Tokens of each pipeline circulate from the output stage back to the input stage.
Although 3 tokens per pipeline would be minimal, you may find it beneficial to use more, because at times you may experience high seek latencies performing I/O to 16 files. Experimentation will tell.
Also, latencies will change from system to system and disk drive placements of files (if you have that flexibility).
There is a difference between the TBB parallel_pipeline and the QuickThread parallel_pipeline.
When a pipeline has an input filter (pipe segment), some number of processing filter(s), and an output filter, you can construct the pipeline such that the input pipe (receiving empty buffers) is run by a single I/O class thread, and the output pipe (the writing pipe), run by a single but different I/O class thread, writes the buffers in collated order (collation order specified by the input pipe). This permits you to effectively have sequential input, parallel (any order) internal processing, and sequential (collated) output. This does require that your "process" function(s) be thread safe. If this is not suitable, then you can make the internal pipes serial.
If you have much work invested in TBB, I suggest you stick with TBB. If this is a new conversion of an application, then you could experiment with QuickThread (one word, no "s" at the end).
Jim Dempsey
I think I've been on the side of doing I/O and computation on the same core once before, myself, but if I remember well it was deemed of little value. I guess it depends on how much computation there really is. If the computation is trivial, then I would forget about pipelines, certainly ones where the first and the third filter are merely about I/O that might stall a thread. If the computation is CPU-intensive, then I would certainly want to know whether it makes a real difference to have the data in cache first, and even then I would prefer to dedicate some threads to I/O without involving TBB yet, sidestepping the oversubscription workaround, since it seems difficult to have a situation that's both CPU-bound and memory-bound. Then maybe you could have a stage to warm up the data (if you're so convinced it's an issue that you would dedicate the development time to experiment with that). But didn't somebody invent hyperthreading (which by any other name smells as sweet) specifically to keep CPUs busy while waiting for memory in one of their hardware threads, as long as the data motion doesn't become cumulative and degenerate into thrashing, which doesn't seem to be the case here?
If I'm mistaken, somebody please convince me otherwise.
P.S.: I think Jim and myself basically gave the same advice about increasing the number of tokens (he less tersely so), so how about some feedback about the results of that?
(Edited: some trimming.)
Your question is incomplete; perhaps you are missing a critical point about tasks, software threads, and hardware threads (and pipelines).
A pipeline with 3 filters essentially represents 3 tasks. Tasks do not run until all of the following hold:
a) they are enqueued
b) a hardware thread is scheduled by the O/S to run a software thread (of TBB)
c) a software thread within the TBB thread pool takes the enqueued task request
(provided it is not busy running some other task)
A pipeline with 0 tokens (abstract picture for you):
P: (task 1 waiting for token), (task 2 waiting for token), (task 3 waiting for token)
-------------------------------------------------------
A pipeline with 1 token (abstract picture for you):
At T=1
P: task 1 (potentially) running (potentially stalled at Read), (task 2 waiting for token), (task 3 waiting for token)
At T=2
P: task 1 waiting for token, task 2 (potentially) running, task 3 waiting for token
At T=3
P: task 1 waiting for token, task 2 waiting for token, task 3 (potentially) running (potentially stalled at Write)
...back to T=1...
Note, "(potentially) running" means running iff a hardware thread is available .AND. a software thread is available (i.e. not running some other task).
With 1 token, only one of Input, Process, Output can (potentially) be running (or potentially stalled, in the case of Input or Output tasks).
With 3 tokens (and after the first 2 have been read):
P: Task 1 (potentially) running (potentially stalled at Read), Task 2 (potentially) running, Task 3 (potentially) running (potentially stalled at Write).
With 3 tokens you could (potentially) have a Read in progress, concurrent with a Process, concurrent with a Write in progress.
The P: description above is but one of the possible states. You could potentially have:
P: Task 1 waiting for token, Task 2 two tokens in queue, one token (potentially) running, Task 3 waiting for token
By having more than 3 tokens, say 5, you could (potentially) have:
P: Task 1 (potentially) running (potentially stalled at Read), Task 2 two tokens in queue, one token (potentially) running, Task 3 (potentially) running (potentially stalled at Write)
The actual experience will vary from the above, but it should give you a better description of what may happen.
Your application (from your sketch) has 8 input files and 8 output files (potentially 16 I/Os in flight). Should your system have but 1 spindle (one disk), you will be experiencing large seek latencies. To reduce seek latencies you might consider having each file read several buffers in a row, send each buffer (token) to the Process pipe, then have each file write several buffers. The TBB parallel_pipeline is not configurable to do this (neither is the QuickThread parallel_pipeline); however, you can recode to do something like this sketch:
Input pipe:
// Read 1 to 4 buffers (short reads on EOF)
for(Token.nBuffers = 0; Token.nBuffers < 4; ++Token.nBuffers)
    if(Read(Token.buffer, Token.nBuffers)) break; // break on EOF
Process pipe:
if(Token.nBuffers)
{
    parallel_invoke( // or use parallel_for_each, or...
        [&](){ Process(Token.buffer, 0); },
        [&](){ if(Token.nBuffers > 1) Process(Token.buffer, 1); },
        [&](){ if(Token.nBuffers > 2) Process(Token.buffer, 2); },
        [&](){ if(Token.nBuffers > 3) Process(Token.buffer, 3); });
}
Output pipe:
for(int i = 0; i < Token.nBuffers; ++i)
    Write(Token.buffer, i); // write each buffer, in order
Jim Dempsey
Did you try to check the priorities of your main process and all threads?
There is a possibility that TBB "switched" priorities to HIGH_PRIORITY_CLASS for the process
and THREAD_PRIORITY_HIGHEST for all threads. That would explain why there are so many context switches.
On Windows platforms these Win32 API functions will get the priorities:
...
if( ::GetPriorityClass( ::GetCurrentProcess() ) == HIGH_PRIORITY_CLASS )
...
if( ::GetThreadPriority( ::GetCurrentThread() ) == THREAD_PRIORITY_HIGHEST )
...
Note: I just realized that you do the job in Linux...
At lower priorities, processes and threads will have fewer context switches and will spend more time
on processing of your data.
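Since the platform is Linux, a rough sketch of the equivalent check there, using the standard getpriority() and pthread_getschedparam() calls (the output format is just illustrative):

#include <sys/resource.h>   // getpriority
#include <pthread.h>        // pthread_getschedparam
#include <sched.h>          // sched_param
#include <cstdio>

int main() {
    // Nice value of the calling process (0 = default; lower = higher priority)
    int nice_value = getpriority(PRIO_PROCESS, 0);

    // Scheduling policy and priority of the calling thread
    int policy;
    sched_param param;
    pthread_getschedparam(pthread_self(), &policy, &param);

    std::printf("nice=%d policy=%d sched_priority=%d\n",
                nice_value, policy, param.sched_priority);
    return 0;
}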
Best regards,
Sergey
Then again, this (undocumented) priority boost is only performed "#if _MSC_VER", and the platform in question is Linux.
Messing around with thread priorities is often counter-productive. The natural tendency is "my program is more important than yours/someone else's"; therefore, I set all my threads to the highest priority. And all the other programmers make the same decision.
What is not stated by Vishal is whether he is doing something he ought not to be doing (which is obvious once you know you have been bitten). Pseudo code:
main()
{
    spawn 8 threads (pthreads, _beginthread, whatever)
    { // each thread
        tbb::init(...
        parallel_pipeline(...
    }
    join
}
In the above, you will be generating 8 TBB contexts. IOW, you will be oversubscribed by 8x.
Messing around with thread priorities "can be" productive in circumstances where you know your application has oversubscribed threads .AND. you know which threads need a boost. In TBB, tasks are not in control of which thread takes the task. Therefore you have no advance way of knowing which thread's priority to boost _prior_ to it taking the task request. This leaves raising the priority of all your application's TBB pool threads as your only choice, and thus getting into a shoving match with other applications (or with your own application, in the event you choose to oversubscribe).
In TBB, you can resolve this in two ways:
a) Use extra non-TBB threads to perform non-TBB task work at higher priority. Doing so introduces a domain issue: how to migrate work requests between domains (in particular, starting/resuming TBB tasks). While this is not particularly difficult to do, it is not built into the architecture of TBB.
b) Add a thread priority boost at task termination, and a thread priority reduction at the start of lower-priority tasks (see the sketch below). This adds overhead in constantly readjusting thread priorities.
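A hedged sketch of option (b) on Linux, where setpriority() applied to a TID acts on a single thread; the RAII guard and the nice values are illustrative assumptions, not part of TBB:

#include <sys/resource.h>   // getpriority, setpriority
#include <sys/syscall.h>    // SYS_gettid
#include <unistd.h>         // syscall

// A low-priority task demotes whichever pool thread picked it up on entry,
// and restores the old nice value on exit. Note: restoring a more favorable
// nice value may require RLIMIT_NICE / CAP_SYS_NICE.
struct NiceGuard {
    int tid, saved;
    explicit NiceGuard(int lowNice)
        : tid((int)syscall(SYS_gettid)),          // no glibc gettid() wrapper then
          saved(getpriority(PRIO_PROCESS, tid)) {
        setpriority(PRIO_PROCESS, tid, lowNice);  // task start: reduce priority
    }
    ~NiceGuard() {
        setpriority(PRIO_PROCESS, tid, saved);    // task end: restore priority
    }
};

// Usage inside a low-priority task body:
// void LowPriorityTask() { NiceGuard g(10); /* ... do the work ... */ }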
BTW, in QuickThread this is a non-issue since it has two classes of threads (a higher-priority I/O class and a compute class) and task enqueues can choose a designated class.
Jim Dempsey
- Did you try to execute the application on another computer with a different CPU, or with a different Linux OS?
- How much RAM and VM (Virtual Memory) does it use?
- What about the size of the VM?
- It is not clear how big the data sets are.
Best regards,
Sergey
a) Your main() is instantiating 8 (NUMDEST) boost threads.
b) Each boost thread is, with respect to TBB, a concurrent "main()", although TBB will not be aware of any concurrency.
c) Each boost thread issues task_scheduler_init, which allocates a full set of threads (2x E5420 = 8 cores/HW threads), making 64 threads.
Consider using
task_scheduler_init init(3); // or 2 or 1
Keep in mind that using 3 will create 24 threads (23 + 1 for main), 16 of which will be performing I/O. If the I/O requests are the bottleneck, then you will NOT have to worry about "spinwaits". However, if the compute section is the bottleneck, then when the I/O tasks run out of tokens they will enter a spinwait waiting for an additional token, and this time will be wasted. You might consider reducing the number of TBB threads per boost thread (change init(3) to init(2)), or editing the TBB source to reduce the spinwait time (it appears to be hardwired; correct me if I am wrong).
***
A better route to take is to NOT use boost threads.
Stay within TBB.
Experiment with something like:
main()
{
    task_scheduler_init init(nHWthreads + pendingIoThreads); // ?? 8 + ?4 ??
    parallel_invoke(
        [&]() { doPipeline(0); },
        [&]() { doPipeline(1); },
        [&]() { doPipeline(2); },
        [&]() { doPipeline(3); },
        [&]() { doPipeline(4); },
        [&]() { doPipeline(5); },
        [&]() { doPipeline(6); },
        [&]() { doPipeline(7); });
}

void doPipeline(int n)
{
    Flows Flow;
    Flow.SetupFlow(n); // *** remove boost thread creation
    Flow.StartFlow();
}
Jim Dempsey
task_scheduler_init is a reference to a shared structure, so only 7 TBB worker threads will be added in total. The problem must be something else.
"A better route to take is to NOT use boostthreads."
That will not provide any required concurrency.
I don't see what would cause the problem, but I haven't really looked at pipeline for a while, and it has changed somewhat since then. There's a yield operation in there that may have something to do with it, but Vishal has kept the pace of trying things extremely slow so far even without blocking (if I may callously make a TBB joke here). The second experiment, after increasing the number of tokens, would be to keep that number at one and make the filters parallel, or apply the correct setting for each filter individually and try various numbers of tokens.
Further out, something may be tried with explicitly executing a filter on a specific thread.
Only afterwards would it seem useful to go deeper into the implementation to find out.
I just hope it's not some unavoidable consequence of keeping the arenas separate, which was done at least partially to avoid any entanglement between pipelines run from different threads that would destroy guaranteed concurrency.
Of course, if you don't have the time or inclination to do any of that, just use plain old threads, because I'm not sure this program is making any use of what TBB has to offer.
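On the "executing a filter on a specific thread" note above: TBB's classic tbb::pipeline interface does provide tbb::thread_bound_filter for this. A minimal sketch with trivial placeholder filters (not the application's real stages); run() must execute on a different thread than the one servicing the bound filter:

#include "tbb/pipeline.h"
#include "tbb/tbb_thread.h"
#include "tbb/task_scheduler_init.h"
#include <cstdio>

// Trivial input filter: hands out 8 integers, then NULL to end the stream.
class InputFilter : public tbb::filter {
    int data[8];
    int count;
public:
    InputFilter() : tbb::filter(tbb::filter::serial_in_order), count(0) {
        for (int i = 0; i < 8; ++i) data[i] = i;
    }
    void* operator()(void*) {
        return count < 8 ? (void*)&data[count++] : NULL;
    }
};

// Output filter bound to whichever thread calls process_item() on it.
class BoundOutput : public tbb::thread_bound_filter {
public:
    BoundOutput() : tbb::thread_bound_filter(tbb::filter::serial_in_order) {}
    void* operator()(void* item) {
        std::printf("%d\n", *(int*)item); // runs only on the servicing thread
        return NULL;
    }
};

static void RunPipeline(tbb::pipeline* p) { p->run(4 /* max live tokens */); }

int main() {
    tbb::task_scheduler_init init;
    InputFilter in;
    BoundOutput out;
    tbb::pipeline p;
    p.add_filter(in);
    p.add_filter(out);
    tbb::tbb_thread runner(RunPipeline, &p); // run the pipeline elsewhere
    while (out.process_item() != tbb::thread_bound_filter::end_of_stream)
        continue;                            // this thread executes BoundOutput
    runner.join();
    return 0;
}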