Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

CPU affinity of a pipeline

Vishal_Sharma
Beginner
3,096 Views

Hello,

I have implemented an application that uses the Intel TBB pipeline pattern for parallel processing on an Intel Xeon CPU E5420 @ 2.50GHz running RHEL 6.

The application basically consists of 8 pipelines. Each pipeline has one token (making it one thread per pipeline). Each pipeline receives data from an endpoint and processes it to completion. I ran this application and collected general exploration analysis data using VTune Amplifier. The profiler reported a high CPI in the finish_task_switch function of the vmlinux module, which suggests that the kernel is spending significant time performing context switches, adversely affecting the performance of the application.

What I would like to understand is why the kernel is performing so much context switching. Will each pipeline be scheduled on the same CPU? Is there a way to assign CPU affinity to each pipeline? How can I reduce this performance-impacting behavior? Please provide some optimization tips.

Thank you,

Vishal

36 Replies
jimdempseyatthecove
Honored Contributor III
>>Each pipeline receives data from an endpoint and processes it to completion.

Do you mean there is one end point for all 8 pipelines
or
8 end points, one for each of the 8 pipelines?

An example of the former (1 end point for all 8 pipelines) is reading a sequential file into a buffer, then processing the buffers 8 ways. In this case you should have 1 pipeline, with at least 8 tokens.

If you have the latter (separate end point per pipeline), then I suggest you eliminate the pipeline and run it as 8 separate tasks.

Jim Dempsey
Vishal_Sharma
Beginner
Thank you for the response.
It is 8 end points, one for each of the 8 pipelines.
Do I have to assign CPU affinity when running as 8 separate tasks? Will the scheduler always schedule the task on the same core?
Thank you,
Vishal
RafSchietekat
Valued Contributor III
You don't say anything about the filters in the pipeline (how many, what kind), and what happens if you increase the number of tokens (just 1 doesn't make for much of a pipeline)...
Vishal_Sharma
Beginner
There are three filters per pipeline: Input, Process, and Output. All the filters are serial. I have not tried increasing the tokens. The reason is that I want the processing to behave as if there were 8 threads, each working on data from its own endpoint.

Regards,
Vishal
RafSchietekat
Valued Contributor III
You cannot expect TBB to do its thing if you don't let it.
jimdempseyatthecove
Honored Contributor III
>>There are three filters per pipeline. Input, Process and Output.

Are Input and Output performing I/O?

If so, the fread and fwrite (or other R/W API) calls will introduce a thread delay which you would like to recover for productive use.

Each pipeline's tokens are separate from the other pipelines' tokens. Use at least 3 tokens per pipeline (5 would be better; more may exhibit diminishing returns). Three provides for each pipeline to have a working thread in the Process pipe while the input and output threads are blocked waiting for I/O completion.

Oversubscribe the thread pool in the init(...). The amount of oversubscription would depend upon the number of Reads and Writes concurrently blocked waiting for I/O completion. This is more of a tuning process than a fixed formula.

BTW, in QuickThread the Input and Output pipes would be scheduled on I/O class threads, thus avoiding oversubscription of the compute class threads. Also, the QuickThread parallel_pipeline can be set up to be NUMA as well as CPU affinity sensitive (see www.quickthreadprogramming.com).

Jim Dempsey
Vishal_Sharma
Beginner
Yes, the input and output are performing I/O.

Increasing the tokens per pipeline to 3 seems like a good experiment. Though I have questions about that.
1. If I increase the tokens to 3, then effectively I am increasing the thread count to 3*8=24. And if I have only 8 logical cores, won't this cause cache thrashing and degrade performance?
2. The input has to be processed in the order it is received. Having 3 tokens potentially means that there could be two tokens processing their own inputs at the same time. (Processing time and the sequence of steps to process each input depend on the kind of input.) Will the order of processing be maintained? I am new to this concept and do not understand it well enough.
In the meantime, I will start experimenting by increasing the tokens. I will also look into the Quick Threads.
Thank you,
Vishal Sharma.
jimdempseyatthecove
Honored Contributor III
>>increasing the thread count to 3 then effectively I am increasing the thread count to 3*8=24

No, potentially you are increasing the task count to 3*8=24 (to be performed by the available threads)

Note, IIF (if and only if) tasks (such as your input pipe and output pipe) experience thread stalls due to I/O (as they will with Read and Write), then (and only then) consider oversubscribing the thread count by the number of threads that could be (and typically are) stalled at any point in time. You may have to experiment with the over-subscription count to find the sweet spot.

>>2. The input has to be processed in the order it is received.

Filters have three modes:

a) parallel (any order)
b) serial_out_of_order (one at a time, no particular order)
c) serial_in_order (one at a time, in sequence)

Make each of your pipes (filter stages) serial_in_order.

This way, each stage can run concurrently in different threads, with the restriction of processing tokens in order received.

Tn = Thread n
Pn = Pipeline n

P0: T0 (Input, stalled at Read file R0), T1 (processing), T2 (output, stalled at Write file W0)
P1: T3 (Input, stalled at Read file R1), T4 (processing), T5 (output, stalled at Write file W1)
...
P7: T21 (Input, stalled at Read file R7), T22 (processing), T23 (output, stalled at Write file W7)

Notes,

Tokens of each pipeline circulate from output stage back to input stage.
Although 3 tokens per pipeline would be minimal, you may find it beneficial to use more, because at times you may experience high seek latencies performing I/O to 16 files. Experimentation will tell.
Also, latencies will change from system to system and disk drive placements of files (if you have that flexibility).

There is a difference between the TBB parallel_pipeline and QuickThread parallel_pipeline.

When a pipeline has an input filter (pipe segment), some number of processing filter(s), and an output filter, you can construct the pipeline such that the input pipe (receiving empty buffers) is run by a single I/O class thread, and the output pipe (the writing pipe), run by a single but different I/O class thread, writes the buffers in collated order (collation order specified by the input pipe). This permits you to effectively have sequential input, parallel (any order) internal processing, and sequential (collated) output. This does require that your "process" function(s) be thread safe. If this is not suitable, you can make the internal pipes serial.

If you have much work invested in TBB, I suggest you stick with TBB. If this is a new conversion of an application, then you could experiment with QuickThread (one word, no "s" at the end).

Jim Dempsey
Vishal_Sharma
Beginner
Jim,
Thank you for your response. Really appreciate your help.
One question - when a pipeline is created with 3 filters and the number of tokens is set to 1, does this mean that three tasks exist and they are being processed by three threads?
Thank you,
Vishal
RafSchietekat
Valued Contributor III
I'm not convinced yet.

I think I've been on the side of doing I/O and computation on the same core once before, myself, but if I remember well it was deemed of little value. I guess it depends on how much computation there really is. If the computation is trivial, then I would forget about pipelines, certainly ones where the first and the third filter are merely about I/O that might stall a thread. If the computation is CPU-intensive, then I would certainly want to know whether it makes a real difference to have the data in cache first. Even then I would prefer to dedicate some threads to I/O without involving TBB yet, sidestepping the oversubscription workaround, since it seems difficult to have a situation that's both CPU-bound and memory-bound. Then maybe you could have a stage to warm up the data (if you're so convinced it's an issue that you would dedicate the development time to experiment with that). But didn't somebody invent hyperthreading (which by any other name smells as sweet) specifically to keep CPUs busy while waiting for memory in one of their hardware threads, as long as the data motion doesn't become cumulative and degenerate into thrashing, which doesn't seem to be the case here?

If I'm mistaken, somebody please convince me otherwise.

P.S.: I think Jim and I basically gave the same advice about increasing the number of tokens (he less tersely so), so how about some feedback on the results of that?

(Edited: some trimming.)
jimdempseyatthecove
Honored Contributor III
>>When a pipeline is created with 3 filters and number of token is set to 1 - does this mean that three tasks exist and they are being processed by three threads?

Your question is incomplete; perhaps you are missing a critical point about Tasks, Software Threads, and Hardware Threads (and pipelines).

A pipeline with 3 filters essentially represents 3 tasks. A task does not run until all of the following hold:

a) it has been enqueued
b) a hardware thread is scheduled by the O/S to run a software thread (of TBB)
c) a software thread within the TBB thread pool takes the enqueued task request
(provided it is not busy running some other task)

A pipeline with 0 tokens (abstract picture for you):

P: (task 1 waiting for token), (task 2 waiting for token), (task 3 waiting for token)
-------------------------------------------------------
A pipeline with 1 token (abstract picture for you):

At T=1
P: task 1 (potentially) running (potentially stalled at Read), (task 2 waiting for token), (task 3 waiting for token)

At T=2
P: task 1 waiting for token, task 2 (potentially) running, task 3 waiting for token

At T=3
P: task 1 waiting for token, task 2 waiting for token, task 3 (potentially) running (potentially stalled at Write)

...back to T=1...

Note, "(potentially) running" means running iff a hardware thread is available .AND. a software thread is available (i.e. not running some other task).

With 1 token, only one of Input, Process, Output can (potentially) be running (or potentially stalled, in the case of the Input or Output tasks).

With 3 tokens (and after the first 2 have been read):

P: Task 1 (potentially) running (potentially stalled at Read), Task 2 (potentially) running, Task 3 (potentially) running (potentially stalled at Write).

With 3 tokens you could (potentially) have a Read in progress, concurrent with a Process, concurrent with a Write in progress.

The P: description above is but one of the possible states. You could potentially have

P: Task 1 waiting for token, Task 2 with two tokens in queue and one token (potentially) running, Task 3 waiting for token

By having more than 3 tokens, say 5, you could (potentially) have:

P: Task 1 (potentially) running (potentially stalled at Read), Task 2 with two tokens in queue and one token (potentially) running, Task 3 (potentially) running (potentially stalled at Write)


The actual experience will vary from the above, but the above should give you a better description of what may happen.

Your application (from your sketch) has 8 input files and 8 output files (potentially 16 I/Os in flight). Should your system have but 1 spindle (one disk), you will be experiencing large seek latencies. To reduce seek latencies you might consider having each file read several buffers in a row, send each buffer (token) to the Process pipe, then have each file write several buffers. The TBB parallel_pipeline is not configurable to do this (neither is the QuickThread parallel_pipeline); however, you can recode to do something like this sketch:

Input pipe:
    // Read 1 to 4 buffers (short reads on EOF)
    for(Token.nBuffers = 0; Token.nBuffers < 4; ++Token.nBuffers)
        if(Read(Token.buffer, Token.nBuffers)) break; // break on EOF

Process pipe:
    if(Token.nBuffers)
    {
        parallel_invoke( // or use parallel_for_each, or...
            [&](){ Process(Token.buffer, 0); },
            [&](){ if(Token.nBuffers > 1) Process(Token.buffer, 1); },
            [&](){ if(Token.nBuffers > 2) Process(Token.buffer, 2); },
            [&](){ if(Token.nBuffers > 3) Process(Token.buffer, 3); });
    }

Output pipe:
    for(int i = 0; i < Token.nBuffers; ++i) Write(Token.buffer, i);

Jim Dempsey
SergeyKostrov
Valued Contributor II
>>...why is the kernel performing high context switching?...

Did you try to check priorities of your main process and all threads?

There is a possibility that TBB "switched" priorities to a HIGH_PRIORITY_CLASS for the process
and THREAD_PRIORITY_HIGHEST for all threads. That is why there are so many context switches.

On Windows platforms, these Win32 API functions will get the priorities:

...
if( ::GetPriorityClass( ::GetCurrentProcess() ) == HIGH_PRIORITY_CLASS )
...
if( ::GetThreadPriority( ::GetCurrentThread() ) == THREAD_PRIORITY_HIGHEST )
...

Note: I just realized that you do the job in Linux...

At lower priorities, processes and threads will have fewer context switches and will spend more time
on processing your data.

Best regards,
Sergey
RafSchietekat
Valued Contributor III
So while Vishal still hasn't done a quick test to easily eliminate one possibility, while mainly Jim is spending a lot more of his time writing lengthy arguments, the problem may have been hiding elsewhere...

Then again, this (undocumented) priority boost is only performed "#if _MSC_VER", and the platform in question is Linux.
jimdempseyatthecove
Honored Contributor III
>>the problem may have been hiding elsewhere...

Messing around with thread priorities is often counter-productive. The natural tendency is "my program is more important than yours/someone else's", therefore, I set all my threads to highest priority. And all other programmers make the same decision.

What is not stated by Vishal is whether he is doing something he ought not to be doing (which is obvious once you know you have been bitten). Pseudo code:

main()
{
spawn 8 threads (pthreads, _beginthread, whatever)
{ // each thread
tbb::init(...
parallel_pipeline(...
}
join
}

In the above, you will be generating 8 TBB contexts. IOW you will be oversubscribed by 8x

Messing around with thread priorities "can be" productive in circumstances where you know your application has oversubscribed threads .AND. you know which threads need a boost. In TBB, tasks are not in control of which thread takes the task. Therefore you have no advance way of knowing which thread's priority to boost _prior_ to it taking the task request. This results in your only choice being to raise the priority of all your application's TBB thread pool threads, thus getting into a shoving match with other applications (or your own application, in the event you choose to oversubscribe).

In TBB, you can resolve this in two ways:

a) Use extra non-TBB threads to perform non-TBB task work at higher priority. Doing so introduces a domain issue and how to migrate work requests between domains (in particular start/resume TBB tasks). While this is not particularly difficult to do, it is not built in to the architecture of TBB.

b) Add thread priority boost at task termination, add thread priority reduction at task start for lower priority tasks. This adds overhead in constantly readjusting thread priorities.

BTW, in QuickThread this is a non-issue, since it has two classes of threads (higher priority I/O class and compute class) and task enqueues can choose a designated class.

Jim Dempsey
SergeyKostrov
Valued Contributor II
>>...Please provide some optimization tips...

- Did you try to execute the application on another computer with a different CPU, or with a different Linux OS?

- How much RAM and VM ( Virtual Memory ) does it use?

- What about the size of the VM?

- It is not clear how big the data sets are.

Best regards,
Sergey
Vishal_Sharma
Beginner
Hello,
Sorry for not being able to perform the test by changing the token count to 3 or 5. I was out sick. I am back this morning and will try that today and post the results.
I appreciate all your comments and help. It is really helpful as I learn these new concepts.
Thank you,
Vishal
Vishal_Sharma
Beginner
Jim/et al,
Here is my pseudo code, if it helps in understanding my implementation:
Here are my definitions of the Input, Process and Output classes. Each of these is derived from the tbb::filter class.
class Input : public tbb::filter {
public:
Input(bool serialInput, int numtokens, int destid=0);
virtual ~Input();
void* operator()(void*);
protected:
void AllocateDataHandles(int tokens);
private:
int tokens;
int next;
int DestID;
vector flowData;
};
class Process : public tbb::filter {
public:
Process(bool serialInput);
virtual ~Process();
void* operator()(void * rawdata);
};
class Output : public tbb::filter {
public:
Output(bool serialInput);
virtual ~Output();
void* operator()(void* processeddata);
};
Here is my flow thread that instantiates the Input, Process, Output filters and pipeline.
class Flows {
public:
Flows();
virtual ~Flows();
void SetupFlow(int id) {
_input = new Input(true, 1, id);
_process = new Process(true);
_output = new Output(true);
_pipeline.add_filter(*_input);
_pipeline.add_filter(*_process);
_pipeline.add_filter(*_output);
assert(!_thread);
_thread = boost::shared_ptr<boost::thread>(new boost::thread(boost::bind(&Flows::StartFlow, this)));
assert(_thread);
}
void StartFlow() {
int numThrds = 1;
task_scheduler_init init();
_pipeline.run(numThrds);
}
void StopFlow();
private:
boost::shared_ptr<boost::thread> _thread;
pipeline _pipeline;
filter* _input;
filter* _process;
filter* _output;
};
#define NUMDEST 8
int main() {
vector<Flows*> flows;
Flows* flow;
for (int i = 0; i < NUMDEST; i++) {
flow = new Flows();
flow->SetupFlow(i);
flows.push_back(flow);
}
...
...
...
return 0;
}
As you can see, I am using the default constructor to initialize the task scheduler. No thread counts are given. I am assuming that the number of tokens per pipeline will dictate the number of threads that will exist.

Please tell me if I am doing anything which is not supposed to be done when using the pipeline pattern in TBB.

Appreciate your comments.

Thank you,

Vishal
Vishal_Sharma
Beginner
I have changed the number of tokens to 3. With this, I ran the application and started analysis using VTune. This time I noticed a high CPI rate, retire stalls, instruction starvation, and execution stalls in TBB Scheduler Internals, from the function:
tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::receive_or_steal_task(long&, bool)

Also, thread_return function of Linux Kernel reported high instruction starvation.
Regards,
Vishal
jimdempseyatthecove
Honored Contributor III
Here is your problem:

a) Your main() is instantiating 8 (NUMDEST) boost threads.

b) Each boost thread is, with respect to TBB, a concurrent "main()", although TBB will not be aware of any concurrency.

c) Each boost thread issues task_scheduler_init, which allocates a full set of threads (2x E5420 = 8 cores/hw threads), making 64 threads.

Consider using

task_scheduler_init init(3); // or 2 or 1

Keep in mind that using 3 will create 24 threads (23 + 1 for main), 16 of which will be performing I/O. If the I/O requests are the bottleneck, then you will NOT have to worry about "spinwaits". However, if the compute section is the bottleneck, then when the I/O tasks run out of tokens they will enter a spinwait waiting for an additional token, and this time will be wasted. You might consider reducing the number of TBB threads per boost thread (change init(3) to init(2)), or editing the TBB source to reduce the spinwait time (it appears to be hardwired; correct me if I am wrong).

***
A better route to take is to NOT use boost threads.

Stay within TBB.

Experiment with something like:

main()
{
task_scheduler_init init(nHWthreads + pendingIoThreads); // ?? 8 + ?4 ??
parallel_invoke(
[&]() {doPipeline(0); },
[&]() {doPipeline(1); },
[&]() {doPipeline(2); },
[&]() {doPipeline(3); },
[&]() {doPipeline(4); },
[&]() {doPipeline(5); },
[&]() {doPipeline(6); },
[&]() {doPipeline(7); });
}

void doPipeline(int n)
{
Flows Flow;
Flow.SetupFlow(n); // *** remove boost thread creation
Flow.StartFlow();
}

Jim Dempsey
RafSchietekat
Valued Contributor III
"Each boost thread issues task_scheduler_init, which allocates a full set of threads. (2x E5420 = 8 cores/hw threads). Making 64 threads."
task_scheduler_init is a reference to a shared structure, so only 7 TBB worker threads will be added in total. The problem must be something else.

"A better route to take is to NOT use boostthreads."
That will not provide any required concurrency.

I don't see what would cause the problem, but I haven't really looked at pipeline for a while, and it has changed somewhat since then. There's a yield operation in there that may have something to do with it, but Vishal has kept the pace of trying things extremely slow so far even without blocking (if I may callously make a TBB joke here). The second experiment, after increasing the number of tokens, would be to keep that number at one and make the filters parallel, or apply the correct setting for each filter individually and try various numbers of tokens.

Further out, something may be tried with explicitly executing a filter on a specific thread.

Only afterwards would it seem useful to go deeper into the implementation to find out.

I just hope it's not some unavoidable consequence of keeping the arenas separate, which was done at least partially to avoid any entanglement between pipelines run from different threads that would destroy guaranteed concurrency.

Of course, if you don't have the time or inclination to do any of that, just use plain old threads, because I'm not sure this program is making any use of what TBB has to offer.