Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

threading problem

david_ch_lin
Beginner
817 Views
Hi,
A procedure in my program is multi-threaded.
Although the elapsed time of that procedure decreased, the total elapsed time got longer.
When I use numactl --physcpubind=... to run my program, the total elapsed time in multi-threaded mode really is shorter.
When analyzing with Amplifier, I found that the number of LLC cache misses increased,
but this happens even if my multi-threaded procedure does nothing.
(I substituted a dummy procedure for the multi-threaded one.)
My machine is equipped with two Xeon E5640 CPUs, and my OS is Red Hat Enterprise Linux 5.5, kernel 2.6.18-194.el5.
My machine is isolated, so I am sure no other heavy jobs are running concurrently.
Any idea what the problem is?
Thanks a lot.
19 Replies
gaston-hillar
Valued Contributor I
Hey David,
If you provide more information about your program, it will be easier to help. You might start with the programming language, an example of the code, and the compiler.
It is really difficult to provide serious feedback with the information in your question.
Cheers,
Gaston
david_ch_lin
Beginner
Hi, Gaston,

Thanks for your feedback.
I know I provided very little information.
However, it is not easy to describe my program in full detail, so I will try to describe the scenario.
Basically, my program flow is like :
while (in some condition) {
    procA();
    procB();
    procC();
}
and I want to multi-thread procB while procA and procC are executed in the main thread only.
I have a thread pool whose threads are created at the very beginning of the program (before entering the while loop).
Each of them (except the main thread) sits in an infinite loop waiting for tasks.
The weird thing is that although multi-threading shortens procB's elapsed time,
the other parts take longer.
That is, before multi-threading, procA takes 200 sec, procB takes 100 sec, and procC takes 200 sec => total 500 sec.
After multi-threading (4 threads), procA takes 220 sec, procB takes 60 sec, and procC takes 230 sec => total 510 sec.
As I said, the machine is isolated, so I am sure there are no other heavy jobs running concurrently.
I also tried pinning the CPUs for the test run with "numactl --physcpubind=5-8 ...", and the result is good:
procA and procC change only slightly, and the total elapsed time is reduced (thanks to procB's shorter elapsed time).
Is there any explanation for this?
Cache misses? But all procB does is read some data and output a new value for the next iteration. (That should not make the other parts of the program suffer so much?!)
Or is the portion of procB too small, so that the multi-threading overhead eats up the gain?
Any recommendations, or even guesses, are welcome.
Thanks.
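For concreteness, the structure described above might look like the following sketch. All names and workloads are illustrative, and workers are spawned per procB call here for brevity, whereas the real program keeps a persistent pool:

```cpp
#include <cstddef>
#include <iterator>
#include <numeric>
#include <thread>
#include <vector>

// Illustrative stand-ins for the real procedures and data.
static long data[1024];

void procA() { for (long &v : data) v += 1; }   // main thread only
void procC() { for (long &v : data) v += 1; }   // main thread only

// procB is the only parallelized step: each worker handles one slice.
void procB(int nthreads) {
    std::vector<std::thread> workers;
    std::size_t chunk = std::size(data) / nthreads;
    for (int t = 0; t < nthreads; ++t)
        workers.emplace_back([t, chunk] {
            for (std::size_t i = t * chunk; i < (t + 1) * chunk; ++i)
                data[i] += 2;
        });
    for (auto &w : workers) w.join();
}

long run(int iterations, int nthreads) {
    for (int it = 0; it < iterations; ++it) {
        procA();            // serial
        procB(nthreads);    // parallel
        procC();            // serial
    }
    return std::accumulate(std::begin(data), std::end(data), 0L);
}
```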
jimdempseyatthecove
Honored Contributor III
Sounds like the wait loops in your thread-pool threads are causing the slowdown of procA and procC.

Have you tried using _mm_pause() (or the asm version of the PAUSE instruction) in your wait loop?

If you are on a system whose CPU clock varies with temperature, then the way your wait loop is written may be generating too much heat (causing the CPU clock to slow down).

An alternate cause is excessive use of interlocked operations (i.e., instructions containing the LOCK prefix). These can cause increased latencies.
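A minimal sketch of a wait loop with _mm_pause() in the spin (flag names are made up, and this is x86-specific since PAUSE is an x86 instruction):

```cpp
#include <atomic>
#include <thread>
#include <immintrin.h>   // _mm_pause: the x86 PAUSE instruction

std::atomic<bool> task_ready{false};
std::atomic<bool> task_done{false};

void worker_wait_loop() {
    // PAUSE hints to the core that this is a spin-wait, reducing power
    // draw and avoiding a memory-order pipeline flush on loop exit.
    while (!task_ready.load(std::memory_order_acquire))
        _mm_pause();
    task_done.store(true, std::memory_order_release);
}

bool demo() {
    std::thread t(worker_wait_loop);
    task_ready.store(true, std::memory_order_release);
    t.join();
    return task_done.load();
}
```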

Jim Dempsey
david_ch_lin
Beginner
Hi,
I already tried an empty multi-threaded procedure,
i.e., the whole program runs single-threaded except that at some points a multi-threaded procedure is called which actually does nothing.
The overall performance is fine, with only slight multi-threading overhead => the wait loops may not be the root cause.
Besides, I turned off Turbo Boost, SpeedStep, and C-state control in the BIOS, so all CPUs' clocks should be the same.
One clue may be helpful: when I use numactl --physcpubind=4-7 ..., the performance is very good (memory accesses should all be local). If I use numactl --physcpubind=2-5 ..., the performance degrades as I described (the threads span the two CPUs and may make remote memory accesses).
So, is the problem memory bandwidth?
Maybe that part (procB) is really not a good fit for multi-threading.
jabbirbasha
Beginner
Hi,
I think the wait loop is causing the slowdown.
david_ch_lin
Beginner
Hi, all,
I replaced the wait loop with a sem_wait()/sem_post() mechanism, but the slowdown is the same.
Basically, I have 8 cores; if the thread count does not exceed the core count, each thread will be assigned
to its own core, right? So I do not think the pool should interfere with the main thread's performance.
Correct me if I am wrong. Thanks.

jimdempseyatthecove
Honored Contributor III
I think you oversimplified your description, which has confused the readers of your post.

while (in some condition) {
    procA();
    procB();
    procC();
}
and I want to multi-thread procB while procA and procC are executed in the main thread only.

This would lead us to believe (à la TBB-speak):

while (in some condition) {
    parallel_invoke(
        [&](){ procA(); procC(); },  // group A,C
        [&](){ procB(); } );         // group B
}

Then you say:
But what procB does is read some data and output a new value for the next iteration.

This would seem to imply (require) that procA must ensure the prior iteration's procB has completed, and that procB may have to wait until procA and/or procC has captured the current iteration's data. Do you have waits covering these conditions?
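For readers without TBB, the parallel_invoke sketch above is roughly equivalent to this std::thread version (the procedure bodies are trivial stand-ins, not the poster's code):

```cpp
#include <atomic>
#include <thread>

// Trivial stand-ins for the three procedures (illustrative only).
std::atomic<int> calls{0};
void procA() { calls.fetch_add(1); }
void procB() { calls.fetch_add(1); }
void procC() { calls.fetch_add(1); }

// std::thread equivalent of the parallel_invoke reading: group {A, C}
// runs on the calling thread while group {B} runs concurrently.
int one_iteration() {
    std::thread groupB(procB);
    procA();
    procC();
    groupB.join();          // both groups complete before the next iteration
    return calls.load();
}
```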

Jim Dempsey
david_ch_lin
Beginner
Hi,

ok, let me clarify something.
My program was originally sequential: procA(), then procB(), then procC(),
all inside a loop executed again and again.
For various reasons, we decided to parallelize only procB;
the sequence of the procedures stays the same.
Therefore, the threads in the thread pool are idle while the program is in procA and procC.
On entering procB, the main thread first divides a bunch of tasks into several groups
and uses sem_post() to wake up the threads in the thread pool.
Each thread handles some groups of tasks and then idles again (using sem_wait()) after all tasks are done.
Let me know if my description still confuses you.
Thanks.
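A minimal sketch of the wake-up scheme described above, using POSIX sem_wait()/sem_post() with a persistent pool (thread counts, group counts, and the per-group "work" are all illustrative):

```cpp
#include <semaphore.h>   // POSIX semaphores, as in sem_wait()/sem_post() above

#include <atomic>
#include <thread>
#include <vector>

constexpr int kWorkers = 4;
constexpr int kGroups  = 8;

sem_t work_sem;                        // workers idle here between procB calls
sem_t done_sem;                        // main waits here until workers finish
std::atomic<int>  next_group{0};
std::atomic<bool> shutting_down{false};
std::atomic<long> total{0};

void worker() {
    for (;;) {
        sem_wait(&work_sem);           // idle until the main thread posts work
        if (shutting_down.load()) return;
        int g;
        while ((g = next_group.fetch_add(1)) < kGroups)
            total.fetch_add(g);        // "handle some groups of tasks"
        sem_post(&done_sem);           // report this worker's groups are done
    }
}

long run_procB_once() {
    sem_init(&work_sem, 0, 0);
    sem_init(&done_sem, 0, 0);
    std::vector<std::thread> pool;
    for (int i = 0; i < kWorkers; ++i) pool.emplace_back(worker);

    for (int i = 0; i < kWorkers; ++i) sem_post(&work_sem);  // wake the pool
    for (int i = 0; i < kWorkers; ++i) sem_wait(&done_sem);  // all groups done

    shutting_down.store(true);                               // tear down
    for (int i = 0; i < kWorkers; ++i) sem_post(&work_sem);
    for (auto &t : pool) t.join();
    sem_destroy(&work_sem);
    sem_destroy(&done_sem);
    return total.load();               // sum of group ids 0..7
}
```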
jimdempseyatthecove
Honored Contributor III

OK

you had

ProcA();
ProcB();
ProcC();

all serial with

"That is, before multi-threading, procA takes 200 sec, procB takes 100 sec, and procC takes 200 sec => total 500 sec"


Then you modified ProcB only and observed:

"After multi-threading (4 threads), procA takes 220 sec, procB takes 60 sec, and procC takes 230 sec => total 510 sec"

Actually, you modified more than just procB, since you also said you start your thread pool before the loop.

This indicates your auxiliary threads are somehow gumming up your performance. To diagnose where the overhead originates, leave in the code that starts your thread pool but replace your parallel procB with your serial procB.

IOW: start your thread pool, enter your loop executing your serial code, exit the loop, take timings. My guess is you will see a slowdown in all three procs. That would indicate your threads' idle state isn't idle, and you will have to look at your code to see why.

Jim Dempsey

david_ch_lin
Beginner
Hi,
I tested the program the way you suggested.
There was NO slowdown in any part.
jimdempseyatthecove
Honored Contributor III
Here is a guess:

Many people have seen an example of pthread programming where the main thread creates one worker thread per available core. Usually the threads are started in the run state, but they can be started in the wait state. This group of threads is the work group (task group). The main thread's duties then degrade to waiting for the work group to finish (if started running), or to waking the work group up and then waiting for it to finish. IOW, the main thread does not participate as a member of the work group.

The better route is for the main thread to create one fewer thread for the work group and then participate in each session the work group works on. By programming this way you have a better probability of keeping data in cache, as well as one less thread to start and wait for.

Whether this adds up to the ~10% overhead you are observing, I cannot say.
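A sketch of the suggested pattern, with the main thread participating as a working member of the group (the summing workload is a made-up example):

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// The main thread creates nthreads - 1 workers and processes the last
// slice itself, instead of only waking workers and waiting for them.
long parallel_sum(const std::vector<long> &v, unsigned nthreads) {
    std::atomic<long> total{0};
    std::size_t chunk = v.size() / nthreads;
    auto work = [&](std::size_t begin, std::size_t end) {
        long s = 0;
        for (std::size_t i = begin; i < end; ++i) s += v[i];
        total.fetch_add(s);
    };
    std::vector<std::thread> group;
    for (unsigned t = 0; t + 1 < nthreads; ++t)      // one fewer thread created
        group.emplace_back(work, t * chunk, (t + 1) * chunk);
    work((nthreads - 1) * chunk, v.size());          // main participates too
    for (auto &t : group) t.join();
    return total.load();
}
```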

Jim Dempsey
david_ch_lin
Beginner
Hi,

Actually, that is what we do:
our main thread does participate in the work group.

I tried parallelizing another part of my program and found that the situation I described before disappeared.
The difference is that this part is very independent.
(The reason this part was not originally chosen for parallelism is that it is not critical.)
Back to my original problem: I think it is a cache problem.
After digging into the program, I found that procA and procB both access (read and write) the same area of data.
procA needs to reload those data because they are modified by procB.
Although procB's elapsed time is shorter, that is only because it is multi-threaded.
And I think this is why numactl --physcpubind=4-7 helps but numactl --physcpubind=2-5 does not.
jimdempseyatthecove
Honored Contributor III
Here is a potential solution to your slowdown problem.

Write procB such that it divides the work up so that the main thread (the one that runs procA and procC) gets less of it.
Have the other threads, which run the longer portions of procB, each write a completion flag for their section of the shared data. The main thread, when it finishes its shorter section of procB, enters a poll loop waiting for the other threads to complete. As the main thread notices each completion (in whatever order they occur), it reads the completed data from that worker thread (once), and this repeats until all the additional threads complete (or, as an experiment, exit the loop when one thread remains).

What this does is attempt to re-populate the main thread's cache before procC runs. This repopulation overhead occurs while some of the remaining threads are still working on their sections of procB. You might see a net gain in performance.

Hint: if even partitioning (except for the main thread's smaller partition) yields meager returns, then try unequal partitioning, giving the additional worker threads different amounts of work so that they finish at different times (with the time difference approximately equal to the time the main thread needs to repopulate its cache from that thread's data).
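The completion-flag scheme might be sketched like this (section counts and workloads are illustrative; the main thread reads each section exactly once, as it completes):

```cpp
#include <array>
#include <atomic>
#include <thread>
#include <vector>

constexpr int kSections = 3;
std::array<std::atomic<bool>, kSections> done{};
std::array<long, kSections> section{};

void worker(int s) {
    section[s] = (s + 1) * 10;                       // the worker's share of procB
    done[s].store(true, std::memory_order_release);  // raise the completion flag
}

long main_thread_procB() {
    std::vector<std::thread> workers;
    for (int s = 0; s < kSections; ++s) workers.emplace_back(worker, s);

    // Main thread finished its shorter share; now poll for completions,
    // consuming each section in whatever order it finishes.
    long sum = 0;
    std::array<bool, kSections> consumed{};
    int remaining = kSections;
    while (remaining > 0) {
        for (int s = 0; s < kSections; ++s) {
            if (!consumed[s] && done[s].load(std::memory_order_acquire)) {
                sum += section[s];    // read once: warms the main thread's cache
                consumed[s] = true;
                --remaining;
            }
        }
    }
    for (auto &t : workers) t.join();
    return sum;                       // the data is now cache-warm for procC
}
```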

Jim Dempsey
rudaho
Beginner
Hi Jim~

Your method sounds interesting, but I have some concerns about it, so let me first restate what you said to make sure I get the picture. If I misunderstand your concept, please correct me.

original:
procA(); /* run in main thread */
procB();
procC(); /* run in main thread */

You want to divide procB() into small tasks, say procB1(), procB2(), procB3() and procB4(), which can be performed simultaneously. Let procB1() have the smallest load, i.e., one thread can finish procB1() more quickly than the other tasks.

Then the flow will be:
1) The main thread performs procA().
2) The main thread divides procB() into procB1() through procB4() and puts procB2(), procB3() and procB4() into the pool for the other threads to execute.
3) The main thread performs procB1() => meanwhile, the other threads perform procB2() ~ procB4().
4) The main thread finishes procB1() much earlier than the other threads.
5) As each other thread finishes its task (procB2() ~ procB4()), the main thread touches the data from that task.
6) Once all the small tasks are done, the main thread performs procC().

The idea is to prefetch the data into the main thread's cache for use in procC(). Is that what you mean?

If so, my question is this: since procB1() occupies a smaller portion of the total task, the overall elapsed time of procB() will be longer (since procB2(), procB3() and procB4() take longer). And after the other tasks finish, the main thread still needs to prefetch their data into its cache, so there is an overhead of touching the data after all the threaded work is done. Do you think this overhead is insignificant, and that total performance can still improve through the reduced cache misses? Thanks...

BR

Yi-Ju



david_ch_lin
Beginner
Hi,

I think Jim's idea is helpful.
Although the scalability in multi-threaded mode is not good, it really does reduce something.
For Yi-Ju's question, I have some explanations:
1. The number of groups (say 20) is larger than the number of threads (say 4).
2. The main thread picks up only one group, then enters the state where it re-populates the cache data.
3. Since the workload is even across groups, the main thread should have work to do immediately.
4. The re-population work runs in parallel. procB() takes longer to finish, but still less than in serial mode.
Another question: is there any lighter-weight re-population method that would reduce the main thread's overhead? (I currently use bcopy().)
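As an aside on the copy primitive itself (not a fix for the slowdown): bcopy() is a legacy BSD interface that POSIX has deprecated; std::memcpy/std::memmove are the standard replacements, e.g.:

```cpp
#include <cstddef>
#include <cstring>

// bcopy(src, dst, n) is legacy; std::memcpy(dst, src, n) is the standard
// replacement (note the swapped argument order), and std::memmove handles
// buffers that may overlap. Function and types here are illustrative.
void copy_group(const double *src, double *dst, std::size_t n) {
    std::memcpy(dst, src, n * sizeof(double));
}
```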


BR,
David
eswari
Beginner
This is a really nice post. Thanks for the great information.
jimdempseyatthecove
Honored Contributor III
David,

Before you look at repopulation, you have to consider a few things. First, you earlier stated:

"That is, before multi-threading, procA takes 200 sec, procB takes 100 sec, and procC takes 200 sec => total 500 sec"

So you are attempting to optimize 1/5th of the entire process: a lot of effort for, at best, a fraction of that 20% of the run time.

A better route may be to look at procA and procC for opportunities that fit in with your parallelization efforts.
Many serial tasks:

file reads
file writes
temporally dependent computations

can be chunked sequentially, with the chunks then passed on for parallel processing.

If your procA and procC are candidates for chunking (note that you can add chunk-splicing information to each chunk if need be, e.g. to pass array(chunkBegin-1)), then a good choice is a parallel_pipeline.

Intel Threading Building Blocks (TBB) and my product QuickThread (www.quickthreadprogramming.com) both have a parallel_pipeline capable of supporting a sequential pipe (your procA), a parallel pipe (your procB), and a collated sequential pipe (your procC). A sketch in QuickThread:

// variants on your procA, procB, procC
// to work with your chunking objects
void procA_buffered(yourIOcontext* io, yourBuffer* b); // serial pipe
void procB_buffered(yourBuffer* b);                    // parallel pipe
void procC_buffered(yourIOcontext* io, yourBuffer* b); // serial (and collated) pipe

int main(int argc, char* argv[])
{
    init(argc, argv);
    qtInit(-1, 2); // all hw threads, 2 I/O threads
    yourPipelineIOcontext pio;
    qtPipeline pipeline;
    pipeline.addPipe(procA_buffered); // derivation of procA to use chunks
    pipeline.addPipe(procB_buffered); // derivation of procB to use chunks
    pipeline.addPipe(procC_buffered); // derivation of procC to use chunks
    if(!pio.openInput(argv[1])) return -1;
    if(!pio.openOutput(argv[2])) return -2;
    // now run the pipeline
    if(!pipeline.run(&pio)) return -3;
    pipeline.WaitTillDone();
    if(!pio.closeOutput()) return -4;
    return 0;
}

If your procA and procC (the serial processes) are suitable for pipelining, then 100% (less two chunks) of the sum of procA, procB and procC can be parallelized. E.g., if a chunk represents 0.5% of the data, then 99% of the processing can be done in parallel. Note: try to have procA and procC do as little work as possible (since each will be processed by one thread).

Jim Dempsey
jose-jesus-ambriz-me
I would suggest that:
best regards
jam
jimdempseyatthecove
Honored Contributor III
Quoting jerry0503222
I replaced the wait loop with a sem_wait()/sem_post() mechanism, but the slowdown is the same.
Basically, I have 8 cores; if the thread count does not exceed the core count, each thread will be assigned
to its own core, right? So I do not think the pool should interfere with the main thread's performance.
Correct me if I am wrong. Thanks.

If your application does not also pin its threads' affinities, each to one "logical processor" (hardware thread), then the threads are subject to migration by the operating system. Your main thread's wait loop can wait as you have done, or it can wait on each handle in its list of threads to complete; there should be minimal overhead either way. Should you see loss of some portion of your computation time using test loads consisting entirely of L1-cacheable data and integer instructions only, then this is indicative of the operating system having launched a process (threads) to do some other work. Floating-point instructions have the characteristic that each core's floating-point capability serves one hardware thread (HT) at a time (IOW, HT siblings share their core's FP silicon).

As for candidates: a profiler, the OOM killer, an Internet connection, etc...
Look around for a utility that can report on all activity (of course, running it will temporarily add load to your system).
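For reference, pinning a thread on Linux might look like this sketch (pthread_setaffinity_np is a non-portable GNU extension; picking the first allowed CPU is an illustrative choice):

```cpp
#include <pthread.h>
#include <sched.h>

// Linux-specific: pin the calling thread to the first logical processor
// the OS allows it on, so the scheduler cannot migrate it afterwards.
bool pin_current_thread() {
    cpu_set_t allowed;
    CPU_ZERO(&allowed);
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0) return false;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (!CPU_ISSET(cpu, &allowed)) continue;
        cpu_set_t one;
        CPU_ZERO(&one);
        CPU_SET(cpu, &one);
        return pthread_setaffinity_np(pthread_self(), sizeof(one), &one) == 0;
    }
    return false;
}
```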

Jim Dempsey