Zhongze_L_
Beginner

Why does the multithreaded code degrade performance?

There are two matrix classes, mat and pmat. Here is a code fragment:

class mat {
public:
    void lu();
    // ...
};

class pmat {
public:
    mat **obMatPtr;   // one mat per block

    void operator()(const blocked_range<int>& r) const {
        for (int i = r.begin(); i != r.end(); ++i) {
            mat *loMatPtr = obMatPtr[i];
            loMatPtr->lu();
        }
    }
    // ...
};

pmat loPmat;

void parallel_lu(...)
{
    parallel_for(blocked_range<int>(0, nblocks), loPmat, auto_partitioner());
}

The code works correctly. At first, I ran the program with one thread on a dual-core machine
(tbb::task_scheduler_init init(deferred), ..., init.initialize(1)). The execution time for performing lu on
block 0 was 118 seconds. The wall-clock time rose to 179 seconds when I ran it with two threads
(init.initialize(2)).
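A minimal sketch of that scheduler setup (simplified; the rest of the program is omitted):

#include "tbb/task_scheduler_init.h"

int main() {
    // Create the scheduler in deferred mode, then pick the number of
    // worker threads at run time.
    tbb::task_scheduler_init init(tbb::task_scheduler_init::deferred);
    init.initialize(1);    // first run: one thread
    // init.initialize(2); // second run: two threads

    // ... call parallel_lu(...) here ...
    return 0;
}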

Where did the additional 61 seconds come from? I measured the time for just the lu computation on a specific block,
that is, the time for executing loMatPtr->lu(). It should be the same no matter how many physical threads
are available. I also think it has nothing to do with the overhead of thread creation or the implicit
synchronization at the end of parallel_for.
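A simplified sketch of that per-block measurement, assuming the operator() body from the fragment above and timing each call with tbb::tick_count (the printf is only for illustration):

#include <cstdio>
#include "tbb/blocked_range.h"
#include "tbb/tick_count.h"

void pmat::operator()(const tbb::blocked_range<int>& r) const {
    for (int i = r.begin(); i != r.end(); ++i) {
        tbb::tick_count t0 = tbb::tick_count::now();
        obMatPtr[i]->lu();                           // factor one block
        tbb::tick_count t1 = tbb::tick_count::now();
        std::printf("block %d: %.3f s\n", i, (t1 - t0).seconds());
    }
}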

Could anybody tell me the reason and how to improve the performance?


Dmitry_Vyukov
Valued Contributor I

A possible cause of the degradation is false sharing. Try padding your structure like this:

class mat {
    void lu();
    // ...
    char pad[128];
};
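Alternatively, if the mat blocks are heap-allocated, you can give each one cache-line-aligned storage with tbb::cache_aligned_allocator instead of manual padding. A sketch (make_block/destroy_block are just illustrative helpers):

#include <new>
#include "tbb/cache_aligned_allocator.h"

// Each block gets its own cache-line-aligned, padded allocation, so two
// blocks can never share a cache line.
static tbb::cache_aligned_allocator<mat> block_alloc;

mat* make_block() {
    mat* p = block_alloc.allocate(1);  // aligned raw storage for one mat
    return new (p) mat();              // construct in place
}

void destroy_block(mat* p) {
    p->~mat();
    block_alloc.deallocate(p, 1);
}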

Zhongze_L_
Beginner

I modified the code as per your suggestion. Unfortunately, there was no improvement. Lots of malloc calls are made inside mat.lu(); I guess that could be
the problem. I used both scalable_malloc from TBB and a lock-free malloc library, Hoard. I also used memalign to avoid false sharing. But the performance was still the same.
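The substitution looked roughly like this (the temporary buffer and the block size n are just stand-ins for whatever lu() really allocates):

#include "tbb/scalable_allocator.h"

// Temporary buffers inside lu() come from the TBB scalable allocator
// (per-thread memory pools) instead of the global malloc.
// n is a stand-in for the block dimension.
void mat::lu() {
    double* tmp = static_cast<double*>(scalable_malloc(n * n * sizeof(double)));
    // ... factorization work using tmp ...
    scalable_free(tmp);
}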

Quoting - Dmitriy Vyukov
A possible cause of the degradation is false sharing. Try padding your structure like this:

class mat {
    void lu();
    // ...
    char pad[128];
};


Dmitry_Vyukov
Valued Contributor I

You may try the following brute-force approach. Run the single-threaded version under a profiler. Run the multi-threaded version under a profiler. Compare the profiles. Identify which parts of the code execute longer in the multi-threaded version.
For example:
single-threaded version:
func1() - 40%
func2() - 30%
func3() - 30%

multi-threaded version:
func1() - 80%
func2() - 10%
func3() - 10%

The problem is definitely in func1().

Once you have identified the problematic function, drill down to the machine code level.
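If a profiler is not handy, a crude sketch of the same idea is to accumulate the wall time spent in a suspect routine across all threads and compare the totals between the one-thread and two-thread runs (func1 here is only a placeholder):

#include <cstdio>
#include "tbb/tick_count.h"
#include "tbb/spin_mutex.h"

void func1();                            // placeholder for the suspect routine

static double g_func1_seconds = 0.0;     // total wall time spent in func1
static tbb::spin_mutex g_timer_mutex;

void func1_timed() {
    tbb::tick_count t0 = tbb::tick_count::now();
    func1();
    double dt = (tbb::tick_count::now() - t0).seconds();
    tbb::spin_mutex::scoped_lock lock(g_timer_mutex);
    g_func1_seconds += dt;               // print the total at program exit
}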
softarts
Beginner

As you mentioned, mat.lu() accesses a shared resource. That is why the parallel tasks take so much longer: they have to wait for other tasks to finish with it, no matter whether the algorithm is lock-free or lock-based.
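One way around that (only a sketch, since the internals of lu() are not shown): give every block its own preallocated workspace before parallel_for runs, so lu() never touches the allocator inside the parallel region. The set_scratch() method and workspace_size are hypothetical.

// Hypothetical change: lu() works in an externally owned scratch buffer
// instead of calling malloc/free itself.
class mat {
public:
    mat() : scratch(0) {}
    void set_scratch(double* buf) { scratch = buf; }  // hypothetical setter
    void lu();                                        // uses this->scratch
private:
    double* scratch;
};

// Serial setup before the parallel loop: one workspace per block.
// for (int i = 0; i < nblocks; ++i)
//     obMatPtr[i]->set_scratch(new double[workspace_size]);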