Intel® oneAPI Threading Building Blocks

Why does the multithreaded code degrade performance?

Zhongze_L_
Beginner
There are two matrix classes, mat and pmat. The following is the code fragment.

class mat {
public:
    void lu();
    ...
};

class pmat {
public:
    mat **obMatPtr;  // one pointer per block

    // Body for parallel_for: factorizes every block in the sub-range r.
    void operator()(const blocked_range<int> &r) const {
        for (int i = r.begin(); i != r.end(); ++i)
        {
            mat *loMatPtr = obMatPtr[i];
            loMatPtr->lu();
        }
    }
    ...
};

pmat loPmat;
parallel_lu(...)
{
    parallel_for(blocked_range<int>(0, nblocks), loPmat, auto_partitioner());
}

The code worked correctly. At first, I ran the program with one thread on a dual-core machine
(tbb::task_scheduler_init init(deferred),.., init.initialize(1)). The execution time for performing lu on
block 0 was 118 seconds. The wall-clock time became 179 seconds when I ran it with two threads
(init.initialize(2)).

Where did the additional 61 seconds come from? I measured only the lu computation for a specific block,
that is, the time for executing loMatPtr->lu(). It should be the same no matter how many physical threads
are available. I also think it has nothing to do with the overhead of thread creation or the implicit
synchronization at the end of parallel_for.

Could anybody tell me the reason and how to improve the performance?
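
The per-block measurement described above can be sketched with a small timing helper. This is only an illustration: time_call is a hypothetical name, not part of TBB or of the code in this thread, and it measures wall-clock time for a single call.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: runs `work` once and returns the elapsed
// wall-clock time in seconds.
template <typename F>
double time_call(F&& work) {
    auto start = std::chrono::steady_clock::now();
    work();
    auto stop = std::chrono::steady_clock::now();
    double seconds = std::chrono::duration<double>(stop - start).count();
    assert(seconds >= 0.0);  // steady_clock is monotonic
    return seconds;
}
```

Inside the loop of operator() this could be used as time_call([&] { loMatPtr->lu(); }). If the time of a single lu() call grows when a second thread is active, the extra time is probably spent inside lu() itself (for example in the allocator, or waiting on memory) rather than in the scheduler.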


4 Replies
Dmitry_Vyukov
Valued Contributor I
A possible cause of the degradation is false sharing. Try padding your structure like this:

class mat {
void lu();
...
char pad [128];
};
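
As an alternative sketch of the same idea in C++11 and later, alignas can make each object occupy whole cache lines. The 64-byte line size and the Counter/PaddedCounter types are assumptions for illustration, not taken from the original mat class. Padding only helps when different threads write to objects that would otherwise share a line.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed cache-line size; 64 bytes is common on x86, but check your target.
constexpr std::size_t kCacheLine = 64;

// Hypothetical per-thread state without padding: several of these
// fit into one cache line, so neighbours can falsely share it.
struct Counter {
    long value = 0;
};

// alignas rounds both the alignment and the size up to kCacheLine,
// so adjacent array elements never share a cache line.
struct alignas(kCacheLine) PaddedCounter {
    long value = 0;
};

static_assert(sizeof(PaddedCounter) == kCacheLine, "one element per line");
static_assert(alignof(PaddedCounter) == kCacheLine, "starts on a line boundary");

// Returns true when the two addresses fall on different cache lines.
inline bool on_different_lines(const void* a, const void* b) {
    assert(a != nullptr && b != nullptr);
    auto line_a = reinterpret_cast<std::uintptr_t>(a) / kCacheLine;
    auto line_b = reinterpret_cast<std::uintptr_t>(b) / kCacheLine;
    return line_a != line_b;
}
```

With PaddedCounter arr[2], on_different_lines(&arr[0], &arr[1]) holds, whereas two adjacent Counter objects will usually sit on the same line.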

Zhongze_L_
Beginner
I modified the code as per your suggestion. Unfortunately, there was no improvement. mat.lu() makes a lot of malloc calls, and I guess that could be
the problem. I tried both scalable_malloc from TBB and a lock-free malloc library, Hoard. I also used memalign to avoid false sharing. However, the performance was still the same.
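
If the allocations inside lu() really are the bottleneck, one option besides swapping the allocator is to allocate scratch memory once per task and reuse it across calls. The sketch below only illustrates that pattern: Workspace and step_with_workspace are hypothetical names, and whether lu() can be restructured this way is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical scratch space owned by a single task or thread.
struct Workspace {
    std::vector<double> scratch;
    std::size_t allocations = 0;  // how often the buffer had to grow

    // Returns a buffer of at least n doubles, growing only when needed.
    double* acquire(std::size_t n) {
        assert(n > 0);
        if (scratch.size() < n) {
            scratch.resize(n);
            ++allocations;
        }
        return scratch.data();
    }
};

// Stand-in for one step of a factorization that needs n temporaries.
// It reuses the caller's workspace instead of calling malloc each time.
inline double step_with_workspace(Workspace& ws, std::size_t n) {
    double* tmp = ws.acquire(n);
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        tmp[i] = static_cast<double>(i);
        sum += tmp[i];
    }
    return sum;
}
```

Each task would own its own Workspace, so the hot loop touches the allocator only when the buffer first grows and no workspace is shared between threads.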

Dmitry_Vyukov
Valued Contributor I
You could try the following brute-force approach. Run the single-threaded version under a profiler, then run the multi-threaded version under the profiler, and compare the two profiles to identify which parts of the code take longer in the multi-threaded version.
For example:
single-threaded version:
func1() - 40%
func2() - 30%
func3() - 30%

multi-threaded version:
func1() - 80%
func2() - 10%
func3() - 10%

The problem is definitely in func1().

Once you have identified the problematic function, drill down to the machine-code level.
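
The comparison step can be sketched as a small function over two profiles. This is a hypothetical illustration, not the output format of any particular profiler: Profile and largest_increase are made-up names, and the percentages are the ones from the example above.

```cpp
#include <cassert>
#include <map>
#include <string>

// Maps a function name to its share of total run time, in percent.
using Profile = std::map<std::string, double>;

// Returns the function whose share grew the most from `single` to
// `multi`. Functions missing from `single` are treated as 0%.
inline std::string largest_increase(const Profile& single,
                                    const Profile& multi) {
    assert(!multi.empty());
    std::string worst;
    double worst_delta = 0.0;
    bool first = true;
    for (const auto& entry : multi) {
        auto it = single.find(entry.first);
        double before = (it == single.end()) ? 0.0 : it->second;
        double delta = entry.second - before;
        if (first || delta > worst_delta) {
            worst = entry.first;
            worst_delta = delta;
            first = false;
        }
    }
    return worst;
}
```

For the profiles listed above (func1() going from 40% to 80% while func2() and func3() drop from 30% to 10%), the function returns "func1()".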
softarts
Beginner
As you mentioned, mat.lu() accesses a shared resource. That is why the parallel tasks take so much time: each one has to wait for the other tasks to finish with that resource, no matter whether the algorithm is lock-free or lock-based.