There are two matrix classes, mat and pmat. Here is the code fragment:
class mat {
public:
    void lu();
    ...
};
class pmat {
public:
    mat **obMatPtr;
    void operator()(const blocked_range<int> &r) const {
        mat *loMatPtr;
        for (int i = r.begin(); i != r.end(); ++i)
        {
            loMatPtr = obMatPtr[i];
            loMatPtr->lu();
        }
    }
    ...
};
pmat loPmat;
void parallel_lu(...)
{
    parallel_for(blocked_range<int>(0, nblocks), loPmat, auto_partitioner());
}
The code worked correctly. At first, I ran the program with one thread on a dual-core machine (tbb::task_scheduler_init init(tbb::task_scheduler_init::deferred), ..., init.initialize(1)). The execution time for performing lu on block 0 was 118 seconds. The wall-clock time became 179 seconds when I ran it with two threads (init.initialize(2)).
Where did the additional 61 seconds come from? I measured the time for just the lu computation on a specific block, that is, the time for executing loMatPtr->lu(). It should be the same no matter how many physical threads are available. I also thought it had nothing to do with the overhead caused by thread creation and the implicit synchronization at the end of parallel_for.
Could anybody tell me the reason and how to improve the performance?
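For reference, a minimal sketch of the per-block measurement described above, using tbb::tick_count (the helper function and the printf are illustrative, not the original measurement code; it assumes the pmat fragment above with a public obMatPtr):

#include "tbb/tick_count.h"
#include <cstdio>

// Time one call to lu() on block 0, independently of how many
// worker threads the scheduler was initialized with.
void time_block0(pmat &p)   // hypothetical helper name
{
    tbb::tick_count t0 = tbb::tick_count::now();
    p.obMatPtr[0]->lu();
    tbb::tick_count t1 = tbb::tick_count::now();
    std::printf("lu on block 0: %.1f s\n", (t1 - t0).seconds());
}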
4 Replies
A possible cause of the degradation is false sharing. Try padding your structure:
class mat {
void lu();
...
char pad [128];
};
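An alternative to manual padding, in case one prefers not to grow the class by hand: allocate each block through TBB's cache_aligned_allocator, which both aligns and pads every allocation to cache-line boundaries, so two separately allocated mats can never share a line. A sketch (the placement-new usage is illustrative, not from the thread):

#include "tbb/cache_aligned_allocator.h"
#include <new>

tbb::cache_aligned_allocator<mat> alloc;
mat *m = alloc.allocate(1);   // cache-aligned, cache-line-padded storage
new (m) mat;                  // construct the block in place
// ... m->lu() ...
m->~mat();                    // destroy and release
alloc.deallocate(m, 1);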
I modified the code as per your suggestion. Unfortunately, there was no improvement. Lots of mallocs are called inside mat.lu(); I guess that could be the problem. I used both scalable_malloc from TBB and a lock-free malloc library, Hoard. I also used memalign to avoid false sharing, but the performance was still the same.
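For concreteness, this is the kind of substitution meant by "used scalable_malloc in TBB": replacing the plain malloc/free calls inside lu() with TBB's scalable allocator. A sketch; the member n and the scratch buffer are hypothetical placeholders for whatever lu() really allocates:

#include "tbb/scalable_allocator.h"

void mat::lu()
{
    // per-thread heaps: no global allocator lock shared between workers
    double *work = (double *)scalable_malloc(n * sizeof(double));
    // ... factorization using the scratch buffer ...
    scalable_free(work);
}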
You may try the following brute-force approach: run the single-threaded version under a profiler, then run the multi-threaded version under a profiler, compare the profiles, and identify which parts of the code execute for longer in the multi-threaded version.
For example:
single-threaded version:
func1() - 40%
func2() - 30%
func3() - 30%
multi-threaded version:
func1() - 80%
func2() - 10%
func3() - 10%
The problem is definitely in func1(). Once you have identified the problematic function, drill down to the machine-code level.
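If no profiler is at hand, a crude version of the same breakdown can be hand-rolled with tbb::tick_count. In the sketch below, func1/func2/func3 are the hypothetical hot functions from the example above, with trivial stand-in bodies:

#include "tbb/tick_count.h"
#include <cstdio>

// Stand-ins for the real hot functions; replace with the actual code.
void func1() { for (volatile int i = 0; i < 100000000; ++i) {} }
void func2() { for (volatile int i = 0; i < 50000000; ++i) {} }
void func3() { for (volatile int i = 0; i < 50000000; ++i) {} }

int main()
{
    void (*funcs[3])() = { func1, func2, func3 };
    double t[3], total = 0.0;
    for (int i = 0; i < 3; ++i) {
        tbb::tick_count t0 = tbb::tick_count::now();
        funcs[i]();
        t[i] = (tbb::tick_count::now() - t0).seconds();
        total += t[i];
    }
    for (int i = 0; i < 3; ++i)   // profile-style percentage breakdown
        std::printf("func%d() - %.0f%%\n", i + 1, 100.0 * t[i] / total);
    return 0;
}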
Quoting - zhongzel@gmail.com
As you mentioned, mat.lu() accesses a shared resource; that is why the parallel tasks consume so much time: they have to wait for the other tasks to complete (no matter whether the algorithm is lock-free or lock-based).
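To illustrate the point: if every allocation inside lu() funnels through one shared lock, the two workers serialize on it, and the parallel run can take longer than the serial one. A sketch of the effect (tbb::mutex here merely stands in for an allocator's internal lock; this is not code from the thread):

#include "tbb/mutex.h"
#include <cstdlib>

static tbb::mutex heap_mutex;   // stands in for the allocator's internal lock

// Concurrent lu() calls that allocate all wait on each other here.
void *locked_alloc(std::size_t n)
{
    tbb::mutex::scoped_lock lock(heap_mutex);
    return std::malloc(n);
}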