Hi all,
I have a program using TBB that runs 4 times faster than its serial version on my Windows Vista machine. But when I compile it on Linux, it runs slower than the serial version on the same Linux machine. Both the Windows and Linux machines have 8 cores and 32 GB of memory. I just wonder what could have caused the big difference here?
Thanks,
Quanli
6 Replies
Such a problem is most likely application-specific, e.g. possibly caused by a missed compile-time optimization. It's hard to say anything definite without any details available, but I can suggest some experiments. For example, what happens if you limit your TBB program to run on a single thread (by using tbb::task_scheduler_init my_tbb_init_object(1))? How does it compare with the serial time and with the "normal" time for the TBB program? Could you use a profiling tool and compare the hot spots for the serial variant and the TBBfied one, to get a clue about the source of the issue?
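For instance, a minimal sketch of that single-thread experiment (the surrounding program structure here is just illustrative, not your actual code):
[cpp]#include "tbb/task_scheduler_init.h"

int main()
{
    // Limit TBB to a single worker thread so the timing can be compared
    // directly against the serial build and the default (8-thread) TBB run.
    tbb::task_scheduler_init my_tbb_init_object(1);

    // ... run the usual TBB code path (parallel_for etc.) here ...
    return 0;
}[/cpp]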
Quoting - Alexey Kukanov (Intel)
-------------------------------------------------------------------------------------------------------------------
For the serial version:
[plain]Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
15.20      4.02     4.02                              SpecialFunctions2::mvtpdf
13.96      7.71     3.69                              SpecialFunctions2::mvnormpdf
12.78     11.09     3.38                              CroutMatrix::lubksb(double*, int)
 7.41     13.05     1.96                              CroutMatrix::ludcmp()
 6.32     14.72     1.67  11200000     0.00     0.00  CDPBase::sample(double*, int, MTRand&)
 3.74     15.71     0.99                              SymmetricMatrix::GetRow(MatrixRowCol&)
 3.37     16.60     0.89                              MatrixRowCol::Copy(MatrixRowCol const&)
 2.69     17.31     0.71                              GeneralMatrix::Evaluate(MatrixType)
 2.38     17.94     0.63                              AddedMatrix::Evaluate(MatrixType)
 1.68     18.39     0.45                              CroutMatrix::Solver(MatrixColX&, MatrixColX const&)
 1.61     18.81     0.43   5600000     0.00     0.00  CDPBase::sampleW
 1.59     19.23     0.42                              GeneralMatrix::GetStore()
 1.49     19.63     0.40   5600000     0.00     0.00  CDPBase::sampleK
 1.44     20.01     0.38      2200     0.17     0.85  CDP::clusterIterate
 1.29     20.35     0.34                              InvertedMatrix::Evaluate
[/plain]
For the TBB version with 1 thread:
[plain]Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
15.49      4.64     4.64                              SpecialFunctions2::mvtpdf
12.75      8.46     3.82                              SpecialFunctions2::mvnormpdf
11.67     11.96     3.50                              CroutMatrix::lubksb
 7.24     14.13     2.17                              CroutMatrix::ludcmp()
 6.91     16.20     2.07  11200000     0.00     0.00  CDPBase::sample
 6.58     18.17     1.97                              tbb::internal::start_for<..., WSampler, ...>::execute()
 3.37     19.18     1.01                              SymmetricMatrix::GetRow(MatrixRowCol&)
 2.54     19.94     0.76                              GeneralMatrix::Evaluate(MatrixType)
 2.30     20.63     0.69                              AddedMatrix::Evaluate(MatrixType)
 2.04     21.24     0.61                              MatrixRowCol::Copy(MatrixRowCol const&)
 1.77     21.77     0.53                              GeneralMatrix::GetStore()
 1.59     22.24     0.48   5600000     0.00     0.00  CDPBase::sampleK
 1.34     22.64     0.40   5600000     0.00     0.00  CDPBase::sampleW
 1.10     22.97     0.33                              CroutMatrix::Solver(MatrixColX&, MatrixColX const&)
 1.07     23.29     0.32                              InvertedMatrix::Evaluate(MatrixType)
[/plain]
-------------------------------------------------------------------------------------------------------------------------------
A parallel_for is called in this example. The TBB overhead shows up in the tbb::internal::start_for<..., WSampler, ...>::execute() entry, which does not seem very alarming. The slowdown only appears when I increase the number of threads to 8.
Any further suggestions are greatly appreciated.
I can also upload the code and data example if you are interested. It is a bit large though.
Thanks.
As I expected, TBB itself does not seem to be the issue. Nor does it seem that missed compiler optimizations cause it, since the single-threaded TBB run does not perform much worse than the serial version.
So it seems to be a scalability problem. A couple of possible reasons are:
- excessive synchronization in the functions running in parallel;
- a data-sharing problem, where the same cache lines are constantly read and written by different threads; it could be either false sharing (the data are in fact independent and just happen to share a cache line) or true sharing (the same data really are changed in parallel).
You should be able to identify the guilty function by comparing profiling data of TBB runs with 1 and more threads, and then possibly go deeper and check the exact lines of code to understand which data accesses could cause trouble.
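To illustrate the false-sharing case, here is a minimal sketch (assuming 64-byte cache lines; the struct and arrays are illustrative, not from your code):
[cpp]// Each thread writes only its own slot, yet neighbouring slots sit on the
// same 64-byte cache line, so the line still bounces between the cores
// ("false sharing").
double results[8];                   // thread t writes results[t]

// Padding each slot to a full cache line removes the false sharing:
struct PaddedSlot {
    double value;
    char pad[64 - sizeof(double)];   // assumes 64-byte cache lines
};
PaddedSlot padded_results[8];        // thread t writes padded_results[t].value
[/cpp]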
Quoting - Alexey Kukanov (Intel)
A few more questions:
1) I have a pointer to an array, and different threads update different segments of that array; there is no chance of two threads updating the same address at the same time. Does your remark about "the same data really changed in parallel" apply in this case?
2) My Linux and Windows boxes have almost identical hardware. That seems to suggest that whether cache lines end up shared is influenced by the operating system. If that is the case, is there an option to avoid it? The same TBB code does perform well on Windows.
3) I noticed that when I increase the grain size, the program does improve on my Linux box, but there is not much change on Windows Vista. So is it possible that the best grain size is sensitive to the operating system?
The code segment that causes trouble is posted below (with minor changes):
------------------------------------------------------------------------------------------------------------------
[cpp]class WSampler {
    CDP *my_cdp;
    CDPResult *my_result;
public:
    void operator () (const blocked_range<size_t> &r) const {
        MTRand tmt;
        tmt.seed();
        RowVector row(my_cdp->mX.Ncols());
        for (size_t i = r.begin(); i != r.end(); ++i) {
            for (int j = 0; j < my_cdp->mX.Ncols(); j++) {
                row[j] = my_cdp->mX[i][j];   // index expressions restored; the forum stripped the [i]/[j] subscripts
            }
            my_result->W[i] = my_cdp->sampleW(row, my_result, my_cdp->prior.nu, tmt);
        }
    }
    WSampler(CDP *cdp, CDPResult *result) : my_cdp(cdp), my_result(result) {}
};[/cpp]
------------------------------------------------------------------------------------------------------------------
Here "my_cdp->sampleW" is a static function, my_cdp->mX and my_result are the shared input data, while my_result->W is a pointer to an array that is being updated by all threads. Each thread has its own random number generator tmt, which is thread safe.
Just wonder if you see anything that is plain wrong.
Thanks.
I see nothing bad in the code you showed. Of course I can't say anything about the functions called from there :)
The gprof data you showed before seem to roughly form a call stack (since the cumulative time for each function is the sum of its self time and the cumulative time of the function above it):
[plain]15.49   4.64   4.64                          SpecialFunctions2::mvtpdf
12.75   8.46   3.82                          SpecialFunctions2::mvnormpdf
11.67  11.96   3.50                          CroutMatrix::lubksb
 7.24  14.13   2.17                          CroutMatrix::ludcmp()
 6.91  16.20   2.07  11200000   0.00   0.00  CDPBase::sample
 6.58  18.17   1.97                          tbb::internal::start_for< ... WSampler ... >::execute()
[/plain]
But those data are for the single thread run. What are the data for the multiple thread run? Which function caused the biggest increase of self time?
Quoting - Alexey Kukanov (Intel)
-------------------------------------------------------------------------------------------------------------------------------------
[plain]  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
11.56      4.94     4.94                              CroutMatrix::lubksb
 9.76      9.11     4.17                              SpecialFunctions2::mvnormpdf
 6.51     11.89     2.78                              InvertedMatrix::Evaluate
 6.17     14.53     2.64                              CroutMatrix::ludcmp
 5.90     17.05     2.52                              MatrixType::New
 3.70     18.63     1.58                              SpecialFunctions2::mvtpdf
 3.70     20.21     1.58                              AddedMatrix::Evaluate
 3.11     21.54     1.33   9985746     0.00     0.00  CDPBase::sample
 3.09     22.86     1.32                              IdentityMatrix::NextCol
 3.02     24.15     1.29                              MultipliedMatrix::Evaluate
 2.77     25.33     1.19                              GeneralMatrix::Evaluate
 2.59     26.44     1.11                              CroutMatrix::Solver
 2.56     27.53     1.10                              GeneralMatrix::operator+=
 2.41     28.56     1.03                              SymmetricMatrix::GetRow
 2.39     29.58     1.02                              SymmetricMatrix::RestoreCol
 2.08     30.47     0.89                              UpperTriangularMatrix::Type
 1.64     31.17     0.70                              MatrixRowCol::Copy
 1.59     31.85     0.68                              GeneralMatrix::NextCol
 1.49     32.49     0.64                              ScaledMatrix::Evaluate
 1.49     33.12     0.64                              CroutMatrix::CroutMatrix
 1.14     33.61     0.49   4633333     0.00     0.00  CDPBase::sampleK
 1.08     34.07     0.46                              GeneralMatrix::tDelete
 1.01     34.50     0.43                              tbb::internal::start_for<..., WSampler, ...>::execute()
 0.96     34.91     0.41      1837     0.22     1.08  CDP::clusterIterate(CDPResult&, MTRand&)
[/plain]
--------------------------------------------------------------------------------------------------------------------------------------
There are two parallel_for's used in the code, and only one (WSampler) is causing trouble. When I disable the auto_partitioner for that one and use a rather large grain size, the performance improves: it now runs about two times faster than the serial version. That is not quite as good as the Windows version (3 times faster), but still satisfactory. The maximum possible improvement would be about 3.5 times if there were no threading overhead.
When I use the same grain size on Windows, there is no noticeable change in performance. So I conclude that the Linux system I use probably has more threading overhead than the Windows system.
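The change amounts to something like the following sketch (kGrainSize and the wrapper name are placeholders, not the values I actually used):
[cpp]#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include "tbb/partitioner.h"

// Explicit grain size plus simple_partitioner instead of auto_partitioner,
// so each task gets a larger chunk of iterations and the per-task overhead
// matters less. kGrainSize is a placeholder value.
const size_t kGrainSize = 1000;

void sampleAllW_coarse(CDP *cdp, CDPResult *result, size_t nSamples)
{
    tbb::parallel_for(tbb::blocked_range<size_t>(0, nSamples, kGrainSize),
                      WSampler(cdp, result), tbb::simple_partitioner());
}[/cpp]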
Anyway, it is now beneficial on both systems to use TBB. Thank you very much for the kind help and patience.