Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Poor performance on Linux?

quanliwang
Beginner
384 Views
Hi all,
I have a program using TBB that runs 4 times faster than its serial version on my Windows Vista machine. But when I compiled it on Linux, it runs slower than its serial version on the same Linux machine. Both the Windows and Linux machines have 8 cores and 32 GB of memory. I just wonder what could have caused the big difference here?
Thanks,
Quanli
6 Replies
Alexey-Kukanov
Employee
It's highly likely that such a problem is application-specific, e.g. possibly caused by a missed compile-time optimization. It's hard to say anything definite without any details available. But I can suggest some experiments. For example, what happens if you limit your TBB program to run on a single thread (by using tbb::task_scheduler_init my_tbb_init_object(1)) - how does it compare with the serial time and with the "normal" time for the TBB program? Could you use a profiling tool and compare hot spots for the serial variant and the TBB-parallelized one, to get any clue on the source of the issue?
quanliwang
Beginner
Thank you very much for the suggestions. When one thread is used, the serial and TBB versions take about the same time, with the TBB version just a bit slower. When more than one thread is used, the slowdown becomes apparent. I used gprof for profiling and here is some output from that:

-------------------------------------------------------------------------------------------------------------------
For serial version with:
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
15.20 4.02 4.02 SpecialFunctions2::mvtpdf
13.96 7.71 3.69 SpecialFunctions2::mvnormpdf
12.78 11.09 3.38 CroutMatrix::lubksb(double*, int)
7.41 13.05 1.96 CroutMatrix::ludcmp()
6.32 14.72 1.67 11200000 0.00 0.00 CDPBase::sample(double*, int, MTRand&)
3.74 15.71 0.99 SymmetricMatrix::GetRow(MatrixRowCol&)
3.37 16.60 0.89 MatrixRowCol::Copy(MatrixRowCol const&)
2.69 17.31 0.71 GeneralMatrix::Evaluate(MatrixType)
2.38 17.94 0.63 AddedMatrix::Evaluate(MatrixType)
1.68 18.39 0.45 CroutMatrix::Solver(MatrixColX&, MatrixColX const&)
1.61 18.81 0.43 5600000 0.00 0.00 CDPBase::sampleW
1.59 19.23 0.42 GeneralMatrix::GetStore()
1.49 19.63 0.40 5600000 0.00 0.00 CDPBase::sampleK
1.44 20.01 0.38 2200 0.17 0.85 CDP::clusterIterate
1.29 20.35 0.34 InvertedMatrix::Evaluate


For TBB version with 1 thread:
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
15.49 4.64 4.64 SpecialFunctions2::mvtpdf
12.75 8.46 3.82 SpecialFunctions2::mvnormpdf
11.67 11.96 3.50 CroutMatrix::lubksb
7.24 14.13 2.17 CroutMatrix::ludcmp()
6.91 16.20 2.07 11200000 0.00 0.00 CDPBase::sample
6.58 18.17 1.97 tbb::internal::start_for<tbb::blocked_range<size_t>, WSampler, tbb::simple_partitioner>::execute()
3.37 19.18 1.01 SymmetricMatrix::GetRow(MatrixRowCol&)
2.54 19.94 0.76 GeneralMatrix::Evaluate(MatrixType)
2.30 20.63 0.69 AddedMatrix::Evaluate(MatrixType)
2.04 21.24 0.61 MatrixRowCol::Copy(MatrixRowCol const&)
1.77 21.77 0.53 GeneralMatrix::GetStore()
1.59 22.24 0.48 5600000 0.00 0.00 CDPBase::sampleK
1.34 22.64 0.40 5600000 0.00 0.00 CDPBase::sampleW
1.10 22.97 0.33 CroutMatrix::Solver(MatrixColX&, MatrixColX const&)
1.07 23.29 0.32 InvertedMatrix::Evaluate(MatrixType)

-------------------------------------------------------------------------------------------------------------------------------

A parallel_for is called in this example. The TBB overhead (the start_for line above) seems not very alarming. When I increase the number of threads to 8, the percentage of time spent in tbb::internal::start_for<tbb::blocked_range<size_t>, WSampler, tbb::simple_partitioner>::execute() decreases to about 1%.

Any further suggestions are greatly appreciated.

I can also upload the code and data example if you are interested. It is a bit large though.
Thanks.

Alexey-Kukanov
Employee
As I expected, TBB itself does not seem to be the issue. Neither does it seem that missed compiler optimizations cause the issue, since the single-threaded run with TBB does not perform much worse.
So it seems to be a scalability problem. A couple of possible reasons are:
- excessive synchronization in the functions running in parallel;
- a data-sharing problem, where the same cache lines are constantly read and written by different threads; it could be either false sharing (when the data are in fact independent and just happen to share a cache line) or true sharing (when the same data really are changed in parallel).

You should be able to identify the guilty function by comparing the profiling data of TBB runs with 1 and more threads, and then possibly go deeper and check the exact lines of code to understand which data accesses could cause trouble.
quanliwang
Beginner
Really appreciate your help. It is most likely a data-sharing problem. I don't have any explicit synchronization in the code. All threads need to read the same data source and then update one segment of an array, which did not cause any problem on Windows Vista with the VC2005 compiler, and this fact puzzled me.
A few more questions:
1) I have a pointer to an array, and different threads update different segments of that array; there is no chance of two threads updating one address at the same time. Does your remark "the same data really are changed in parallel" apply in this case?
2) My Linux and Windows boxes have almost identical hardware, which seems to suggest that the cache-line sharing is a decision made by the operating system. If that is the case, is there an option to avoid it? The same TBB code does perform well on Windows.
3) I noticed that when I increase the grain size, the program does improve on my Linux box, but there is not much change on Windows Vista. So is it possible that the grain size is sensitive to the operating system?

The code segment that causes trouble is posted below (with minor changes):
------------------------------------------------------------------------------------------------------------------
[cpp]class WSampler
{
  CDP *my_cdp;
  CDPResult *my_result;
public:
  void operator () (const blocked_range < size_t > &r) const
  {
    MTRand tmt;
      tmt.seed ();
    RowVector row (my_cdp->mX.Ncols ());
    for (size_t i = r.begin (); i != r.end (); ++i)
      {
        for (int j = 0; j < my_cdp->mX.Ncols (); j++)
          {
            row = my_cdp->mX;
          }
        my_result->W[i] =
          my_cdp->sampleW (row, my_result, my_cdp->prior.nu, tmt);
      }
  }
  WSampler (CDP * cdp, CDPResult * result):my_cdp (cdp), my_result (result)
  {
  }
};[/cpp]

------------------------------------------------------------------------------------------------------------------

Here "my_cdp->sampleW" is a static function; my_cdp->mX and my_result are the shared input data, while my_result->W is a pointer to an array that is updated by all threads. Each thread has its own random number generator tmt, which is thread-safe.
Just wonder if you see anything that is plain wrong.

Thanks.
Alexey-Kukanov
Employee
I see nothing bad in the code you showed. Of course I can't say anything about the functions called from there :)

The gprof data you showed before seem to roughly form the call stack (since total time for each function is the sum of its self time and total time of the above function):

[plain]15.49 4.64 4.64 SpecialFunctions2::mvtpdf
12.75 8.46 3.82 SpecialFunctions2::mvnormpdf
11.67 11.96 3.50 CroutMatrix::lubksb
7.24 14.13 2.17 CroutMatrix::ludcmp()
6.91 16.20 2.07 11200000 0.00 0.00 CDPBase::sample
6.58 18.17 1.97 tbb::internal::start_for< ... WSampler ... >::execute()
[/plain]

But those data are for the single thread run. What are the data for the multiple thread run? Which function caused the biggest increase of self time?
quanliwang
Beginner
Here is the gprof data when 8 threads (the number of CPUs) are used. Nothing seems particularly outstanding to me.
-------------------------------------------------------------------------------------------------------------------------------------
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
11.56 4.94 4.94 CroutMatrix::lubksb
9.76 9.11 4.17 SpecialFunctions2::mvnormpdf
6.51 11.89 2.78 InvertedMatrix::Evaluate
6.17 14.53 2.64 CroutMatrix::ludcmp
5.90 17.05 2.52 MatrixType::New
3.70 18.63 1.58 SpecialFunctions2::mvtpdf
3.70 20.21 1.58 AddedMatrix::Evaluate
3.11 21.54 1.33 9985746 0.00 0.00 CDPBase::sample
3.09 22.86 1.32 IdentityMatrix::NextCol
3.02 24.15 1.29 MultipliedMatrix::Evaluate
2.77 25.33 1.19 GeneralMatrix::Evaluate
2.59 26.44 1.11 CroutMatrix::Solver
2.56 27.53 1.10 GeneralMatrix::operator+=
2.41 28.56 1.03 SymmetricMatrix::GetRow
2.39 29.58 1.02 SymmetricMatrix::RestoreCol
2.08 30.47 0.89 UpperTriangularMatrix::Type
1.64 31.17 0.70 MatrixRowCol::Copy
1.59 31.85 0.68 GeneralMatrix::NextCol
1.49 32.49 0.64 ScaledMatrix::Evaluate
1.49 33.12 0.64 CroutMatrix::CroutMatrix
1.14 33.61 0.49 4633333 0.00 0.00 CDPBase::sampleK
1.08 34.07 0.46 GeneralMatrix::tDelete
1.01 34.50 0.43 tbb::internal::start_for<tbb::blocked_range<size_t>, WSampler, tbb::simple_partitioner>::execute()
0.96 34.91 0.41 1837 0.22 1.08 CDP::clusterIterate(CDPResult&, MTRand&)
--------------------------------------------------------------------------------------------------------------------------------------

There are two parallel_for's used in the code and only one of them (WSampler) is causing trouble. When I disable the auto_partitioner for it and use a rather large grain size, the performance improves, and the program indeed runs about two times faster than the serial version. That is not quite as good as the Windows version (3 times faster), but still satisfactory; the maximum possible improvement would be about 3.5 times if there were no threading overhead.
When I used the same grain size on Windows, there was no noticeable change in performance. So I came to the conclusion that maybe the Linux system I used has more threading overhead than the Windows system.
Anyway, it is now beneficial on both systems to use TBB. Thank you very much for the kind help and patience.


