This is about a performance degradation I see in my application after changing how it uses tbb::task. I have written two sample programs that replicate the changes I made to the application and, most importantly, reproduce the same performance degradation.
#include "tbb/task.h"

// Serial Fibonacci sum
long SerialFib( long n ) {
    if( n < 2 )
        return n;
    else
        return SerialFib(n-1) + SerialFib(n-2);
}

static const int CutOff = 16;

// Parallel Fibonacci sum using tbb::task
struct FibTask: public tbb::task {
public:
    long n;
    long x, y;
    long* sum;
    bool is_continuation;
    // Initializers listed in declaration order.
    FibTask( long n_, long* sum_ ) :
        n(n_), x(0), y(0), sum(sum_), is_continuation(false)
    {}
    tbb::task* execute()
    {
        if( is_continuation ) {
            // Second pass: both children have finished, so combine their results.
            *sum = x+y;
            return NULL;
        }
        else
        {
            if( n < CutOff ) {
                // Below the cutoff, compute serially.
                *sum = SerialFib(n);
                return NULL;
            }
            else {
                FibTask& a = *new(allocate_child()) FibTask( n-2, &x);
                FibTask& b = *new(allocate_child()) FibTask( n-1, &y);
                recycle_as_continuation();
                is_continuation = true;
                // Set ref_count to "two children".
                set_ref_count(2);
                spawn( b );
                return &a;
            }
        }
    }
};

long ParallelFib( long n ) {
    long sum;
    FibTask& a = *new(tbb::task::allocate_root()) FibTask( n, &sum);
    tbb::task::spawn_root_and_wait(a);
    return sum;
}
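To put the per-task overhead question in perspective, here is a minimal sketch that counts how many FibTask objects one ParallelFib(n) call would allocate, assuming the cutoff test is `n < CutOff` as in the code above. The function name `CountFibTasks` is hypothetical, and the sketch uses no TBB at all; it only mirrors the recursion shape.

```cpp
#include <cassert>

// Hypothetical helper: number of task objects created for Fib(n) when every
// task with n >= cutoff allocates two children (n-1 and n-2) and every task
// with n < cutoff falls back to the serial routine.
long CountFibTasks(long n, long cutoff) {
    assert(n >= 0 && cutoff >= 2);
    if (n < cutoff)
        return 1;  // leaf task: runs the serial code, allocates nothing more
    // This task itself plus everything created below its two children.
    return 1 + CountFibTasks(n - 1, cutoff) + CountFibTasks(n - 2, cutoff);
}
```

With n = 36 and a cutoff of 16 this gives 57313 tasks per call, so any fixed cost per task is paid tens of thousands of times in each ParallelFib(36).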
Here is the main program :
Main Program 1
int main( int argc, char *argv[])
{
    long n = 36;
    int nrTasks = 1; // ( 10, 100, 1000)
    long result = 0; // declared outside the loop so it can be printed below
    std::cout << "Computing Fib( " << n << " )" << std::endl;
    TaskSchedulerInit init;
    tbb::tick_count t0 = tbb::tick_count::now();
    for( int i = 0; i < nrTasks; ++i)
        result = ParallelFib( n);
    tbb::tick_count t1 = tbb::tick_count::now();
    std::cout << "Result :: " << result << " Time Taken :: " << (t1 - t0).seconds() << std::endl;
    return 0;
}
The results of main program 1 on a quad-core machine, with nrTasks set to 1, 10, 100 and 1000, are as follows:
nrTasks    Total time taken (seconds)
1          0.071
10         0.70
100        6.99
1000       69.83
But if I change the program as follows :
Main Program 2
static const int cacheLineSize = 64;
static const int JumpFactor = cacheLineSize / sizeof(long);
int main( int argc, char *argv[])
{
    long n = 36;
    int nrTasks = 1; // (10, 100, 1000)
    std::cout << "Computing Fib( " << n << " )" << std::endl;
    std::cout << "Number of Tasks submitted by input thread => " << nrTasks << "." << std::endl;
    // One cache line per result slot. <long*> and <char> are assumed here, since the size is given in bytes.
    long* pSums = reinterpret_cast<long*>( tbb::cache_aligned_allocator<char>().allocate( nrTasks * cacheLineSize));
    memset( pSums, 0, nrTasks * cacheLineSize);
    TaskSchedulerInit init;
    tbb::tick_count t0 = tbb::tick_count::now();
    EmptyTask& a = *new( tbb::task::allocate_root()) EmptyTask();
    a.set_ref_count(1);
    for( int i = 0; i < nrTasks; ++i) {
        FibTask& b = *new ( a.allocate_additional_child_of( a)) FibTask( n, &pSums[i * JumpFactor]);
        a.spawn( b);
    }
    a.wait_for_all();
    a.destroy( a);
    tbb::tick_count t1 = tbb::tick_count::now();
    std::cout << "Time Taken :: " << (t1 - t0).seconds() << std::endl;
    std::cout << "Validating Results .. " << std::endl;
    long sum = SerialFib( n);
    for( int i = 0; i < nrTasks; ++i) {
        if( sum != pSums[i * JumpFactor]) {
            std::cout << "Error in the Fibonacci calculation for task " << i << std::endl;
            throw; // no exception is active here, so this calls std::terminate()
        }
    }
    std::cout << "Validation Passed" << std::endl;
    tbb::cache_aligned_allocator<char>().deallocate( reinterpret_cast<char*>(pSums), 0);
    return 0;
}
The results of main program 2 are as follows:
nrTasks    Total time taken (seconds)
1          0.080
10         0.80
100        8.015
1000       80.164
The results of main program 1 are as expected, but main program 2 is noticeably slower. It looks like a per-task overhead is causing the degradation (or is it something else?).
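As a quick check of how large the gap is, the totals posted above can be turned into a per-call time and a relative slowdown. This is only arithmetic on the numbers in the two tables; the helper names `SecondsPerCall` and `RelativeSlowdown` are made up for illustration.

```cpp
#include <cassert>

// Average wall-clock seconds per ParallelFib call, given a total and a count.
double SecondsPerCall(double totalSeconds, int nrTasks) {
    assert(nrTasks > 0);
    return totalSeconds / nrTasks;
}

// Relative slowdown of `candidate` versus `baseline`, e.g. 0.10 means 10% slower.
double RelativeSlowdown(double baselineSeconds, double candidateSeconds) {
    assert(baselineSeconds > 0.0);
    return candidateSeconds / baselineSeconds - 1.0;
}
```

For the nrTasks = 1000 rows, SecondsPerCall(69.83, 1000) is about 0.0698 s for program 1 and SecondsPerCall(80.164, 1000) is about 0.0802 s for program 2, and RelativeSlowdown(69.83, 80.164) comes out at roughly 0.148, i.e. program 2 is close to 15% slower.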
The difference between main program 1 and main program 2 is that in program 1 the main thread waits for each task to complete before submitting the next one, whereas in program 2 the main thread spawns all the tasks and then waits for all of them to complete.
Is there a way to make main program 2 perform like main program 1, given that the main thread must spawn all the tasks before it can start working on them?
What happens if you use parallel_for instead in the second program?
Quoting - Raf Schietekat
What happens if you use parallel_for instead in the second program?
Here is the third program, using parallel_for (I hope this is what you meant).
class ApplyTasks {
public:
    // <size_t> is assumed for blocked_range, matching the loop index type below.
    void operator()( const tbb::blocked_range<size_t>& r ) const {
        for( size_t i = r.begin(); i != r.end(); ++i )
            sum[i * JumpFactor] = ParallelFib(n);
    }
    ApplyTasks( long n_, long* sum_) : n(n_), sum(sum_) {}
    long n;
    long* sum;
};
int main( int argc, char *argv[])
{
    long n = 36;
    int nrTasks = 1; // (10, 100, 1000)
    std::cout << "Computing Fib( " << n << " )" << std::endl;
    std::cout << "Number of Tasks submitted by input thread => " << nrTasks << "." << std::endl;
    // <long*> and <char> are assumed here, since the size is given in bytes.
    long* pSums = reinterpret_cast<long*>( tbb::cache_aligned_allocator<char>().allocate( nrTasks * cacheLineSize));
    memset( pSums, 0, nrTasks * cacheLineSize);
    TaskSchedulerInit init;
    tbb::tick_count t0 = tbb::tick_count::now();
    // Arguments assumed from the ApplyTasks constructor and from program 4.
    tbb::parallel_for( tbb::blocked_range<size_t>( 0, nrTasks), ApplyTasks( n, pSums));
    tbb::tick_count t1 = tbb::tick_count::now();
    std::cout << "Time Taken :: " << (t1 - t0).seconds() << std::endl;
    std::cout << "Validating Results .. " << std::endl;
    long sum = SerialFib( n);
    for( int i = 0; i < nrTasks; ++i) {
        if( sum != pSums[i * JumpFactor]) {
            std::cout << "Error in the Fibonacci calculation for task " << i << std::endl;
            throw; // no exception is active here, so this calls std::terminate()
        }
    }
    std::cout << "Validation Passed" << std::endl;
    tbb::cache_aligned_allocator<char>().deallocate( reinterpret_cast<char*>(pSums), 0);
    return 0;
}
And here are the results:
nrTasks    Total time taken (seconds)
1          0.078
10         0.78
100        7.8
1000       78.3
Are you sure that there is performance degradation?
Here are the results for the first program, obtained on my quad-core:
Computing Fib( 36 )
Number of Tasks submitted by input thread => 1.
Time Taken :: 0.0789379
Number of Tasks submitted by input thread => 10.
Time Taken :: 0.797832
Number of Tasks submitted by input thread => 100.
Time Taken :: 8.13045
Number of Tasks submitted by input thread => 1000.
Time Taken :: 79.8137
And here are the results for the second program:
Computing Fib( 36 )
Number of Tasks submitted by input thread => 1.
Time Taken :: 0.0798096
Number of Tasks submitted by input thread => 10.
Time Taken :: 0.796995
Number of Tasks submitted by input thread => 100.
Time Taken :: 8.133
Number of Tasks submitted by input thread => 1000.
Time Taken :: 79.9169
I do not see any significant performance degradation.
Please re-measure the first program.
Maybe this is what you meant, because this version seems to be fine.
Here is program 4:
class SpawnTasks {
public:
    // <size_t> is assumed for blocked_range, matching the loop index type below.
    void operator()( const tbb::blocked_range<size_t>& r ) const {
        for( size_t i = r.begin(); i != r.end(); ++i ) {
            // Each FibTask becomes an additional child of the shared barrier task.
            FibTask& b = *new( barrier->allocate_additional_child_of(*barrier)) FibTask( n, &sum[i * JumpFactor]);
            barrier->spawn( b);
        }
    }
    SpawnTasks( long n_, long* sum_, EmptyTask* barrier_) : n(n_), sum(sum_), barrier(barrier_) {}
    long n;
    long* sum;
    EmptyTask* barrier;
};
int main( int argc, char *argv[])
{
    long n = argc>1 ? strtol(argv[1],0,0) : 36;
    int nrTasks = argc>2 ? strtol(argv[2],0,0) : 100;
    std::cout << "Computing Fib( " << n << " )" << std::endl;
    std::cout << "Number of Tasks submitted by input thread => " << nrTasks << "." << std::endl;
    // <long*> and <char> are assumed here, since the size is given in bytes.
    long* pSums = reinterpret_cast<long*>( tbb::cache_aligned_allocator<char>().allocate( nrTasks * cacheLineSize));
    memset( pSums, 0, nrTasks * cacheLineSize);
    TaskSchedulerInit init;
    tbb::tick_count t0 = tbb::tick_count::now();
    EmptyTask& a = *new( tbb::task::allocate_root()) EmptyTask();
    a.set_ref_count(1);
    tbb::parallel_for( tbb::blocked_range<size_t>( 0, nrTasks), SpawnTasks( n, pSums, &a));
    a.wait_for_all();
    a.destroy( a);
    tbb::tick_count t1 = tbb::tick_count::now();
    std::cout << "Time Taken :: " << (t1 - t0).seconds() << std::endl;
    std::cout << "Validating Results .. " << std::endl;
    long sum = SerialFib( n);
    for( int i = 0; i < nrTasks; ++i) {
        if( sum != pSums[i * JumpFactor]) {
            std::cout << "Error in the Fibonacci calculation for task " << i << std::endl;
            throw; // no exception is active here, so this calls std::terminate()
        }
    }
    std::cout << "Validation Passed" << std::endl;
    tbb::cache_aligned_allocator<char>().deallocate( reinterpret_cast<char*>(pSums), 0);
    return 0;
}
And here are the results:
nrTasks    Total time taken (seconds)
1          0.071
10         0.708
100        7.06
1000       70.66
I have no clue what's going on anyway. Why is this better than program 2?
Quoting - Dmitriy Vyukov
Are you sure that there is performance degradation?
Here is the results for the first program I obtained on my quad-core:
Computing Fib( 36 )
Number of Tasks submitted by input thread => 1.
Time Taken :: 0.0789379
Number of Tasks submitted by input thread => 10.
Time Taken :: 0.797832
Number of Tasks submitted by input thread => 100.
Time Taken :: 8.13045
Number of Tasks submitted by input thread => 1000.
Time Taken :: 79.8137
And here is for the second program:
Computing Fib( 36 )
Number of Tasks submitted by input thread => 1.
Time Taken :: 0.0798096
Number of Tasks submitted by input thread => 10.
Time Taken :: 0.796995
Number of Tasks submitted by input thread => 100.
Time Taken :: 8.133
Number of Tasks submitted by input thread => 1000.
Time Taken :: 79.9169
I do not see any significant performance degradation.
Please, re-measure the first program.
I use Visual Studio 2005. Does this have anything to do with compiler optimizations?
Quoting - Shankar
And here are the results
1 0.071
10 0.708
100 7.06
1000 70.66
I have no clue whats going on any way. Why is this better than program 2?
I still get the same results for this program:
Computing Fib( 36 )
Number of Tasks submitted by input thread => 1.
Time Taken :: 0.0798366
Number of Tasks submitted by input thread => 10.
Time Taken :: 0.995838
Number of Tasks submitted by input thread => 100.
Time Taken :: 7.97388
Number of Tasks submitted by input thread => 1000.
Time Taken :: 79.7281
They are all pretty much the same on my quad-core machine...
FYI. Intel Core2 Quad CPU Q6600 @ 2.40 GHz 2.39 GHz, 3.86GB of RAM is the hardware that I use.
Quoting - Dmitriy Vyukov
I still get the same results for this program:
Computing Fib( 36 )
Number of Tasks submitted by input thread => 1.
Time Taken :: 0.0798366
Number of Tasks submitted by input thread => 10.
Time Taken :: 0.995838
Number of Tasks submitted by input thread => 100.
Time Taken :: 7.97388
Number of Tasks submitted by input thread => 1000.
Time Taken :: 79.7281
They are all pretty much the same on my quad-core machine...
As I said before I have no clue :)
Quoting - Shankar
I have no clue whats going on any way. Why is this better than program 2?
There is some difference as to how tasks are allocated and spawned.
In program 2, all root Fib tasks are allocated in one thread, so there is some potential for false sharing. Also, all the tasks are initially placed into a single task deque, so there is some increased contention during stealing.
The version with parallel_for allocates and spawns tasks in a distributed manner, so it does not have the problems mentioned above.
BUT I do NOT think that these problems account for such a big performance difference, because most of the time is spent in the Fibonacci calculation anyway.
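For readers unfamiliar with false sharing: the programs in this thread already keep each result slot on its own cache line by indexing with JumpFactor. The sketch below shows the same idea expressed as a padded type, assuming a 64-byte cache line as in the posted code. `PaddedSum`, `kCacheLineSize` and `OnDifferentCacheLines` are hypothetical names, and this only covers the result slots, not how TBB lays out the task objects themselves.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed cache line size, matching cacheLineSize = 64 in the thread.
constexpr std::size_t kCacheLineSize = 64;

// One result per cache line: alignas rounds both the alignment and the size
// of the struct up to a full line, so neighbours never share a line.
struct alignas(kCacheLineSize) PaddedSum {
    long value;
};

static_assert(sizeof(PaddedSum) == kCacheLineSize,
              "each PaddedSum should occupy exactly one cache line");

// Returns true when the two addresses fall on different cache lines.
bool OnDifferentCacheLines(const void* a, const void* b) {
    std::uintptr_t lineA = reinterpret_cast<std::uintptr_t>(a) / kCacheLineSize;
    std::uintptr_t lineB = reinterpret_cast<std::uintptr_t>(b) / kCacheLineSize;
    return lineA != lineB;
}
```

With an array such as `PaddedSum sums[4]`, `OnDifferentCacheLines(&sums[0], &sums[1])` holds for every adjacent pair, which is the property that `pSums[i * JumpFactor]` provides in the posted programs.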
Quoting - Dmitriy Vyukov
There is some difference as to how tasks are allocated and spawned.
In program 2 all root fib tasks are allocated in one thread, thus some potential for false sharing. Also all the tasks are initially placed into single task deque, thus some increased contention during stealing.
Version with parallel_for allocate and spawn tasks in a distributed manner, so no above-mentioned problems.
BUT I do NOT think that above-mentioned problems account for such a big performance difference, because most of the time is spent in fibonachi calculation anyway.
You may also try the following code:
[cpp]struct SpawnTask : tbb::task
{
    int count;
    int n;
    long* result;
    SpawnTask(int count, int n, long* result)
        : count(count)
        , n(n)
        , result(result)
    {}
    virtual tbb::task* execute()
    {
        if (count == 1)
        {
            set_ref_count(2);
            spawn_and_wait_for_all(*new(allocate_child()) FibTask( n, result));
        }
        else if (count == 2)
        {
            set_ref_count(3);
            spawn(*new(allocate_child()) FibTask( n, result));
            spawn_and_wait_for_all(*new(allocate_child()) FibTask( n, result + JumpFactor));
        }
        else
        {
            int count2 = count / 2;
            set_ref_count(3);
            spawn(*new(allocate_child()) SpawnTask(count2, n, result));
            spawn_and_wait_for_all(*new(allocate_child()) SpawnTask(count - count2, n, result + count2 * JumpFactor));
        }
        return 0;
    }
};
[/cpp]
P.S. I also tested on a Q6600.
Quoting - Dmitriy Vyukov
You may also try following code:
[cpp]struct SpawnTask : tbb::task
{
int count;
int n;
long* result;
SpawnTask(int count, int n, long* result)
: count(count)
, n(n)
, result(result)
{}
virtual tbb::task* execute()
{
if (count == 1)
{
set_ref_count(2);
spawn_and_wait_for_all(*new(allocate_child()) FibTask( n, result));
}
else if (count == 2)
{
set_ref_count(3);
spawn(*new(allocate_child()) FibTask( n, result));
spawn_and_wait_for_all(*new(allocate_child()) FibTask( n, result + JumpFactor));
}
else
{
int count2 = count / 2;
set_ref_count(3);
spawn(*new(allocate_child()) SpawnTask(count2, n, result));
spawn_and_wait_for_all(*new(allocate_child()) SpawnTask(count - count2, n, result + count2 * JumpFactor));
}
return 0;
}
};
[/cpp]
p.s. I also test on Q6600.
What is the count that you pass to the constructor here? Also, how is the main program that uses SpawnTask written? Does it use parallel_for?
Quoting - Shankar
what is count here that you pass to the constructor. Also how is the main program written that uses SpawnTasks. Does it use parallel_for?
No, just:
SpawnTask& a = *new(tbb::task::allocate_root()) SpawnTask(nrTasks, n, pSums);
tbb::task::spawn_root_and_wait(a);
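To make the role of count concrete, here is a small TBB-free sketch that follows the same three branches as SpawnTask::execute() but, instead of spawning FibTask children, records which result slot each leaf would write to. The names `SplitSlots` and `CoversEachSlotOnce` are hypothetical, slots are counted in units of JumpFactor, and count is assumed to be at least 1 (the posted code has no branch for count == 0).

```cpp
#include <cassert>
#include <vector>

// Mirrors the branch structure of SpawnTask::execute(): halve `count` until
// one or two leaves remain, and note which slot each leaf would fill.
void SplitSlots(int count, int firstSlot, std::vector<int>& hits) {
    assert(count >= 1);
    if (count == 1) {
        ++hits[firstSlot];
    } else if (count == 2) {
        ++hits[firstSlot];
        ++hits[firstSlot + 1];
    } else {
        int count2 = count / 2;
        SplitSlots(count2, firstSlot, hits);                   // left half
        SplitSlots(count - count2, firstSlot + count2, hits);  // right half
    }
}

// Returns true if the recursion visits every slot in [0, count) exactly once.
bool CoversEachSlotOnce(int count) {
    std::vector<int> hits(count, 0);
    SplitSlots(count, 0, hits);
    for (int h : hits)
        if (h != 1)
            return false;
    return true;
}
```

So count is simply the number of Fib computations a subtree is responsible for; passing nrTasks at the root yields nrTasks leaves, created by a binary tree of spawning tasks rather than by a single loop in the main thread.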
Run the following test:
[cpp]// Template arguments (<long*>, <char>, <char*>, <size_t>) are assumed; they are missing from the post as displayed.
int main()
{
    long n = 36;
    tbb::task_scheduler_init init;
    int task_counts[] = {10, 50, 100};
    std::cout << "count:\t";
    for (int idx = 0; idx != 3; idx += 1)
        std::cout << task_counts[idx] << "\t";
    std::cout << std::endl;
    for (int test_count = 0; test_count != 3; test_count += 1)
    {
        // #1: sequential calls to ParallelFib
        std::cout << "#1:\t";
        for (int idx = 0; idx != 3; idx += 1)
        {
            int nrTasks = task_counts[idx];
            tbb::tick_count t0 = tbb::tick_count::now();
            for( int i = 0; i < nrTasks; ++i)
                long result = ParallelFib( n);
            tbb::tick_count t1 = tbb::tick_count::now();
            std::cout << std::setprecision(4) << (t1 - t0).seconds() << "\t";
        }
        std::cout << std::endl;
        // #2: all FibTasks spawned from the main thread
        std::cout << "#2:\t";
        for (int idx = 0; idx != 3; idx += 1)
        {
            int nrTasks = task_counts[idx];
            tbb::tick_count t0 = tbb::tick_count::now();
            long* pSums = reinterpret_cast<long*>( tbb::cache_aligned_allocator<char>().allocate( nrTasks * cacheLineSize));
            memset( pSums, 0, nrTasks * cacheLineSize);
            tbb::empty_task& a = *new( tbb::task::allocate_root()) tbb::empty_task();
            a.set_ref_count(1);
            for( int i = 0; i < nrTasks; ++i) {
                FibTask& b = *new ( a.allocate_additional_child_of( a)) FibTask( n, &pSums[i * JumpFactor]);
                a.spawn( b);
            }
            a.wait_for_all();
            a.destroy( a);
            tbb::cache_aligned_allocator<char>().deallocate( reinterpret_cast<char*>(pSums), 0);
            tbb::tick_count t1 = tbb::tick_count::now();
            std::cout << std::setprecision(4) << (t1 - t0).seconds() << "\t";
        }
        std::cout << std::endl;
        // #3: FibTasks spawned from within parallel_for
        std::cout << "#3:\t";
        for (int idx = 0; idx != 3; idx += 1)
        {
            int nrTasks = task_counts[idx];
            tbb::tick_count t0 = tbb::tick_count::now();
            long* pSums = reinterpret_cast<long*>( tbb::cache_aligned_allocator<char>().allocate( nrTasks * cacheLineSize));
            memset( pSums, 0, nrTasks * cacheLineSize);
            tbb::empty_task& a = *new( tbb::task::allocate_root()) tbb::empty_task();
            a.set_ref_count(1);
            tbb::parallel_for( tbb::blocked_range<size_t>( 0, nrTasks), SpawnTasks( n, pSums, &a));
            a.wait_for_all();
            a.destroy( a);
            tbb::cache_aligned_allocator<char>().deallocate( reinterpret_cast<char*>(pSums), 0);
            tbb::tick_count t1 = tbb::tick_count::now();
            std::cout << std::setprecision(4) << (t1 - t0).seconds() << "\t";
        }
        std::cout << std::endl;
        // #4: recursive SpawnTask tree
        std::cout << "#4:\t";
        for (int idx = 0; idx != 3; idx += 1)
        {
            int nrTasks = task_counts[idx];
            tbb::tick_count t0 = tbb::tick_count::now();
            long* pSums = reinterpret_cast<long*>( tbb::cache_aligned_allocator<char>().allocate( nrTasks * cacheLineSize));
            memset( pSums, 0, nrTasks * cacheLineSize);
            SpawnTask& a = *new(tbb::task::allocate_root()) SpawnTask(nrTasks, n, pSums);
            tbb::task::spawn_root_and_wait(a);
            tbb::cache_aligned_allocator<char>().deallocate( reinterpret_cast<char*>(pSums), 0);
            tbb::tick_count t1 = tbb::tick_count::now();
            std::cout << std::setprecision(4) << (t1 - t0).seconds() << "\t";
        }
        std::cout << std::endl;
    }
}
[/cpp]
Quoting - Dmitriy Vyukov
Run following test:
My results are:
[cpp]count: 10 50 100
#1: 0.9601 3.988 7.95
#2: 0.7952 3.998 7.968
#3: 0.7949 4.003 7.92
#4: 0.7897 3.953 7.921
#1: 0.8067 3.987 7.979
#2: 0.8078 4.074 8.31
#3: 0.8302 4.206 8.468
#4: 0.8568 4.282 8.651
#1: 0.8695 4.392 8.858
#2: 0.8912 4.458 8.983
#3: 0.9028 4.527 9.135
#4: 0.9097 4.583 9.253[/cpp]
MSVC2008, release build, Q6600, TBB2.2
On first run all variants take 7.9 secs.
Quoting - Dmitriy Vyukov
My results are:
[cpp]count: 10 50 100
#1: 0.9601 3.988 7.95
#2: 0.7952 3.998 7.968
#3: 0.7949 4.003 7.92
#4: 0.7897 3.953 7.921
#1: 0.8067 3.987 7.979
#2: 0.8078 4.074 8.31
#3: 0.8302 4.206 8.468
#4: 0.8568 4.282 8.651
#1: 0.8695 4.392 8.858
#2: 0.8912 4.458 8.983
#3: 0.9028 4.527 9.135
#4: 0.9097 4.583 9.253[/cpp]
MSVC2008, release build, Q6600, TBB2.2
On first run all variants take 7.9 secs.
They look exactly like the results you got. :)
Here are the results:
[cpp]count: 10 50 100
#1: 0.8022 4.003 8.152
#2: 0.8 4.119 8.009
#3: 0.7991 3.989 7.98
#4: 0.8006 3.986 7.981
#1: 0.7984 3.993 7.986
#2: 0.7992 4.004 8.002
#3: 0.7984 3.994 7.992
#4: 0.8007 4.145 7.996
#1: 0.7987 3.992 7.973
#2: 0.801 4.006 8.006
#3: 0.7987 3.996 7.983
#4: 0.7992 3.985 8[/cpp]
However, I would like you to try running this file, TestScheduler.cpp.
What I have done here is comment out options #2, #3 and #4 in your code. I know this might sound weird, but the results when I run it are as follows:
[cpp]count: 10 50 100
#1: 0.7038 3.512 7.043
#1: 0.697 3.517 6.991
#1: 0.7063 3.512 7.036[/cpp]
I would like to know whether you get the same results as I do.
Hmm... this is getting interesting.
When I comment out everything except #1, I still get the same result: 8 seconds.
By the way, I switched to MSVC2005 and tried playing with compiler switches. The results are the same: all variants take roughly 8 seconds.
Hmm... try inspecting the assembly of FibTask::execute() when everything except #1 is commented out and when it is not. Is there any difference? You may also try turning off LTCG (link-time code generation) and moving FibTask::execute() to another .cpp file.
Quoting - Dmitriy Vyukov
Humm.... this is getting interesting.
When comment out everything except #1 I still get the same result - 8 seconds.
Btw, I switched to MSVC2005, and tried to play with compiler switches. Results are the same - all variants take roughly 8 seconds.
Humm... try to investigate assembly of FibTask::execute() when everything except #1 is commented out and when it is not. Is there any difference? Also you may try to turn off LTCG (Link time code generation) and move FibTask::execute() to another cpp file.
Hi Dmitriy,
I think this suggestion of yours helped in some way. I did the following steps
* Moved the FibTask, Spawner and SpawnTask classes to a file named FibTask.h. I kept only the method declarations in the .h file and moved the method implementations to FibTask.cpp.
* I also moved the SerialFib and ParallelFib declarations to FibTask.h and their implementations to FibTask.cpp.
* Now only my main function remains in TestMain.cpp, and its code is unchanged.
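The declaration/implementation split described in these steps can be sketched as follows. It is shown as a single listing for brevity, with comments marking which part would go into the header and which into the .cpp file; `WorkItem` and `SerialFibOutOfLine` are hypothetical stand-ins, not the actual classes from the post, and no TBB is involved.

```cpp
#include <cassert>

// ---- would live in the header (FibTask.h in the post) ----
// Only declarations: the virtual method has no body here, so the compiler
// cannot inline it into translation units that merely include the header.
struct WorkItem {
    explicit WorkItem(long n_);
    virtual ~WorkItem();
    virtual long execute();
    long n;
};
long SerialFibOutOfLine(long n);

// ---- would live in the source file (FibTask.cpp in the post) ----
WorkItem::WorkItem(long n_) : n(n_) {}
WorkItem::~WorkItem() {}

long WorkItem::execute() {
    return SerialFibOutOfLine(n);
}

long SerialFibOutOfLine(long n) {
    assert(n >= 0);
    return n < 2 ? n : SerialFibOutOfLine(n - 1) + SerialFibOutOfLine(n - 2);
}
```

With this layout, whether code from the .cpp file is optimized together with the caller in TestMain.cpp depends on whole-program options such as /GL and /LTCG, which is what the two builds below vary.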
Then I switched off the /GL and /LTCG options and ran the two tests (one where only #1 is run and another where all of #1, #2, #3 and #4 are run).
After that I switched /GL and /LTCG back on and ran the two tests again.
Here are the results now:
[cpp]1. Program compiled with the /GL and /LTCG options switched off
option A Running only # 1 ( #2, #3, #4 commented)
count: 10 50 100
#1: 0.7955 3.979 7.961
#1: 0.7957 3.981 7.997
#1: 0.8337 4.331 7.994
option B Running #1, #2, #3, #4
count: 10 50 100
#1: 0.7576 3.771 7.517
#2: 0.75 3.742 7.499
#3: 0.7944 3.835 7.518
#4: 0.7524 3.705 7.483
#1: 0.7438 3.688 7.367
#2: 0.7357 3.664 7.37
#3: 0.7383 3.686 7.404
#4: 0.7541 3.761 7.48
#1: 0.7537 3.724 7.427
#2: 0.7537 3.731 7.454
#3: 0.7507 3.738 7.504
#4: 0.7526 3.72 7.853
2. Program compiled with the /GL and /LTCG options switched on
option A Running only # 1 ( #2, #3, #4 commented)
count: 10 50 100
#1: 0.6983 3.493 6.994
#1: 0.6994 3.493 6.981
#1: 0.7032 3.508 7.013
option B Running #1, #2, #3, #4
count: 10 50 100
#1: 0.7027 3.519 6.995
#2: 0.6967 3.494 7
#3: 0.6969 3.491 7
#4: 0.7008 3.489 6.993
#1: 0.6984 3.508 7.001
#2: 0.6977 3.497 6.99
#3: 0.7027 3.492 6.99
#4: 0.6986 3.488 6.989
#1: 0.7008 3.498 7.016
#2: 0.6981 3.49 6.99
#3: 0.7016 3.495 6.985
#4: 0.6992 3.495 7.112
[/cpp]
So the part of your suggestion that helped was moving the implementation of FibTask, Spawner and SpawnTask to another .cpp file.
Now all the different options (#1, #2, #3, #4) perform equally (as you expect), and in all of them each task takes only about 0.070 sec (as I want).
At least the problem is solved now. But I'm not clear on why it was solved by moving the implementation to another .cpp file.
Is this because of replication of the task's vtable or something similar? I ran into that problem when I wrapped the tbb::task class (by deriving from tbb::task) and overrode the virtual method note_affinity, providing the implementation in the .h file itself. The solution came from the comments above the function void task::note_affinity( affinity_id ) defined in task.cpp.