Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Performance degradation

Shankar1
Beginner
893 Views

This is about a performance degradation that I see in my application after changing the way I use tbb::task. I have written two sample programs that best replicate the changes I made to my application; importantly, they also reproduce the performance degradation I see in the application itself.

// Serial Fibonacci sum
long SerialFib( long n ) {
    if( n < 2 )
        return n;
    else
        return SerialFib(n-1) + SerialFib(n-2);
}

static const int CutOff = 16;

// Parallel Fibonacci sum using tbb::task (continuation-passing style)
struct FibTask: public tbb::task {
public:
    long n;
    long x, y;
    long* sum;
    bool is_continuation;

    FibTask( long n_, long* sum_ ) :
        n(n_), sum(sum_), is_continuation(false), x(0), y(0)
    {}

    tbb::task* execute()
    {
        if( is_continuation ) {
            *sum = x + y;
            return NULL;
        }
        else
        {
            if( n < CutOff ) {
                *sum = SerialFib(n);
                return NULL;
            }
            else {
                FibTask& a = *new( allocate_child()) FibTask( n-2, &x);
                FibTask& b = *new( allocate_child()) FibTask( n-1, &y);
                recycle_as_continuation();
                is_continuation = true;
                // Set ref_count to "two children".
                set_ref_count(2);
                spawn( b );
                return &a;
            }
        }
    }
};

long ParallelFib( long n ) {
    long sum;
    FibTask& a = *new( tbb::task::allocate_root()) FibTask( n, &sum);
    tbb::task::spawn_root_and_wait(a);
    return sum;
}

Here is the main program:
Main Program 1

int main( int argc, char *argv[])
{
    long n = 36;
    int nrTasks = 1; // (10, 100, 1000)
    std::cout << "Computing Fib( " << n << " )" << std::endl;

    tbb::task_scheduler_init init;
    long result = 0;
    tbb::tick_count t0 = tbb::tick_count::now();
    for( int i = 0; i < nrTasks; ++i)
        result = ParallelFib( n);
    tbb::tick_count t1 = tbb::tick_count::now();
    std::cout << "Result :: " << result << " Time Taken :: " << (t1 - t0).seconds() << std::endl;
    return 0;
}

The results of Main Program 1 on a quad-core machine, with nrTasks varied over 1, 10, 100 and 1000, are as follows:

nrTasks    Total time taken (seconds)
1          0.071
10         0.70
100        6.99
1000       69.83

But if I change the program as follows:
Main Program 2

static const int cacheLineSize = 64;
static const int JumpFactor = cacheLineSize / sizeof(long);

int main( int argc, char *argv[])
{
    long n = 36;
    int nrTasks = 1; // (10, 100, 1000)

    std::cout << "Computing Fib( " << n << " )" << std::endl;
    std::cout << "Number of Tasks submitted by input thread => " << nrTasks << "." << std::endl;

    long* pSums = reinterpret_cast<long*>( tbb::cache_aligned_allocator<char>().allocate( nrTasks * cacheLineSize));
    memset( pSums, 0, nrTasks * cacheLineSize);

    tbb::task_scheduler_init init;
    tbb::tick_count t0 = tbb::tick_count::now();
    EmptyTask& a = *new( tbb::task::allocate_root()) EmptyTask();
    a.set_ref_count(1);
    for( int i = 0; i < nrTasks; ++i) {
        FibTask& b = *new( a.allocate_additional_child_of( a)) FibTask( n, &pSums[i * JumpFactor]);
        a.spawn( b);
    }
    a.wait_for_all();
    a.destroy( a);
    tbb::tick_count t1 = tbb::tick_count::now();
    std::cout << "Time Taken :: " << (t1 - t0).seconds() << std::endl;

    std::cout << "Validating Results .. " << std::endl;
    long sum = SerialFib( n);
    for( int i = 0; i < nrTasks; ++i) {
        if( sum != pSums[i * JumpFactor]) {
            std::cout << "Error in the Fibonacci calculation for task " << i << std::endl;
            throw;
        }
    }
    std::cout << "Validation Passed" << std::endl;
    tbb::cache_aligned_allocator<char>().deallocate( reinterpret_cast<char*>(pSums), 0);
    return 0;
}

The results of Main Program 2 are as follows:

nrTasks    Total time taken (seconds)
1          0.080
10         0.80
100        8.015
1000       80.164

The results of Main Program 1 are as expected. However, the results of Main Program 2 show some performance degradation. It looks like per-task overhead is causing the degradation (or is it something else?).

The difference between Main Program 1 and Main Program 2 is that in Program 1 the main thread waits for each task to complete before spawning the next, whereas in Program 2 the main thread spawns all the tasks first and then waits for all of them to complete.

Is there a way to make Main Program 2 perform like Main Program 1, given that the main thread should spawn all the tasks before it starts working on the spawned tasks?
0 Kudos
1 Solution
Dmitry_Vyukov
Valued Contributor I
893 Views
Humm.... this is getting interesting.

When I comment out everything except #1, I still get the same result - 8 seconds.
Btw, I switched to MSVC2005 and tried playing with compiler switches. The results are the same - all variants take roughly 8 seconds.

Humm... try investigating the assembly of FibTask::execute() when everything except #1 is commented out and when it is not. Is there any difference? You may also try turning off LTCG (link-time code generation) and moving FibTask::execute() to another .cpp file.

View solution in original post

0 Kudos
18 Replies
RafSchietekat
Valued Contributor III
893 Views
What happens if you use parallel_for instead in the second program?
0 Kudos
Shankar1
Beginner
893 Views
Quoting - Raf Schietekat
What happens if you use parallel_for instead in the second program?

Here is a third program using parallel_for (I hope this is what you meant).

class ApplyTasks {
public:
    void operator()( const tbb::blocked_range<size_t>& r ) const {
        for( size_t i = r.begin(); i != r.end(); ++i )
            sum[i * JumpFactor] = ParallelFib(n);
    }

    ApplyTasks( long n_, long* sum_) : n(n_), sum(sum_) {}

    long n;
    long* sum;
};

int main( int argc, char *argv[])
{
    long n = 36;
    int nrTasks = 1; // (10, 100, 1000)

    std::cout << "Computing Fib( " << n << " )" << std::endl;
    std::cout << "Number of Tasks submitted by input thread => " << nrTasks << "." << std::endl;

    long* pSums = reinterpret_cast<long*>( tbb::cache_aligned_allocator<char>().allocate( nrTasks * cacheLineSize));
    memset( pSums, 0, nrTasks * cacheLineSize);

    tbb::task_scheduler_init init;
    tbb::tick_count t0 = tbb::tick_count::now();
    tbb::parallel_for( tbb::blocked_range<size_t>( 0, nrTasks), ApplyTasks( n, pSums));
    tbb::tick_count t1 = tbb::tick_count::now();
    std::cout << "Time Taken :: " << (t1 - t0).seconds() << std::endl;

    std::cout << "Validating Results .. " << std::endl;
    long sum = SerialFib( n);
    for( int i = 0; i < nrTasks; ++i) {
        if( sum != pSums[i * JumpFactor]) {
            std::cout << "Error in the Fibonacci calculation for task " << i << std::endl;
            throw;
        }
    }
    std::cout << "Validation Passed" << std::endl;
    tbb::cache_aligned_allocator<char>().deallocate( reinterpret_cast<char*>(pSums), 0);
    return 0;
}

And here are the results:

nrTasks    Total time taken (seconds)
1          0.078
10         0.78
100        7.8
1000       78.3

0 Kudos
Dmitry_Vyukov
Valued Contributor I
893 Views
Are you sure that there is performance degradation?
Here are the results for the first program, obtained on my quad-core:
Computing Fib( 36 )
Number of Tasks submitted by input thread => 1.
Time Taken :: 0.0789379
Number of Tasks submitted by input thread => 10.
Time Taken :: 0.797832
Number of Tasks submitted by input thread => 100.
Time Taken :: 8.13045
Number of Tasks submitted by input thread => 1000.
Time Taken :: 79.8137

And here are the results for the second program:
Computing Fib( 36 )
Number of Tasks submitted by input thread => 1.
Time Taken :: 0.0798096
Number of Tasks submitted by input thread => 10.
Time Taken :: 0.796995
Number of Tasks submitted by input thread => 100.
Time Taken :: 8.133
Number of Tasks submitted by input thread => 1000.
Time Taken :: 79.9169

I do not see any significant performance degradation.
Please, re-measure the first program.

0 Kudos
Shankar1
Beginner
893 Views
Maybe this is what you meant, because this one seems to be fine.

Here is program 4

class SpawnTasks {
public:
    void operator()( const tbb::blocked_range<size_t>& r ) const {
        for( size_t i = r.begin(); i != r.end(); ++i ) {
            FibTask& b = *new( barrier->allocate_additional_child_of( *barrier)) FibTask( n, &sum[i * JumpFactor]);
            barrier->spawn( b);
        }
    }

    SpawnTasks( long n_, long* sum_, EmptyTask* barrier_) : n(n_), sum(sum_), barrier(barrier_) {}

    long n;
    long* sum;
    EmptyTask* barrier;
};

int main( int argc, char *argv[])
{
    long n = argc > 1 ? strtol(argv[1], 0, 0) : 36;
    int nrTasks = argc > 2 ? strtol(argv[2], 0, 0) : 100;

    std::cout << "Computing Fib( " << n << " )" << std::endl;
    std::cout << "Number of Tasks submitted by input thread => " << nrTasks << "." << std::endl;

    long* pSums = reinterpret_cast<long*>( tbb::cache_aligned_allocator<char>().allocate( nrTasks * cacheLineSize));
    memset( pSums, 0, nrTasks * cacheLineSize);

    tbb::task_scheduler_init init;
    tbb::tick_count t0 = tbb::tick_count::now();
    EmptyTask& a = *new( tbb::task::allocate_root()) EmptyTask();
    a.set_ref_count(1);
    tbb::parallel_for( tbb::blocked_range<size_t>( 0, nrTasks), SpawnTasks( n, pSums, &a));
    a.wait_for_all();
    a.destroy( a);
    tbb::tick_count t1 = tbb::tick_count::now();
    std::cout << "Time Taken :: " << (t1 - t0).seconds() << std::endl;

    std::cout << "Validating Results .. " << std::endl;
    long sum = SerialFib( n);
    for( int i = 0; i < nrTasks; ++i) {
        if( sum != pSums[i * JumpFactor]) {
            std::cout << "Error in the Fibonacci calculation for task " << i << std::endl;
            throw;
        }
    }
    std::cout << "Validation Passed" << std::endl;
    tbb::cache_aligned_allocator<char>().deallocate( reinterpret_cast<char*>(pSums), 0);
    return 0;
}

And here are the results:

nrTasks    Total time taken (seconds)
1          0.071
10         0.708
100        7.06
1000       70.66


I have no clue what's going on, anyway. Why is this better than Program 2?
0 Kudos
Shankar1
Beginner
893 Views
Quoting - Dmitriy Vyukov
Are you sure that there is performance degradation?
Here are the results for the first program, obtained on my quad-core:
Computing Fib( 36 )
Number of Tasks submitted by input thread => 1.
Time Taken :: 0.0789379
Number of Tasks submitted by input thread => 10.
Time Taken :: 0.797832
Number of Tasks submitted by input thread => 100.
Time Taken :: 8.13045
Number of Tasks submitted by input thread => 1000.
Time Taken :: 79.8137

And here are the results for the second program:
Computing Fib( 36 )
Number of Tasks submitted by input thread => 1.
Time Taken :: 0.0798096
Number of Tasks submitted by input thread => 10.
Time Taken :: 0.796995
Number of Tasks submitted by input thread => 100.
Time Taken :: 8.133
Number of Tasks submitted by input thread => 1000.
Time Taken :: 79.9169

I do not see any significant performance degradation.
Please, re-measure the first program.

I re-measured everything now; the results are still the same.

I use Visual Studio 2005. Does this have anything to do with compiler optimizations?
0 Kudos
Dmitry_Vyukov
Valued Contributor I
893 Views
Quoting - Shankar
And here are the results

1 0.071
10 0.708
100 7.06
1000 70.66


I have no clue what's going on, anyway. Why is this better than Program 2?

I still get the same results for this program:

Computing Fib( 36 )
Number of Tasks submitted by input thread => 1.
Time Taken :: 0.0798366
Number of Tasks submitted by input thread => 10.
Time Taken :: 0.995838
Number of Tasks submitted by input thread => 100.
Time Taken :: 7.97388
Number of Tasks submitted by input thread => 1000.
Time Taken :: 79.7281

They are all pretty much the same on my quad-core machine...

0 Kudos
Shankar1
Beginner
893 Views
FYI: the hardware I use is an Intel Core 2 Quad CPU Q6600 @ 2.40 GHz with 3.86 GB of RAM.
0 Kudos
Shankar1
Beginner
893 Views
Quoting - Dmitriy Vyukov

I still get the same results for this program:

Computing Fib( 36 )
Number of Tasks submitted by input thread => 1.
Time Taken :: 0.0798366
Number of Tasks submitted by input thread => 10.
Time Taken :: 0.995838
Number of Tasks submitted by input thread => 100.
Time Taken :: 7.97388
Number of Tasks submitted by input thread => 1000.
Time Taken :: 79.7281

They are all pretty much the same on my quad-core machine...

Strange that the same code behaves differently on our machines. More confusion now.
As I said before, I have no clue :)
0 Kudos
Dmitry_Vyukov
Valued Contributor I
893 Views
Quoting - Shankar
I have no clue what's going on, anyway. Why is this better than Program 2?

There is some difference in how the tasks are allocated and spawned.
In Program 2, all root Fib tasks are allocated in one thread, so there is some potential for false sharing. Also, all the tasks are initially placed into a single task deque, so there is some increased contention during stealing.
The version with parallel_for allocates and spawns the tasks in a distributed manner, so it avoids the above problems.
BUT I do NOT think these problems account for such a big performance difference, because most of the time is spent in the Fibonacci calculation anyway.
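
To make the false-sharing point concrete: it would bite if the per-task sums were adjacent longs written by different worker threads; giving each result its own cache line (which the JumpFactor indexing in your programs already does) avoids it. A minimal sketch of the two layouts:

[cpp]// Adjacent longs: results i and i+1 share a 64-byte cache line, so writes
// from different worker threads keep invalidating each other's cache lines.
long unpadded[1000];

// Padded: each result sits on its own cache line, so no false sharing.
struct PaddedSum {
    long value;
    char pad[cacheLineSize - sizeof(long)];
};
PaddedSum padded[1000];
[/cpp]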


0 Kudos
Dmitry_Vyukov
Valued Contributor I
893 Views
Quoting - Dmitriy Vyukov
There is some difference in how the tasks are allocated and spawned.
In Program 2, all root Fib tasks are allocated in one thread, so there is some potential for false sharing. Also, all the tasks are initially placed into a single task deque, so there is some increased contention during stealing.
The version with parallel_for allocates and spawns the tasks in a distributed manner, so it avoids the above problems.
BUT I do NOT think these problems account for such a big performance difference, because most of the time is spent in the Fibonacci calculation anyway.


You may also try the following code:

[cpp]struct SpawnTask : tbb::task
{
    int count;
    int n;
    long* result;

    SpawnTask(int count, int n, long* result)
        : count(count)
        , n(n)
        , result(result)
    {}

    virtual tbb::task* execute()
    {
        if (count == 1)
        {
            set_ref_count(2);
            spawn_and_wait_for_all(*new(allocate_child()) FibTask( n, result));
        }
        else if (count == 2)
        {
            set_ref_count(3);
            spawn(*new(allocate_child()) FibTask( n, result));
            spawn_and_wait_for_all(*new(allocate_child()) FibTask( n, result + JumpFactor));
        }
        else
        {
            int count2 = count / 2;
            set_ref_count(3);
            spawn(*new(allocate_child()) SpawnTask(count2, n, result));
            spawn_and_wait_for_all(*new(allocate_child()) SpawnTask(count - count2, n, result + count2 * JumpFactor));
        }
        return 0;
    }
};
[/cpp]

p.s. I also test on Q6600.
0 Kudos
Shankar1
Beginner
893 Views
Quoting - Dmitriy Vyukov

You may also try following code:

[cpp]struct SpawnTask : tbb::task
{
    int count;
    int n;
    long* result;

    SpawnTask(int count, int n, long* result)
        : count(count)
        , n(n)
        , result(result)
    {}

    virtual tbb::task* execute()
    {
        if (count == 1)
        {
            set_ref_count(2);
            spawn_and_wait_for_all(*new(allocate_child()) FibTask( n, result));
        }
        else if (count == 2)
        {
            set_ref_count(3);
            spawn(*new(allocate_child()) FibTask( n, result));
            spawn_and_wait_for_all(*new(allocate_child()) FibTask( n, result + JumpFactor));
        }
        else
        {
            int count2 = count / 2;
            set_ref_count(3);
            spawn(*new(allocate_child()) SpawnTask(count2, n, result));
            spawn_and_wait_for_all(*new(allocate_child()) SpawnTask(count - count2, n, result + count2 * JumpFactor));
        }
        return 0;
    }
};
[/cpp]

p.s. I also test on Q6600.

What is the count that you pass to the constructor here? Also, how is the main program that uses SpawnTask written? Does it use parallel_for?
0 Kudos
Dmitry_Vyukov
Valued Contributor I
893 Views
Quoting - Shankar
What is the count that you pass to the constructor here? Also, how is the main program that uses SpawnTask written? Does it use parallel_for?

No, just:
SpawnTask& a = *new(tbb::task::allocate_root()) SpawnTask(nrTasks, n, pSums);
tbb::task::spawn_root_and_wait(a);

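count is just the number of Fib tasks that a SpawnTask subtree is responsible for (nrTasks at the root; it is halved at each level). Put together with the allocation from your Program 2, the driver would look roughly like this (a sketch; it reuses pSums, cacheLineSize and JumpFactor as defined earlier):

[cpp]long* pSums = reinterpret_cast<long*>( tbb::cache_aligned_allocator<char>().allocate( nrTasks * cacheLineSize));
memset( pSums, 0, nrTasks * cacheLineSize);

tbb::tick_count t0 = tbb::tick_count::now();
// The root SpawnTask recursively splits [0, nrTasks) and spawns one FibTask per slot.
SpawnTask& a = *new( tbb::task::allocate_root()) SpawnTask( nrTasks, n, pSums);
tbb::task::spawn_root_and_wait( a);
tbb::tick_count t1 = tbb::tick_count::now();
std::cout << "Time Taken :: " << (t1 - t0).seconds() << std::endl;

tbb::cache_aligned_allocator<char>().deallocate( reinterpret_cast<char*>(pSums), 0);
[/cpp]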
0 Kudos
Dmitry_Vyukov
Valued Contributor I
893 Views
Run the following test:

[cpp]int main()
{
    long n = 36;
    tbb::task_scheduler_init init;
    int task_counts[] = {10, 50, 100};

    std::cout << "count:\t";
    for (int idx = 0; idx != 3; idx += 1)
        std::cout << task_counts[idx] << "\t";
    std::cout << std::endl;

    for (int test_count = 0; test_count != 3; test_count += 1)
    {
        std::cout << "#1:\t";
        for (int idx = 0; idx != 3; idx += 1)
        {
            int nrTasks = task_counts[idx];
            tbb::tick_count t0 = tbb::tick_count::now();

            for( int i = 0; i < nrTasks; ++i)
                long result = ParallelFib( n);

            tbb::tick_count t1 = tbb::tick_count::now();
            std::cout << std::setprecision(4) << (t1 - t0).seconds() << "\t";
        }
        std::cout << std::endl;

        std::cout << "#2:\t";
        for (int idx = 0; idx != 3; idx += 1)
        {
            int nrTasks = task_counts[idx];
            tbb::tick_count t0 = tbb::tick_count::now();

            long* pSums = reinterpret_cast<long*>( tbb::cache_aligned_allocator<char>().allocate( nrTasks * cacheLineSize));
            memset( pSums, 0, nrTasks * cacheLineSize);
            tbb::empty_task& a = *new( tbb::task::allocate_root()) tbb::empty_task();
            a.set_ref_count(1);
            for( int i = 0; i < nrTasks; ++i) {
                FibTask& b = *new ( a.allocate_additional_child_of( a)) FibTask( n, &pSums[i * JumpFactor]);
                a.spawn( b);
            }
            a.wait_for_all();
            a.destroy( a);
            tbb::cache_aligned_allocator<char>().deallocate( reinterpret_cast<char*>(pSums), 0);

            tbb::tick_count t1 = tbb::tick_count::now();
            std::cout << std::setprecision(4) << (t1 - t0).seconds() << "\t";
        }
        std::cout << std::endl;

        std::cout << "#3:\t";
        for (int idx = 0; idx != 3; idx += 1)
        {
            int nrTasks = task_counts[idx];
            tbb::tick_count t0 = tbb::tick_count::now();

            long* pSums = reinterpret_cast<long*>( tbb::cache_aligned_allocator<char>().allocate( nrTasks * cacheLineSize));
            memset( pSums, 0, nrTasks * cacheLineSize);
            tbb::empty_task& a = *new( tbb::task::allocate_root()) tbb::empty_task();
            a.set_ref_count(1);
            tbb::parallel_for( tbb::blocked_range<size_t>( 0, nrTasks), SpawnTasks( n, pSums, &a));
            a.wait_for_all();
            a.destroy( a);
            tbb::cache_aligned_allocator<char>().deallocate( reinterpret_cast<char*>(pSums), 0);

            tbb::tick_count t1 = tbb::tick_count::now();
            std::cout << std::setprecision(4) << (t1 - t0).seconds() << "\t";
        }
        std::cout << std::endl;

        std::cout << "#4:\t";
        for (int idx = 0; idx != 3; idx += 1)
        {
            int nrTasks = task_counts[idx];
            tbb::tick_count t0 = tbb::tick_count::now();

            long* pSums = reinterpret_cast<long*>( tbb::cache_aligned_allocator<char>().allocate( nrTasks * cacheLineSize));
            memset( pSums, 0, nrTasks * cacheLineSize);
            SpawnTask& a = *new(tbb::task::allocate_root()) SpawnTask(nrTasks, n, pSums);
            tbb::task::spawn_root_and_wait(a);
            tbb::cache_aligned_allocator<char>().deallocate( reinterpret_cast<char*>(pSums), 0);

            tbb::tick_count t1 = tbb::tick_count::now();
            std::cout << std::setprecision(4) << (t1 - t0).seconds() << "\t";
        }
        std::cout << std::endl;
    }
}
[/cpp]

0 Kudos
Dmitry_Vyukov
Valued Contributor I
893 Views
Quoting - Dmitriy Vyukov
Run the following test:


My results are:
[cpp]count:  10      50      100
#1:     0.9601  3.988   7.95
#2:     0.7952  3.998   7.968
#3:     0.7949  4.003   7.92
#4:     0.7897  3.953   7.921
#1:     0.8067  3.987   7.979
#2:     0.8078  4.074   8.31
#3:     0.8302  4.206   8.468
#4:     0.8568  4.282   8.651
#1:     0.8695  4.392   8.858
#2:     0.8912  4.458   8.983
#3:     0.9028  4.527   9.135
#4:     0.9097  4.583   9.253[/cpp]

MSVC2008, release build, Q6600, TBB2.2
On the first run, all variants take about 7.9 seconds.

0 Kudos
Shankar1
Beginner
893 Views
Quoting - Dmitriy Vyukov

My results are:
[cpp]count:  10      50      100
#1: 0.9601 3.988 7.95
#2: 0.7952 3.998 7.968
#3: 0.7949 4.003 7.92
#4: 0.7897 3.953 7.921
#1: 0.8067 3.987 7.979
#2: 0.8078 4.074 8.31
#3: 0.8302 4.206 8.468
#4: 0.8568 4.282 8.651
#1: 0.8695 4.392 8.858
#2: 0.8912 4.458 8.983
#3: 0.9028 4.527 9.135
#4: 0.9097 4.583 9.253[/cpp]

MSVC2008, release build, Q6600, TBB2.2
On first run all variants take 7.9 secs.

Here are the results:

[cpp]count:  10      50      100
#1:     0.8022  4.003   8.152
#2:     0.8     4.119   8.009
#3:     0.7991  3.989   7.98
#4:     0.8006  3.986   7.981
#1:     0.7984  3.993   7.986
#2:     0.7992  4.004   8.002
#3:     0.7984  3.994   7.992
#4:     0.8007  4.145   7.996
#1:     0.7987  3.992   7.973
#2:     0.801   4.006   8.006
#3:     0.7987  3.996   7.983
#4:     0.7992  3.985   8[/cpp]

They look exactly like the results you got. :)

However, I would like you to try running this file, TestScheduler.cpp. What I have done there is comment out options #2, #3 and #4 in your code. I know this might sound weird, but the results when I run it are as follows:

[cpp]count:  10      50      100
#1:     0.7038  3.512   7.043
#1:     0.697   3.517   6.991
#1:     0.7063  3.512   7.036[/cpp]

I would like to know whether you get the same results as mine.
0 Kudos
Dmitry_Vyukov
Valued Contributor I
894 Views
Humm.... this is getting interesting.

When I comment out everything except #1, I still get the same result - 8 seconds.
Btw, I switched to MSVC2005 and tried playing with compiler switches. The results are the same - all variants take roughly 8 seconds.

Humm... try investigating the assembly of FibTask::execute() when everything except #1 is commented out and when it is not. Is there any difference? You may also try turning off LTCG (link-time code generation) and moving FibTask::execute() to another .cpp file.

0 Kudos
Shankar1
Beginner
893 Views
Quoting - Dmitriy Vyukov
Humm.... this is getting interesting.

When I comment out everything except #1, I still get the same result - 8 seconds.
Btw, I switched to MSVC2005 and tried playing with compiler switches. The results are the same - all variants take roughly 8 seconds.

Humm... try investigating the assembly of FibTask::execute() when everything except #1 is commented out and when it is not. Is there any difference? You may also try turning off LTCG (link-time code generation) and moving FibTask::execute() to another .cpp file.


Hi Dmitriy,

I think this suggestion of yours helped. I did the following steps:

* Moved the FibTask, Spawner and SpawnTask classes to a file named FibTask.h. I kept only the declarations of the methods in the .h file and moved the implementations of the methods to FibTask.cpp.

* Also moved the SerialFib and ParallelFib declarations to FibTask.h and their implementations to FibTask.cpp.

* Now only my main function is in TestMain.cpp, and that code is unchanged.

Then I switched off the /GL and /LTCG compiler options and ran the two tests (one where only #1 runs, and one where all of #1, #2, #3 and #4 run). After that I switched /GL and /LTCG back on and ran the two tests again.

Here are the results now:


[cpp]1. Program compiled with /GL and /LTCG switched off

Option A: running only #1 (#2, #3, #4 commented out)
count: 10 50 100
#1: 0.7955 3.979 7.961
#1: 0.7957 3.981 7.997
#1: 0.8337 4.331 7.994

Option B: running #1, #2, #3, #4
count: 10 50 100
#1: 0.7576 3.771 7.517
#2: 0.75 3.742 7.499
#3: 0.7944 3.835 7.518
#4: 0.7524 3.705 7.483
#1: 0.7438 3.688 7.367
#2: 0.7357 3.664 7.37
#3: 0.7383 3.686 7.404
#4: 0.7541 3.761 7.48
#1: 0.7537 3.724 7.427
#2: 0.7537 3.731 7.454
#3: 0.7507 3.738 7.504
#4: 0.7526 3.72 7.853

2. Program compiled with /GL and /LTCG switched on

Option A: running only #1 (#2, #3, #4 commented out)
count: 10 50 100
#1: 0.6983 3.493 6.994
#1: 0.6994 3.493 6.981
#1: 0.7032 3.508 7.013

Option B: running #1, #2, #3, #4
count: 10 50 100
#1: 0.7027 3.519 6.995
#2: 0.6967 3.494 7
#3: 0.6969 3.491 7
#4: 0.7008 3.489 6.993
#1: 0.6984 3.508 7.001
#2: 0.6977 3.497 6.99
#3: 0.7027 3.492 6.99
#4: 0.6986 3.488 6.989
#1: 0.7008 3.498 7.016
#2: 0.6981 3.49 6.99
#3: 0.7016 3.495 6.985
#4: 0.6992 3.495 7.112


[/cpp]

So what helped in your suggestion was moving the implementations of FibTask, Spawner and SpawnTask to another .cpp file.

Now all the different options (i.e. #1, #2, #3, #4) perform equally, as you expected, and in all of them each task takes only about 0.070 s, as I wanted.

At least the problem is solved now. But I'm not clear on why moving the implementations to another .cpp file solved it.

Is this because of the task's vtable replication or something? I ran into that kind of problem before, when I wrapped the tbb::task class (by deriving from tbb::task) and overrode the virtual method note_affinity, providing the implementation in the .h file itself. In that case the solution came from the comments above the function void task::note_affinity( affinity_id ) defined in task.cpp.
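
For reference, the split looks roughly like this (only FibTask is shown; Spawner and SpawnTask were moved the same way):

[cpp]// FibTask.h -- declarations only
#pragma once
#include "tbb/task.h"

long SerialFib( long n );
long ParallelFib( long n );

struct FibTask : public tbb::task {
    long n;
    long x, y;
    long* sum;
    bool is_continuation;
    FibTask( long n_, long* sum_ );
    tbb::task* execute();   // body lives in FibTask.cpp, so it cannot be inlined into TestMain.cpp
};
[/cpp]

[cpp]// FibTask.cpp -- out-of-line definitions; SerialFib, ParallelFib and the
// constructor move here in the same way, with bodies unchanged from the first post.
#include "FibTask.h"

static const int CutOff = 16;

tbb::task* FibTask::execute() {
    if( is_continuation ) {
        *sum = x + y;
        return NULL;
    }
    if( n < CutOff ) {
        *sum = SerialFib(n);
        return NULL;
    }
    FibTask& a = *new( allocate_child()) FibTask( n-2, &x);
    FibTask& b = *new( allocate_child()) FibTask( n-1, &y);
    recycle_as_continuation();
    is_continuation = true;
    set_ref_count(2);
    spawn( b );
    return &a;
}
[/cpp]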

0 Kudos
Dmitry_Vyukov
Valued Contributor I
892 Views
Quoting - Shankar
At least the problem is solved now.


Cool!

Quoting - Shankar
But I'm not clear on why moving the implementations to another .cpp file solved it.

I think the compiler applies some optimization to FibTask::execute() based on some sophisticated condition.

0 Kudos
Reply