Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

For-loop performance, what's wrong?

tbbphreak
Beginner
363 Views
Hello,

I'm new to TBB and just started experimenting with it using the tutorials. My first test measures the performance of a simple loop over a big array of floats, once with TBB and once without. Comparing the time each approach took, the result surprised me. Try it yourself, and correct me if I'm doing something wrong:

#include "tbb/task_scheduler_init.h"
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include "tbb/tick_count.h"
#include <cstdio>

using namespace tbb;

#define BIGARRSIZE 100000

float big_arr[BIGARRSIZE];

void Foo(float *a)
{
    (*a)++;
}

class ApplyFoo {
public:
    void operator()(const blocked_range<size_t>& r) const {
        for (size_t i = r.begin(); i != r.end(); i++) {
            Foo(&big_arr[i]);
        }
    }
};

int main()
{
    tick_count t0, t1;
    int nthreads = 2;

    task_scheduler_init init(task_scheduler_init::deferred);
    if (nthreads >= 1)
        init.initialize(nthreads);

    t0 = tick_count::now();
    parallel_for(blocked_range<size_t>(0, BIGARRSIZE), ApplyFoo(), auto_partitioner());
    t1 = tick_count::now();
    printf("\n*** work took %g seconds ***", (t1 - t0).seconds());

    if (nthreads >= 1)
        init.terminate();

    t0 = tick_count::now();
    for (int i = 0; i < BIGARRSIZE; i++)
        Foo(&big_arr[i]);
    t1 = tick_count::now();
    printf("\n*** work took %f seconds ***", (t1 - t0).seconds());

    printf("\n");
    return 0;
}

Thanks.
2 Replies
tbbphreak
An update to this piece of code:

#include <cstdio>
#include <windows.h>
#include <omp.h>
#include "tbb/task_scheduler_init.h"
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include "tbb/tick_count.h"

using namespace tbb;

#define BIGARRSIZE 10000000

float big_arr[BIGARRSIZE];

void Foo(float *a)
{
    (*a)++;
}

class ApplyFoo {
public:
    void operator()(const blocked_range<size_t>& r) const {
        for (size_t i = r.begin(); i != r.end(); i++) {
            Foo(&big_arr[i]);
        }
    }
};

void ApplyFooRange(int start, int end)
{
    for (int i = start; i <= end; i++)
        Foo(&big_arr[i]);
}

DWORD WINAPI FooPart1(LPVOID param)
{
    ApplyFooRange(0, BIGARRSIZE / 2 - 1);
    return 0;
}

DWORD WINAPI FooPart2(LPVOID param)
{
    ApplyFooRange(BIGARRSIZE / 2, BIGARRSIZE - 1);
    return 0;
}

DWORD threadIDs[4];
HANDLE hThreads[4];

int main()
{
    tick_count t0, t1;
    int nthreads = 2;

    task_scheduler_init init(task_scheduler_init::deferred);
    if (nthreads >= 1)
        init.initialize(nthreads);

    t0 = tick_count::now();
    parallel_for(blocked_range<size_t>(0, BIGARRSIZE), ApplyFoo(), auto_partitioner());
    t1 = tick_count::now();
    printf("\n*** work took %g seconds ***", (t1 - t0).seconds());

    if (nthreads >= 1)
        init.terminate();

    t0 = tick_count::now();
    for (int i = 0; i < BIGARRSIZE; i++)
        Foo(&big_arr[i]);
    t1 = tick_count::now();
    printf("\n*** work took %f seconds ***", (t1 - t0).seconds());

    omp_set_num_threads(2);

    t0 = tick_count::now();
    // Each iteration touches a distinct element, so no atomic is needed.
    #pragma omp parallel for default(none) shared(big_arr)
    for (int i = 0; i < BIGARRSIZE; i++)
        Foo(&big_arr[i]);
    t1 = tick_count::now();
    printf("\n*** work took %f seconds ***", (t1 - t0).seconds());

    t0 = tick_count::now();
    hThreads[0] = CreateThread(NULL, 0, FooPart1, NULL, 0, &threadIDs[0]);
    hThreads[1] = CreateThread(NULL, 0, FooPart2, NULL, 0, &threadIDs[1]);

    WaitForMultipleObjects(2, hThreads, TRUE, INFINITE);

    t1 = tick_count::now();
    printf("\n*** work took %f seconds ***", (t1 - t0).seconds());

    CloseHandle(hThreads[0]);
    CloseHandle(hThreads[1]);

    printf("\n");
    return 0;
}

Try it yourself. Really impressive! :)
tbbphreak
I hope my conclusion helps. As the work inside the threaded loop becomes more complex, and the array size goes up by orders of magnitude, the performance gain becomes very apparent and TBB makes a real difference. Creating threads manually can be slightly faster, but as the code gets more complicated, TBB really shines. My vote: TBB is awesome!
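To illustrate the point about per-element cost, here is a minimal sketch using plain std::thread instead of TBB so it compiles without any extra libraries. `HeavyFoo` and the manual chunking below are my own made-up stand-ins for "a more complex inner loop"; the idea is that each element must cost enough to outweigh the overhead of dispatching work to threads:

```cpp
#include <cmath>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Heavier per-element work than a bare increment: parallelism only
// pays off once each call costs more than the scheduling overhead.
float HeavyFoo(float x)
{
    for (int k = 0; k < 100; ++k)
        x = std::sqrt(x * x + 1.0f);
    return x;
}

// Apply HeavyFoo to arr[start..end) on the calling thread.
void ApplyRange(std::vector<float>& arr, std::size_t start, std::size_t end)
{
    for (std::size_t i = start; i < end; ++i)
        arr[i] = HeavyFoo(arr[i]);
}

// Split the array across nthreads workers on disjoint ranges, so no
// synchronization (and no atomic) is needed inside the loop.
void ApplyParallel(std::vector<float>& arr, unsigned nthreads)
{
    std::vector<std::thread> workers;
    std::size_t chunk = arr.size() / nthreads;
    for (unsigned t = 0; t < nthreads; ++t) {
        std::size_t start = t * chunk;
        std::size_t end = (t + 1 == nthreads) ? arr.size() : start + chunk;
        workers.emplace_back(ApplyRange, std::ref(arr), start, end);
    }
    for (auto& w : workers)
        w.join();
}
```

Timing `ApplyParallel` against a serial `ApplyRange` over the whole array with a cheap body like the original `Foo` tends to show the threading overhead dominating; with something like `HeavyFoo` the parallel version should pull ahead, which matches what I saw as the loop body got heavier.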