tbb slow on 64-bit machine

nitinsayare · ‎09-30-2009

Hi Bartlomiej,

I am using TBB 2.1 on Linux. I have observed that TBB containers run slower on a 64-bit machine than on a 32-bit machine. I am building TBB with the g++ compiler, and I used the -O3 -DNDEBUG compiler options to compile the benchmarking code. Is there anything I am missing? Or is there any way to increase the speed of TBB containers on a 64-bit machine?

Note: I have RHEL4 on the 32-bit machine and Fedora 11 on the 64-bit machine.

Thanking you in anticipation
Nitin Sayare

Alexey-Kukanov · ‎10-03-2009

Nitin,

There is not enough information to make any judgement or try any experiment. To the best of my knowledge, there are no known issues with the performance of containers on 64-bit machines compared to 32-bit, nor are there special knobs to throttle the performance.
As the performance of any app or benchmark depends on a number of factors, additional info is essential for any substantial help: the benchmarking code itself, the timings, and possibly some HW characteristics (the number of CPUs, cores, etc).

Bartlomiej · ‎10-05-2009

Quoting - Alexey Kukanov (Intel)

...the benchmarking code itself, the timings, possibly some HW characteristics (the number of CPUs, cores, etc).

In particular: what version of GCC do you have on both systems, and what version of glibc is used there? It is probably not the cause, as Fedora 11 has rather up-to-date libraries, but it is still worth checking what you are comparing with what. The glibc version has a big influence on memory allocation/deallocation efficiency, in particular for multithreaded computing.
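One way to collect these details on both machines is a tiny program like the sketch below. It is only an illustration: describe_toolchain() is a hypothetical helper name, __VERSION__ is a macro predefined by GCC, and gnu_get_libc_version() is a glibc-specific function, so the glibc part is guarded and is skipped on other C libraries.

```cpp
#include <cstdio>
#include <sstream>
#include <string>
#if defined(__GLIBC__)
#include <gnu/libc-version.h>  // glibc-only header declaring gnu_get_libc_version()
#endif

// Hypothetical helper: builds a one-line summary of the compiler, the
// pointer width, and the glibc version this binary is running against.
std::string describe_toolchain() {
    std::ostringstream out;
#if defined(__VERSION__)
    out << "compiler: " << __VERSION__ << "; ";
#endif
    // 32 on a 32-bit build, 64 on a 64-bit build.
    out << "pointer bits: " << sizeof(void*) * 8 << "; ";
#if defined(__GLIBC__)
    out << "glibc (runtime): " << gnu_get_libc_version();
#else
    out << "glibc: not in use";
#endif
    return out.str();
}
```

Printing the result on each machine, for example with std::puts(describe_toolchain().c_str()), gives one line per system that can be pasted into the thread.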

nitinsayare · ‎10-08-2009

Hi All,

Thanks for your early reply.

I have gcc 3.4.6 and glibc 2.3.4 on the 32-bit (Intel Pentium(R)) RHEL4 machine, and gcc 4.4.0 and glibc 2.10.1 on the 64-bit (Intel Pentium(R)) Fedora 11 machine. Both machines are dual core. The benchmarking application simply inserts ascending numbers (say from 1 - 10000) into the int vector. It has a separate function Find() (written by me) which iterates through the vector using a const_iterator and returns the index of the element found in the vector. Suppose the vector has 10000 elements (from 1 - 10000); then I try to find the index of the value 5000 (middle value) and 9999 (last but one value), and measure the time taken by the Find() function to return the index of the value.

I think this is very simple benchmarking code for concurrent_vector, but it still runs slowly on the 64-bit machine.

Any clue, why is it so?

Thanking you
Nitin Sayare

Bartlomiej · ‎10-08-2009

Quoting - nitinsayare

Any clue, why is it so?

Hmm, I have even less of a clue than before the explanation: everything seems better on the 64-bit system.
Might I still ask what you mean by "slower"? How slow is it? How do you measure the performance? What are the exact times?

Regards

nitinsayare · ‎10-09-2009

Hi all,

My Find function is very simple; it is shown below.

typedef concurrent_vector<int> IntVector; // element type is int, as described above; any allocator argument was lost from the post
IntVector iv; //global vector variable.

// Returns the index of elementToFind in iv, or 0 if it is not present.
int Find(int elementToFind)
{
    int index = 0;
    for (IntVector::const_iterator itr = iv.begin(); itr < iv.end(); ++itr)
    {
        if (*itr == elementToFind)
        {
            index = itr - iv.begin();
            break;
        }
    }
    return index;
}

As explained before, the vector iv contains 10 lakh (1,000,000) elements (i.e. from 1 - 1,000,000). Now, I am trying to find the index of the elements 500,000 and 999,999. I am measuring the time taken by the Find() function to return the index of these values.
I am getting 3 - 4 times slower performance on the 64-bit machine. Time is measured using a tick_count variable.

The figures are as below. Time (in sec):

                                                 32-bit machine   64-bit machine
Index of middle element (value 500,000)          0.005269         0.0192848
Index of last but one element (value 999,999)    0.010935         0.0407119

The results for the vector with cache_aligned_allocator are 3 - 8 times slower.

Thanks
Nitin Sayare

Bartlomiej · ‎10-10-2009

Well, I don't know precisely, but here's what I can suggest.

(i) Look at Alexey Kukanov's blog: http://software.intel.com/en-us/blogs/2008/03/04/why-a-simple-test-can-get-parallel-slowdown/ to see how to improve compiler optimization when using parallel_for.
(ii) Check (if you haven't yet) how parallel_for behaves when no concurrent_vector is used, just a plain iteration.
(iii) The times you show are so short that the difference could be just a random fluctuation. Try to obtain something that lasts at least a few seconds (better minutes); then you can compare.
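For (iii), a minimal sketch of what I mean is below. It is not the original benchmark: std::vector stands in for concurrent_vector and std::chrono for tick_count so that it compiles without TBB (it needs a C++11 compiler), and the names find_index and average_find_seconds are made up for illustration. The idea is simply to repeat the search many times and report the average per call.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Linear search mirroring the Find() from the post; like the original,
// it returns 0 when the value is absent.
std::size_t find_index(const std::vector<int>& v, int value) {
    std::size_t index = 0;
    for (std::vector<int>::const_iterator it = v.begin(); it != v.end(); ++it) {
        if (*it == value) {
            index = static_cast<std::size_t>(it - v.begin());
            break;
        }
    }
    return index;
}

// Runs find_index() `repeats` times and returns the average seconds per call.
double average_find_seconds(const std::vector<int>& v, int value, int repeats) {
    // The target is re-read through a volatile on every iteration and the
    // result is stored to a volatile, which discourages the compiler from
    // hoisting or discarding the repeated searches.
    volatile int target = value;
    volatile std::size_t sink = 0;
    const std::chrono::steady_clock::time_point start =
        std::chrono::steady_clock::now();
    for (int i = 0; i < repeats; ++i) {
        sink = sink + find_index(v, target);
    }
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count() / repeats;
}
```

Choose repeats large enough that the whole loop runs for several seconds on both machines, then compare the per-call averages rather than a single call of a few milliseconds.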

I hope at least one of my suggestions proves helpful.
Best regards