I am using TBB 2.1 on Linux. I have observed that TBB containers run slower on a 64-bit machine than on a 32-bit machine. I am building TBB with the g++ compiler, and I used the -O3 -DNDEBUG compiler options to compile the benchmarking code. Is there anything I am missing? Or is there any way to increase the speed of TBB containers on a 64-bit machine?
Note: I have RHEL4 on the 32-bit machine and Fedora 11 on the 64-bit machine.
Thanking you in anticipation
There is not enough information to make any judgement or try any experiment. To the best of my knowledge, there are no known issues with the performance of containers on 64-bit machines compared to 32-bit ones, nor are there any special knobs to tune the performance.
As the performance of any application or benchmark depends on a number of factors, additional information is essential for any substantial help: the benchmarking code itself, the timings, and possibly some hardware characteristics (the number of CPUs, cores, etc.).
Thanks for your early reply.
I have gcc 3.4.6 and glibc 2.3.4 on the 32-bit (Intel Pentium) RHEL4 machine, and gcc 4.4.0 and glibc 2.10.1 on the 64-bit (Intel Pentium) Fedora 11 machine. Both machines are dual core. The benchmarking application simply inserts ascending numbers (say from 1 to 10000) into the int vector. It has a separate function Find() (written by me) which iterates through the vector using a const_iterator and returns the index of the element found in the vector. Suppose the vector has 10000 elements (from 1 to 10000); then I try to find the index of the value 5000 (the middle value) and 9999 (the last-but-one value), and measure the time taken by the Find() function to return the index of the value.
I think this is a very simple benchmark for concurrent_vector, but it still runs slower on the 64-bit machine.
Any clue, why is it so?
Hmm, this is even less clear than before the explanation: everything seems to favor the 64-bit setup.
May I ask what you mean by "slower"? How slow is it? How do you measure the performance? What are the exact times?
My Find function is very simple, it is as below.
IntVector iv; // global vector variable
int Find(int elementToFind)
{
    int index = 0;
    for (IntVector::const_iterator itr = iv.begin(); itr < iv.end(); ++itr)
        if (*itr == elementToFind)
            index = itr - iv.begin(); // position of the match within iv
    return index;
}
As explained before, the vector iv contains 10 lakh elements (i.e. the values 1 to 10,00,000). I am trying to find the index of the elements 5,00,000 and 9,99,999, and I am measuring the time taken by the Find() function to return the index of these values.
I am getting 3 to 4 times slower performance on the 64-bit machine. Time is measured using a tick_count variable.
The figures are as below (time in seconds).
                                                        32-bit machine   64-bit machine
Time to find index of middle element (5,00,000)         0.005269         0.0192848
Time to find index of last-but-one element (9,99,999)   0.010935         0.0407119
The results for the vector with cache_aligned_allocator are 3 to 8 times slower.
(i) Look at Alexey Kukanov's blog: http://software.intel.com/en-us/blogs/2008/03/04/why-a-simple-test-can-get-parallel-slowdown/ to see how to improve compiler optimization when using parallel_for.
(ii) Check (if you haven't yet) how parallel_for behaves when no concurrent_vector is used, with just a plain iteration.
(iii) The times you show are so short that the difference could be just a random fluctuation. Try to obtain a measurement that lasts at least a few seconds (better, minutes); then you can compare.
I hope at least one of these suggestions proves helpful.
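To make point (iii) concrete, here is a minimal sketch of a timing harness that repeats the search many times and reports the average per call, so that the total measured interval is long enough to be meaningful. It rests on assumptions not found in the thread: std::vector stands in for the TBB container, std::chrono::steady_clock stands in for tick_count, and the names find_index and average_seconds are hypothetical.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Simple linear search in the spirit of the Find() posted above.
// std::vector is only a stand-in for the container under test (assumption).
// Returns the index of the first match, or -1 if the value is absent.
int find_index(const std::vector<int>& v, int value) {
    for (std::size_t i = 0; i < v.size(); ++i)
        if (v[i] == value) return static_cast<int>(i);
    return -1;
}

// Runs the search `repeats` times and returns the average seconds per call.
// Each result is added to `sink` so the calls are less likely to be
// optimized away as unused work.
double average_seconds(const std::vector<int>& v, int value,
                       int repeats, long long& sink) {
    auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r)
        sink += find_index(v, value);
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count() / repeats;
}
```

One way to use it would be to fill a vector with the values 1 to 1000000, call average_seconds(v, 999999, 1000, sink) on both machines, and compare the averages across several runs rather than a single short measurement.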