Performance issue

gdamiand · ‎10-22-2010

Hi,

I have an Intel Core i7 CPU with 4 cores, running a 64-bit Linux system.

I compiled and ran the Fibonacci example shipped with TBB (./fibonacci 1000), and the multithreaded version is always slower than the sequential one.
Moreover, the version with 4 threads is slower than the version with 2 threads...

(In fact, I have the same problem with my own code, and since I don't understand what is happening, I tried the TBB example.)

Any idea where the problem is?

Regards

Guillaume

RESULTS:
===========
Fibonacci numbers example. Generating 1000 numbers..
Serial loop - in 0.940804 msec
Serial matrix - in 5.613300 msec
Serial vector - in 22.286680 msec
Serial queue - in 46.238315 msec

Threads number is 1
Shared serial (mutex) - in 9.664982 msec
Shared serial (spin_mutex) - in 4.719505 msec
Shared serial (queuing_mutex) - in 11.338451 msec
Shared serial (Conc.HashTable) - in 162.153952 msec
Parallel while+for/queue - in 59.207958 msec
Parallel pipe/queue - in 90.881638 msec
Parallel reduce - in 4.703301 msec
Parallel scan - in 4.496614 msec
Parallel tasks - in 32.981415 msec

Threads number is 2
Shared serial (mutex) - in 40.650196 msec
Shared serial (spin_mutex) - in 8.217491 msec
Shared serial (queuing_mutex) - in 61.209658 msec
Shared serial (Conc.HashTable) - in 285.627157 msec
Parallel while+for/queue - in 64.834662 msec
Parallel pipe/queue - in 197.587403 msec
Parallel reduce - in 3.358554 msec
Parallel scan - in 5.146414 msec
Parallel tasks - in 18.081198 msec

Threads number is 4
Shared serial (mutex) - in 77.744175 msec
Shared serial (spin_mutex) - in 13.738133 msec
Shared serial (queuing_mutex) - in 95.684056 msec
Shared serial (Conc.HashTable) - in 232.317561 msec
Parallel while+for/queue - in 55.820653 msec
Parallel pipe/queue - in 188.782422 msec
Parallel reduce - in 5.060632 msec
Parallel scan - in 8.407837 msec
Parallel tasks - in 14.007576 msec
Fibonacci number #1000 modulo 2^64 is 817770325994397771

Anton_M_Intel · ‎10-22-2010

This example is not intended for performance and scalability measurements; please see what index.html says:

[xml]Overview
This directory contains an example that computes Fibonacci numbers in several
different ways. The purpose of the example is to exercise every include file
and class in Threading Building Blocks.
Most of the computations are deliberately silly and not expected to
show any speedup on multiprocessors.[/xml]

gdamiand · ‎10-22-2010

OK, thanks. But how can I solve my problem?

I need to measure, on the current machine, the gain that TBB provides over the sequential method, and to study the effect of the number of threads on the result.

Indeed, I have code that uses TBB, and I compare the parallel version with the sequential one.
On a Core 2 Duo, the parallel version is 1.8 times faster than the sequential one;
on my new computer (Intel Core i7 CPU), the same code is 1.5 times slower than the sequential version.

I need to understand the reason, and for that I would like to test other code to see whether the problem comes from my code or from the TBB/CPU combination.

Thanks

PS: Moreover, the paper at http://www.intel.com/technology/itj/2007/v11i4/5-foundations/6-results.htm shows a result that clearly links the number of threads to the speedup on Fibonacci, which is quite a normal result...

Alexey-Kukanov · ‎10-23-2010

The referenced paper is not about that TBB example, which contains about half a dozen _different_ implementations, and as Anton correctly said, it was not designed for any performance/scalability studies. If you wish to check how TBB works on Core i7, you may take another example, e.g. parallel_reduce/primes.

Without knowing anything about your application, nobody can really help you with its problems. At best, you will maybe get a list of typical scalability issues (already discussed multiple times on this forum).