Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Performance issue


I have an Intel Core i7 CPU with 4 cores, running 64-bit Linux.

I compiled and ran the Fibonacci example shipped with TBB (./fibonacci 1000), and the multithreaded version always takes longer than the sequential one.
Moreover, the version with 4 threads takes longer than the version with 2 threads...

(In fact, I have the same problem with my own code, and since I don't understand what is happening, I tried the TBB example.)

Any idea where the problem is?



Fibonacci numbers example. Generating 1000 numbers..
Serial loop - in 0.940804 msec
Serial matrix - in 5.613300 msec
Serial vector - in 22.286680 msec
Serial queue - in 46.238315 msec

Threads number is 1
Shared serial (mutex) - in 9.664982 msec
Shared serial (spin_mutex) - in 4.719505 msec
Shared serial (queuing_mutex) - in 11.338451 msec
Shared serial (Conc.HashTable) - in 162.153952 msec
Parallel while+for/queue - in 59.207958 msec
Parallel pipe/queue - in 90.881638 msec
Parallel reduce - in 4.703301 msec
Parallel scan - in 4.496614 msec
Parallel tasks - in 32.981415 msec

Threads number is 2
Shared serial (mutex) - in 40.650196 msec
Shared serial (spin_mutex) - in 8.217491 msec
Shared serial (queuing_mutex) - in 61.209658 msec
Shared serial (Conc.HashTable) - in 285.627157 msec
Parallel while+for/queue - in 64.834662 msec
Parallel pipe/queue - in 197.587403 msec
Parallel reduce - in 3.358554 msec
Parallel scan - in 5.146414 msec
Parallel tasks - in 18.081198 msec

Threads number is 4
Shared serial (mutex) - in 77.744175 msec
Shared serial (spin_mutex) - in 13.738133 msec
Shared serial (queuing_mutex) - in 95.684056 msec
Shared serial (Conc.HashTable) - in 232.317561 msec
Parallel while+for/queue - in 55.820653 msec
Parallel pipe/queue - in 188.782422 msec
Parallel reduce - in 5.060632 msec
Parallel scan - in 8.407837 msec
Parallel tasks - in 14.007576 msec
Fibonacci number #1000 modulo 2^64 is 817770325994397771

3 Replies
This example is not intended for performance and scalability measurements; please see what index.html says:


This directory contains an example that computes Fibonacci numbers in several different ways. The purpose of the example is to exercise every include file and class in Threading Building Blocks. Most of the computations are deliberately silly and not expected to show any speedup on multiprocessors.
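To illustrate why the "Shared serial (mutex)" rows in the output above get slower as threads are added, here is a minimal sketch. It assumes plain std::thread and std::mutex rather than TBB, and it is not taken from the Fibonacci example's source; the function names are hypothetical. In the first function every thread takes the same lock on every step, so the work stays serial and extra threads mainly add contention. In the second, each thread works on private data and publishes once.

```cpp
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper: every thread increments one shared counter under a mutex.
// The lock serializes the increments, so more threads mean more contention,
// not more throughput.
std::uint64_t sum_with_shared_mutex(unsigned num_threads, std::uint64_t per_thread) {
    std::uint64_t total = 0;
    std::mutex m;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            for (std::uint64_t i = 0; i < per_thread; ++i) {
                std::lock_guard<std::mutex> lock(m);
                total += 1;
            }
        });
    }
    for (auto& w : workers) w.join();
    return total;
}

// Hypothetical helper: each thread accumulates into a private variable and
// writes its result once; the partial results are combined after the join.
std::uint64_t sum_with_local_partials(unsigned num_threads, std::uint64_t per_thread) {
    std::vector<std::uint64_t> partial(num_threads, 0);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&partial, t, per_thread] {
            std::uint64_t local = 0;
            for (std::uint64_t i = 0; i < per_thread; ++i) local += 1;
            partial[t] = local;
        });
    }
    for (auto& w : workers) w.join();
    std::uint64_t total = 0;
    for (std::uint64_t p : partial) total += p;
    return total;
}
```

Both functions return num_threads * per_thread; only the synchronization pattern differs. Timing them with 1, 2 and 4 threads is one way to see the effect on a given machine.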
OK, thanks. But how can I solve my problem?

I need to measure, on the current machine, the gain TBB provides over the sequential method, and to study the effect of the number of threads on the result.

I have code that uses TBB, and I compare the parallel version with the sequential one.
On a Core 2 Duo, the parallel version is 1.8 times faster than the sequential one;
on my new computer (Intel Core i7 CPU), the same code is 1.5 times slower than the sequential version.

I need to understand the reason, and for that I would like to test another code to see whether the problem comes from my code or from the TBB/CPU combination.


PS: moreover, the paper given here shows a result that clearly links the number of threads to the speedup on Fibonacci, which is quite a normal result...

The referenced paper is not about that TBB example, which contains about half a dozen _different_ implementations and, as Anton correctly said, was not designed for any performance/scalability studies. If you wish to check how TBB works on Core i7, you may take another example, e.g. parallel_reduce/primes.
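If it helps to separate "my code" from "TBB on this CPU", a TBB-free baseline can be useful alongside the primes example. The following is a minimal sketch, assuming plain std::thread instead of TBB; it is not the code of the TBB primes example, and all names in it are hypothetical. It counts primes by trial division, which is CPU-bound and shares no state between threads, so it would be expected to scale with the number of physical cores.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <thread>
#include <vector>

// Trial-division primality test: deliberately CPU-bound, no shared state.
bool is_prime(std::uint64_t n) {
    if (n < 2) return false;
    for (std::uint64_t d = 2; d * d <= n; ++d)
        if (n % d == 0) return false;
    return true;
}

// Count primes in [2, limit] using num_threads workers.
// Blocks of consecutive numbers are dealt out round-robin so that every
// worker gets a similar mix of cheap (small) and expensive (large) candidates.
std::uint64_t count_primes_parallel(std::uint64_t limit, unsigned num_threads) {
    if (num_threads == 0) num_threads = 1;
    const std::uint64_t block = 1024;  // assumed block size, not tuned
    std::vector<std::uint64_t> partial(num_threads, 0);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&partial, t, num_threads, limit, block] {
            std::uint64_t local = 0;
            for (std::uint64_t begin = 2 + t * block; begin <= limit;
                 begin += num_threads * block) {
                const std::uint64_t end = std::min(begin + block, limit + 1);
                for (std::uint64_t n = begin; n < end; ++n)
                    if (is_prime(n)) ++local;
            }
            partial[t] = local;  // each worker writes its own slot exactly once
        });
    }
    for (auto& w : workers) w.join();
    std::uint64_t total = 0;
    for (std::uint64_t p : partial) total += p;
    return total;
}

// Wall-clock time in milliseconds for one run; the count is returned via result.
double time_count_ms(std::uint64_t limit, unsigned num_threads, std::uint64_t& result) {
    const auto start = std::chrono::steady_clock::now();
    result = count_primes_parallel(limit, num_threads);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Calling time_count_ms with a large limit (say, a few million) for 1, 2 and 4 threads and comparing the times gives a rough scaling curve. If this baseline scales on the Core i7 but the TBB code does not, the application's own use of shared data or its task granularity is the more likely suspect.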

Without knowing anything about your application, nobody can really help you with its problems. At best, you might get a list of typical scalability issues (already discussed multiple times on this forum).