Intel® oneAPI Threading Building Blocks

Worse performance with fibonacci test on Core 2 Quad Q9550 than on Core 2 Duo E6850

gawik
Beginner
Hi,

I'm new to TBB. I compiled the fibonacci application included in the TBB package (examples/test_all).
I built it with Visual Studio 2008 in Release mode, using TBB 2.2.

I ran the application on a Core 2 Duo E6850 (with Windows 7) and on a Core 2 Quad Q9550 (with Windows XP).

The parameter passed to fibonacci was 50000.

For the serial tests, the results are:
Test          | Core 2 Duo E6850 | Core 2 Quad Q9550
Serial loop   | 5196 ms          | 5515 ms
Serial matrix | 92005 ms         | 99337 ms
Serial vector | 277422 ms        | 279433 ms
Serial queue  | 1032599 ms       | 1039019 ms

As you can see, the Core 2 Duo is faster than the Core 2 Quad. This seems logical, because the Core 2 Duo runs at 3 GHz and the Core 2 Quad at 2.83 GHz.
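To check how far clock speed alone accounts for the serial numbers, the measured time ratios can be compared with the ratio of the nominal clocks. The sketch below does only that arithmetic on the figures from the table above; `SerialRow`, `q9550_over_e6850`, `nominal_clock_ratio` and `print_serial_comparison` are hypothetical names introduced for illustration.

```cpp
#include <cstdio>

// One row of the serial results table; times are in ms as reported in the post.
struct SerialRow {
    const char* name;
    double e6850_ms;
    double q9550_ms;
};

// How many times longer the Q9550 took than the E6850
// (a value above 1.0 means the E6850 was faster).
double q9550_over_e6850(const SerialRow& row) {
    return row.q9550_ms / row.e6850_ms;
}

// Ratio of the nominal clock speeds quoted in the post (3 GHz vs 2.83 GHz).
double nominal_clock_ratio() {
    return 3.0 / 2.83;
}

// Prints each measured time ratio next to the nominal clock ratio.
void print_serial_comparison() {
    const SerialRow rows[] = {
        {"Serial loop",   5196.0,    5515.0},
        {"Serial matrix", 92005.0,   99337.0},
        {"Serial vector", 277422.0,  279433.0},
        {"Serial queue",  1032599.0, 1039019.0},
    };
    std::printf("nominal clock ratio: %.3f\n", nominal_clock_ratio());
    for (const SerialRow& row : rows) {
        std::printf("%-14s Q9550/E6850 = %.3f\n", row.name, q9550_over_e6850(row));
    }
}
```

Calling `print_serial_comparison()` from `main` prints the four ratios. The serial loop ratio sits close to the clock ratio, while the vector and queue ratios are nearer to 1, which suggests those tests depend on more than core frequency.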

For the tests with 1 thread, the results are:
Test                 | Core 2 Duo E6850 | Core 2 Quad Q9550
mutex                | 88570 ms         | 82464 ms
spin_mutex           | 64184 ms         | 62796 ms
queuing_mutex        | 61986 ms         | 68640 ms
Conc.Hashtable       | 2383970 ms       | 2161845 ms
Parallel while + for | 871339 ms        | 828461 ms
Parallel pipe/queue  | 1638076 ms       | 1234360 ms
Parallel reduce      | 115842 ms        | 120197 ms
Parallel scan        | 118224 ms        | 121967 ms
Parallel tasks       | 222433 ms        | 228112 ms

Here, the Q9550 is generally faster than the E6850. But the E6850 is faster for Parallel reduce/scan/tasks. Why?

For the tests with 2 threads, the results are:
Test                 | Core 2 Duo E6850 | Core 2 Quad Q9550
mutex                | 180791 ms        | 17630195 ms
spin_mutex           | 78605 ms         | 102566 ms
queuing_mutex        | 164113 ms        | 216254 ms
Conc.Hashtable       | 1422683 ms       | 2066588 ms
Parallel while + for | 495898 ms        | 556969 ms
Parallel pipe/queue  | 924551 ms        | 1409204 ms
Parallel reduce      | 65100 ms         | 61047 ms
Parallel scan        | 116001 ms        | 120051 ms
Parallel tasks       | 113971 ms        | 113051 ms

Here, we are surprised by the poor performance of the Q9550, especially in the mutex test!
The Q9550 does best in the Parallel reduce test. Why?
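The fibonacci example's own mutex test is not reproduced here. As a rough, independent way to see how a contended lock behaves on a given machine, the sketch below times several threads that all increment one counter under a single lock. It uses `std::mutex` and `std::thread` from the C++11 standard library as stand-ins for `tbb::mutex` and the TBB scheduler (an assumption, made so that the sketch has no TBB dependency; it needs a newer compiler than Visual Studio 2008). `contended_count` and `print_contention_timings` are hypothetical names.

```cpp
#include <chrono>
#include <cstdio>
#include <initializer_list>
#include <mutex>
#include <thread>
#include <vector>

// Splits total_increments across num_threads threads (integer division),
// each of which increments a shared counter while holding the same mutex.
// Returns the final counter value; the elapsed wall time in ms is written
// to *elapsed_ms when that pointer is not null.
long long contended_count(int num_threads, long long total_increments, double* elapsed_ms) {
    std::mutex m;
    long long counter = 0;
    const long long per_thread = total_increments / num_threads;

    const auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&m, &counter, per_thread] {
            for (long long i = 0; i < per_thread; ++i) {
                std::lock_guard<std::mutex> lock(m);  // every thread fights for this lock
                ++counter;
            }
        });
    }
    for (std::thread& w : workers) {
        w.join();
    }
    const auto stop = std::chrono::steady_clock::now();

    if (elapsed_ms != nullptr) {
        *elapsed_ms = std::chrono::duration<double, std::milli>(stop - start).count();
    }
    return counter;
}

// Runs the same total amount of work with 1, 2 and 4 threads.
void print_contention_timings() {
    const long long total = 4000000;  // arbitrary workload, divisible by 1, 2 and 4
    for (int threads : {1, 2, 4}) {
        double ms = 0.0;
        contended_count(threads, total, &ms);
        std::printf("%d thread(s): %.1f ms\n", threads, ms);
    }
}
```

If the elapsed time grows as threads are added even though the total work stays constant, the lock itself is the bottleneck. Whether that accounts for the Q9550 mutex figures above is not established in this thread.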


For the tests with 4 threads, the results are:
Test                 | Core 2 Duo E6850 | Core 2 Quad Q9550
mutex                | 171978 ms        | 13282170 ms
spin_mutex           | 73184 ms         | 134984 ms
queuing_mutex        | 578664 ms        | 733882 ms
Conc.Hashtable       | 1419792 ms       | 2134029 ms
Parallel while + for | 552766 ms        | 432449 ms
Parallel pipe/queue  | 1179087 ms       | 1837455 ms
Parallel reduce      | 64621 ms         | 32136 ms
Parallel scan        | 116889 ms        | 63675 ms
Parallel tasks       | 114261 ms        | 57079 ms

Here again, we are surprised by the poor performance of the Q9550, especially in the mutex test!
However, the Q9550 is faster in the Parallel reduce/scan/tasks tests, and its execution time is almost halved. So here it seems logical...
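The scaling of the three tests that do improve can be put into numbers by comparing the Q9550's 4-thread times with its 1-thread times. The sketch below computes speedup and parallel efficiency (speedup divided by thread count) from the figures posted above; `speedup`, `efficiency` and `print_q9550_scaling` are hypothetical names.

```cpp
#include <cstdio>

// Speedup of a parallel run relative to the 1-thread run of the same test.
double speedup(double one_thread_ms, double n_thread_ms) {
    return one_thread_ms / n_thread_ms;
}

// Parallel efficiency: fraction of ideal linear speedup achieved (1.0 is ideal).
double efficiency(double one_thread_ms, double n_thread_ms, int threads) {
    return speedup(one_thread_ms, n_thread_ms) / threads;
}

// Prints speedup and efficiency for the Q9550 going from 1 to 4 threads.
void print_q9550_scaling() {
    struct Row { const char* name; double t1_ms; double t4_ms; };
    const Row rows[] = {
        {"Parallel reduce", 120197.0, 32136.0},
        {"Parallel scan",   121967.0, 63675.0},
        {"Parallel tasks",  228112.0, 57079.0},
    };
    for (const Row& r : rows) {
        std::printf("%-16s speedup %.2f, efficiency %.2f\n",
                    r.name, speedup(r.t1_ms, r.t4_ms), efficiency(r.t1_ms, r.t4_ms, 4));
    }
}
```

On these figures, Parallel reduce and Parallel tasks come close to a fourfold speedup on the quad core, while Parallel scan gains roughly a factor of two.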


So, does anyone have any ideas about the poor performance observed?
Could you help me?

Thanks a lot

gawik
Vladimir_P_1234567890
Hi Gawik,

The fibonacci example is not the best example of scalability :) Here is a quotation from its index.html description:

"The purpose of the example is to exercise every include file and class in Threading Building Blocks. Most of the computations are deliberately silly and not expected to show any speedup on multiprocessors."

--Vladimir
Dmitry_Vyukov
Valued Contributor I
> Here, In general the Q9550 processor is better than E6850. But we can see that the E6850 is better for Parallel reduce/scan/tasks. Why ?

Because the E6850 has a higher frequency; and since the test is executed with only 2 threads, the 2 additional cores of the Q9550 do not matter.

Dmitry_Vyukov
Valued Contributor I
> Here, we are surprised by the bad performance on the Q9550 processor and especially for the mutex test !!

Nope, we are not.
ARCH_R_Intel
Employee

The fibonacci test is just a quick way to check that the package works. It was never meant to be a serious performance test.

For performance-oriented tests, look at the other tests, particularly:

  • examples/parallel_for/
  • examples/parallel_reduce/
  • examples/task_group/
The ones I didn't mention in examples/ are also written for performance, but tend to lose steam because of memory bandwidth or I/O issues.

Another program that uses TBB and was written for performance is my Seismic Duck, though I didn't try to make it scale past four cores, because that was enough to meet frame rate requirements. I've started a series of blog posts on its implementation and the programming patterns used to get high performance. To use it as a benchmark:
  • Press F to toggle display of the frame rate.
  • Disable the default frame rate limit of 60 frames/sec with these steps:
    1. Select View->Speed
    2. Move the "Frame Rate Limit" slider to infinity.
The frame rate is sensitive to monitor resolution, so rates are comparable only when measured at the same display resolution.
renorm
Beginner
In my experience sub_string_finder_extended.cpp scales almost linearly with the number of cores. Make sure you have the same optimization settings on all platforms. Somehow the included Visual Studio solution gets messed up during conversion, and optimization isn't enabled by default.
gawik
Beginner
Thank you very much for your answers. I will continue my investigations.