Parallel For correct usage

fanaticlatic · 08-13-2009

Hello folks,

I have recently been working on a particle system that uses TBB 2.1 for threading. The source and demo application can be found on my blog at:

http://thehinch.spaces.live.com

Currently the parallel version of the code runs about 5-10 FPS slower than the serial version.

I would like to know whether I have used the parallel_for construct correctly, and whether I should also be using a TBB container to store my particles, which are currently stored in a dynamically allocated array. As the array's size does not change over the lifetime of the demo, I saw no reason to implement it as a vector container.

My understanding is that parallel_for cuts the loop into chunks, which are then assigned to threads and run in parallel. I also believe the main application thread is used as well, and that once all chunks are complete the program runs serially again. If that were not the case, the slowdown could be put down to the lock placed on the particle vertex buffer during its update, which happens outside and after the parallel_for.

Please do download the code and have a play; any and all advice on how I can maximise the benefits of parallelism is greatly appreciated.

Thanks in advance,

Mark Hinchcliffe.

Anton_Pegushin · 08-14-2009

Hello Mark,

Your understanding of TBB's behavior is correct: it splits the work recursively, then lets worker threads execute the resulting tasks in parallel and balance the load by stealing tasks from one another. A slowdown relative to the serial run can have a number of causes: a suboptimal choice of threading model, oversubscription, excessive synchronization, use of blocking APIs, the wrong granularity of parallelism, or poor data locality and cache behavior. And these are just the obvious ones. I would suggest profiling the application using either Intel Amplifier or Intel Thread Profiler + VTune (both can be used freely for a 30-day evaluation period). To find where your parallel application is slower than the serial one, compare the corresponding results from the tools' analysis of each version. Beyond that, the tools may help you improve the performance of your parallel application even further.
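
To make the chunking and granularity points a bit more concrete, here is a minimal sketch of the general idea. It deliberately uses plain std::thread rather than TBB, so it is only a rough mental model and not how TBB is implemented: it splits the range statically and has no work stealing. The Particle struct, update_chunk, chunked_update and the grain parameter are all made up for illustration; the real demo's particle layout is not shown in this thread.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical particle type (an assumption, not the demo's real layout).
struct Particle {
    float position;
    float velocity;
};

// Serial update of one chunk [begin, end); this plays the role of the loop body.
void update_chunk(std::vector<Particle>& particles,
                  std::size_t begin, std::size_t end, float dt) {
    for (std::size_t i = begin; i < end; ++i) {
        particles[i].position += particles[i].velocity * dt;
    }
}

// Splits [0, n) into contiguous chunks of at least `grain` elements
// (when n >= grain), runs all but the last chunk on new threads, runs the
// last chunk on the calling thread, then waits for every chunk to finish.
void chunked_update(std::vector<Particle>& particles, float dt, std::size_t grain) {
    const std::size_t n = particles.size();
    if (n == 0) return;
    if (grain == 0) grain = 1;

    std::size_t hw = std::thread::hardware_concurrency();
    if (hw == 0) hw = 1;  // the value may be 0 if it cannot be determined

    // No more chunks than hardware threads, and no chunk smaller than grain.
    const std::size_t chunks = std::min(hw, std::max<std::size_t>(1, n / grain));
    const std::size_t chunk_size = (n + chunks - 1) / chunks;

    std::vector<std::thread> workers;
    std::size_t begin = 0;
    for (; begin + chunk_size < n; begin += chunk_size) {
        workers.emplace_back(update_chunk, std::ref(particles),
                             begin, begin + chunk_size, dt);
    }
    update_chunk(particles, begin, n, dt);  // the calling thread takes part too
    for (auto& w : workers) w.join();       // serial execution resumes after this
}
```

Calling chunked_update(particles, dt, 1000) once per frame would update the whole array and only return when every chunk is done. If the grain is small relative to the work per particle, the cost of starting and joining threads can easily outweigh the gain, which is one way a parallel version ends up slower than the serial one. TBB reuses a pool of worker threads instead of creating them per call, but the granularity trade-off is similar.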