How to organize OpenMP in different threads?

Wang_Wang · 01-30-2010

There are 3 to 8 threads in our project, working in parallel like a pipeline.
Now we want to move to OpenMP because poor synchronization is leading to low CPU efficiency.
How should we organize the number of worker threads in each pipeline thread?
Assume the number of cores is 8.

Michael_K_Intel2 · 01-31-2010

Hi!

I understand your question as: you want to stop using your own threading model, switch to OpenMP, and implement the pipeline with OpenMP. Unfortunately, OpenMP is not very good at pipeline parallelism. Although OpenMP supports "tasks" that you could use, you cannot describe dependencies between the tasks to ensure pipeline semantics.

If you're saying you've got a problem with poor synchronization, then please post some more information. Maybe one of the experts here can help you change the synchronization so that it works much better in your application.

Cheers,

-michael

Wang_Wang · 02-01-2010

Thank you, Michael.

We can't abandon the pipeline model in our project.

The problem is that CPU utilization is currently between 20% and 70% because of synchronization between threads.

So we are trying to use OpenMP inside each thread wherever possible, even if it is not well supported.

In this way, we expect CPU efficiency to improve and each thread to run faster.

In this scenario, will it work worse with OpenMP?

jimdempseyatthecove · 02-02-2010

Can you describe your data and processing flow in general terms?

OpenMP is a different parallelization technique from assigning a process to each thread. At the beginning of the application a pool of OpenMP threads is established automatically using defaults (typically the number of hardware threads), but the thread count can be overridden by an environment variable or programmatically (at the very start of the application).
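As a rough illustration of overriding the thread count, here is a minimal sketch in C. The function name team_size_with_request is made up for this example. The standard environment variable for the same purpose is OMP_NUM_THREADS, and omp_set_num_threads() and omp_get_num_threads() are the standard runtime calls. The _OPENMP guards are only there so the code still builds, and runs on one thread, when OpenMP is not enabled in the compiler.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: ask for `requested` threads, then report how many
 * threads actually formed the team of the next parallel region. */
int team_size_with_request(int requested)
{
    int observed = 1;           /* value reported when OpenMP is disabled */

    assert(requested > 0);
    (void)requested;            /* avoids an unused warning without OpenMP */

#ifdef _OPENMP
    /* Programmatic override; takes precedence over OMP_NUM_THREADS. */
    omp_set_num_threads(requested);
#endif

    #pragma omp parallel
    {
        /* Only the master thread writes, so there is no data race. */
        #pragma omp master
        {
#ifdef _OPENMP
            observed = omp_get_num_threads();
#endif
        }
    }
    return observed;
}
```

The runtime is allowed to hand out fewer threads than requested, so the returned value should be treated as "at most requested", not as a guarantee.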

In OpenMP, threads do not have processes associated with them. Instead, the application draws from the pool of OpenMP threads on demand. A pre-OpenMP 3.0 program typically looked like:

run serial
fan out to n threads in parallel
fan back to serial
fan out to parallel
fan back to serial

or diagrammed

run serial - fan out in parallel - fan in to serial - fan out to parallel - fan in to serial
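The fan-out/fan-in pattern above could be sketched in C roughly as follows. The function scale_then_sum and its arguments are invented for the illustration; each "parallel for" is one fan-out, and the implied barrier at the end of each loop is the fan back to serial.

```c
#include <assert.h>

/* Hypothetical example: serial -> parallel -> serial -> parallel -> serial */
double scale_then_sum(double *data, int n, double factor)
{
    double total = 0.0;

    assert(n >= 0);             /* run serial: validate the input */

    /* fan out to n threads in parallel */
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        data[i] *= factor;
    /* implied barrier: fan back to serial */

    /* serial section between the two parallel regions */
    if (n == 0)
        return 0.0;

    /* fan out to parallel again; each thread sums a private partial total */
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; ++i)
        total += data[i];
    /* implied barrier: fan back to serial */

    return total;
}
```

Compiled without OpenMP support the pragmas are ignored and the same code runs serially, which is a convenient way to check results.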

OpenMP 2.0 and earlier already had a clause (nowait) that permitted some overlap (if you were careful)

run serial - fan out in parallel
                       - back to serial - fan out to parallel
                                                 - back to serial

With nowait, the implied barrier at the end of a worksharing construct is removed, so a thread that finishes its share early is permitted to continue on while the other threads finish up. That thread may also enter the next worksharing construct (assuming no barriers in between).
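A minimal sketch of that overlap in C, with made-up names (fill_two, a, b): the two loops are independent, so a thread that finishes its part of the first loop can start on the second without waiting. This is only safe because the second loop never reads what the first loop writes, which is the "if you were careful" part.

```c
#include <assert.h>

/* Hypothetical example: two independent loops inside one parallel region. */
void fill_two(int *a, int *b, int n)
{
    assert(n >= 0);

    #pragma omp parallel
    {
        /* nowait drops the implied barrier at the end of this loop */
        #pragma omp for nowait
        for (int i = 0; i < n; ++i)
            a[i] = i * 2;

        /* Threads arrive here as soon as their own share above is done.
         * This loop touches only b, so it does not depend on a. */
        #pragma omp for
        for (int i = 0; i < n; ++i)
            b[i] = i + 100;
    }   /* barrier at the end of the parallel region: a and b are complete */
}
```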

OpenMP 3.0 added a feature called TaskQ, which permits somewhat asynchronous tasks to be run. This is similar to your current threading model, except that the OpenMP threads are not bound to the task (process). Any available thread in the OpenMP thread pool is a candidate for running the task.

There are other tasking techniques you should explore while you are at this transition point in your application. I suggest you look at Threading Building Blocks (TBB). This is a task-based system using a thread pool. It is very good at computationally oriented problems and has a broad range of platform support. I am unfamiliar with Cilk++, but it is worth researching, and maybe someone familiar with Cilk++ can present an argument for it here. You can also investigate a product I am involved with, QuickThread (www.quickthreadprogramming.com), which is currently available for Windows-based platforms.

When your application has significant I/O you might take a look at QuickThread, which is exceptionally good at combining I/O together with computational tasks using parallel pipelines.

As an example, consider the PARSEC Body Tracking benchmark (Princeton University) where it reads in image frames from multiple cameras from different perspectives and computes body position as a person walks through the field of view of the four cameras. This problem is an excellent example of why you would consider a parallel pipeline technique.

This benchmark used 261 frames of images captured by each of four cameras (1,044 total input frames). It combines the images from the four cameras and computes the body position of a person walking within the field of view of the cameras. A vector of the resulting body positions is written to output files (along with images if that option is selected). This is a classic case where a complete process parallel pipeline would work best. Let's look at the results:

Dell R610 with 2x Xeon 5570, 8 cores w/HT (16 threads)

Frames per second (x4 to get the number of image files processed per second)

Version          Frames/second    Speedup
Serial           1.404297905      (baseline)
OpenMP           7.706389512      5.487717x serial
TBB              9.325092         6.640394x serial
Win32 threads    9.066907524      6.456541x serial
QuickThread      12.0855714       8.606131x serial

Interestingly, TBB used its parallel pipeline too. However, unlike the TBB parallel pipeline, QuickThread uses two classes of threads (compute and I/O) and is NUMA aware, although in this case the NUMA awareness was not a major factor. The QuickThread parallel pipeline incorporates the I/O part of the process into the pipeline, whereas TBB did not (it cannot do so effectively at the current time). The difference is dramatic. Another surprising difference in coding was that QuickThread essentially ran the serial code in parallel on separate frames. This made the coding conversion substantially easier.

For more information on QuickThread you can look at my blog or website.

Jim Dempsey

Grant_H_Intel · 02-12-2010

One minor correction to Jim's post:

OpenMP 3.0 did not add TaskQ. Intel added it as an extension, for their compilers only, years before OpenMP 3.0 was released. OpenMP 3.0 did add a feature called Task, but that is apparently not what was used to derive the numbers in Jim's post, nor the numbers on Jim's blog. Intel's implementation of Task is more efficient than TaskQ and is standardized, so Task should be used instead of TaskQ. TaskQ is still provided for legacy codes.
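For readers who have not used the standardized Task construct, a minimal sketch in C follows. The function process_frames and the squaring "work" are placeholders made up for this example and are not taken from the benchmark code. One thread creates a task per item, and any idle thread in the team may pick a task up.

```c
#include <assert.h>

/* Hypothetical example: one OpenMP 3.0 task per input item. */
void process_frames(const int *frames, int *results, int n)
{
    assert(n >= 0);

    #pragma omp parallel
    {
        /* A single thread generates the tasks; the rest of the team
         * executes them as they become available. */
        #pragma omp single
        {
            for (int i = 0; i < n; ++i) {
                /* firstprivate(i) gives each task its own copy of i */
                #pragma omp task firstprivate(i) shared(frames, results)
                results[i] = frames[i] * frames[i];   /* placeholder work */
            }
        }
        /* The implied barrier at the end of "single" waits for all tasks. */
    }
}
```

Each task here writes a distinct element of results, so no extra synchronization is needed. Tasks that must run in a fixed order, as in a pipeline, would need additional coordination that this sketch does not show.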

- Grant