Re: Unknown OpenMP bottlenecks on 4 x Xeon 6-core (X7460)

malo_sasha · ‎04-22-2009

Hi,
We have an application that runs well on a server with 2 x quad-core Xeon (E5450). By running well, I mean that our main for loop does a lot of computation on a large in-memory array of structures (several hundred million elements) using OpenMP, and during execution all the cores are at 100% utilization.
To increase performance we bought a server with 4 x 6-core Xeon (X7450). Unfortunately the performance gain is small, because during execution each core is only 50%-70% utilized. We cannot identify the bottleneck. Could somebody help? Thanks.

TimP · ‎04-22-2009

The Xeon E5450 is rated at 3 GHz with a 1333 MHz FSB; it was one of the fastest 8-core platforms available until recently. Locality of OpenMP threads can be maintained effectively with KMP_AFFINITY and static (default) scheduling.
When you run more threads on a slower memory system, one possibility is load imbalance caused by threads waiting for data transfers. Intel OpenMP provides the openmp-profile link option, which should help locate the bottlenecks. I haven't heard of anyone testing OpenMP performance on Windows on such a platform. Certainly a 64-bit OS with plenty of RAM would be required. Those who have tried to use this platform for HPC (contrary to the specialized purposes for which it was designed) have been sorely disappointed.

malo_sasha · ‎04-23-2009

Hi,

Thanks for the quick answer.
We are on 64-bit Windows with plenty of RAM :). Playing with the chunk size of the static schedule does improve performance a little, but still not to the level we expected.
We tried the /Qopenmp_profile option to gather statistics through Intel Thread Profiler in OpenMP* mode, but we did not manage to get it working. It says "none of the intel thread profiler openmp* collector's module of interest have been linked to an openmp library". The OpenMP code is in a DLL, and we added it as an additional module to analyze.

Edit: the OpenMP code is inside a DLL.

TimP · ‎04-23-2009

If you link successfully with /Qopenmp_profile, it should not be necessary to use Thread Profiler. Running the build that is linked with /Qopenmp_profile should create a file, probably named guide.gvs, which can be examined in a text viewer or plotted by importing it into VTune. It works simply by linking the profiling version of libiomp5 or libguide, so I would hope that the .dll should not be a problem.
If you must use schedule(guided) or the like to improve performance, that is not an ideal solution. KMP_AFFINITY should be effective only in the initial phase of schedule(guided).
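A minimal sketch of the static scheduling discussed above, assuming a loop over a flat array. The `Item` type, its fields and the arithmetic are hypothetical stand-ins, since the real structure is not shown in this thread. With schedule(static) and no chunk size, each thread gets one contiguous block of iterations, so the same thread touches the same part of the array on every call, which is what lets thread pinning preserve locality.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record type; the real structure is not shown in the thread. */
typedef struct {
    double value;
    double result;
} Item;

/* Process every item with a static schedule: each thread receives one
   fixed, contiguous block of iterations. Without OpenMP enabled the
   pragma is ignored and the loop simply runs serially. */
void process_static(Item *items, long n)
{
    assert(items != NULL || n == 0);
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i) {
        items[i].result = 2.0 * items[i].value + 1.0;
    }
}
```

With the Intel runtime, the pinning itself is requested outside the code, for example by setting the environment variable KMP_AFFINITY=compact before launching; which setting works best on a 4-socket machine is something to measure rather than assume.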

jimdempseyatthecove · ‎04-23-2009

Sasha,

Could part of your bottleneck problem be the OpenMP programming paradigm itself? In my experience with OpenMP (since 2005), the paradigm tends to lead to either undersubscription or oversubscription of threads. It is challenging to distribute the work evenly in complex systems using OpenMP.

OpenMP does have its benefits - relatively easy to integrate and available in many compilers.

A different programming paradigm for you to consider is a task pooling system such as Intel Threading Building Blocks (TBB) http://software.intel.com/en-us/intel-tbb/ or a software tool I am working on, QuickThread (QT) http://software.intel.com/file/8639/ and there are a few others (Cilk, etc.).

If you have problems getting the QT document off the Intel site I can email you a current copy of the documents, or for that matter I could email you a Beta test kit.

Jim Dempsey
jim_dempsey@ameritech.net

malo_sasha · ‎04-23-2009

Thanks for the guide.gvs tip. I was not impressed by the result it produced; at least in our case it simply shows graphically what we had already understood with other tools. I'm currently evaluating Intel Thread Checker. That tool seems amazing, but it is very slow on a fully loaded test system.

That was exactly what we were presuming.
The problem is (I think) the shared array of structures (almost 5 GB in RAM), which OpenMP does not handle well at this thread count. But I cannot claim that a native thread implementation would do better, and the "relatively easy to integrate" feature of OpenMP is really important: in the past we used native threads, and that led us to "crappy trick-shot" code that was really hard to maintain, debug and even understand :).

The lower clock and bus speed could also have some impact, but not as much as what we are seeing (again, I think).

One implementation I have thought about is a "master" thread that gathers data from the main array and populates smaller private arrays that each thread works on. I don't yet know how to implement it in OpenMP, or whether it would reduce our bottlenecks.

Thanks for the links, I'll dig into TBB right now.

Regards

jimdempseyatthecove · ‎04-23-2009

Sasha,

It would be difficult for me to offer you specific suggestions without seeing your code (I am available for consultation if you are interested). But I can offer some general suggestions.

From your brief description, your data is organized as an Array Of Structures (AOS). An alternate format is Structure Of Arrays (SOA). While AOS organization is excellent for encapsulation, it can be poor for use of SIMD (SSE n.m, available on your X7460) or other streaming processing, e.g. GPGPU using CUDA (nVidia) or Brook+ (ATI). It may be unlikely that you would be willing to change the data organization from AOS to SOA unless there was a compelling reason to do so. Let's set the issue of AOS vs SOA aside for now.
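To make the AOS/SOA distinction concrete, here is a small sketch. The particle fields and function names are hypothetical, chosen only for illustration. Summing one field in the AOS layout strides over the unused fields of every object, while the SOA layout reads one contiguous array, which is the access pattern SIMD units and hardware prefetchers handle best.

```c
#include <assert.h>
#include <stddef.h>

/* Array of Structures: the fields of one object sit next to each other. */
typedef struct {
    float x, y, z;
    float mass;
} ParticleAoS;

/* Structure of Arrays: each field is its own contiguous array. */
typedef struct {
    float *x, *y, *z;
    float *mass;
    size_t count;
} ParticlesSoA;

float total_mass_aos(const ParticleAoS *p, size_t n)
{
    assert(p != NULL || n == 0);
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        sum += p[i].mass;      /* stride of sizeof(ParticleAoS) bytes */
    return sum;
}

float total_mass_soa(const ParticlesSoA *p)
{
    assert(p != NULL);
    float sum = 0.0f;
    for (size_t i = 0; i < p->count; ++i)
        sum += p->mass[i];     /* unit stride over one array */
    return sum;
}
```

Both functions compute the same value; only the memory traffic differs. Whether the difference matters depends on how many fields of each object the real loop actually touches.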

Your current system is the 4 x X7460 so your focus should be on:

1) Make best (better) use of SIMD (coding friendly to SSE use). Often minor changes to data organization have a major impact on the ability to take advantage of SIMD, and this impact is seen by all cores.

2) Reduce (eliminate if possible) the number of critical sections.

3) Reduce the number of ATOMIC operations (push them outside the loop whenever possible).

4) Reduce the number of times you start and stop parallel regions.

5) Partition the work into cache-friendly pieces and/or order. This will introduce slightly higher complexity in the code, but keeping data in cache makes a big difference in performance.

6) Make use of asynchronous programming when possible (TBB or QT).

7) Cache line align related sets of data (if X and Y are used in an expression, arrange for X and Y to be in the same cache line).

8) Double buffer data (to different cache lines) when data advances in time phases.

9) more....
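As an illustration of points 2) and 3), here is a sketch that counts elements above a threshold without any atomic or critical section inside the loop. The function name and the counting task are hypothetical; the technique is the standard OpenMP reduction clause, which gives each thread a private partial result and combines the partial results once at the end of the loop.

```c
#include <assert.h>
#include <stddef.h>

/* Count values above a threshold. Each thread increments its own private
   copy of `count`; the copies are summed once when the loop finishes, so
   the loop body contains no synchronization at all. */
long count_above(const double *data, long n, double threshold)
{
    assert(data != NULL || n == 0);
    long count = 0;
    #pragma omp parallel for reduction(+:count) schedule(static)
    for (long i = 0; i < n; ++i) {
        if (data[i] > threshold)
            ++count;
    }
    return count;
}
```

The naive alternative, a shared counter updated under `#pragma omp atomic` on every hit, makes all threads contend for one cache line, which is the kind of cost that grows when moving from 8 to 24 cores.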

An example of 7) used in conjunction with 8): assume you have a physics-type system of objects with
Position[3], Velocity[3], Acceleration[3], ExternalForce[3], Mass, ... Assume the objects interact.

The traditional method is:

Make one pass of all objects and calculate the interactions to produce an ExternalForce.

Make a second pass computing the Acceleration, delta Velocity, integration of delta Velocity into Velocity, delta Position, integration of delta Position into Position.

At a minor cost of keeping two copies of Position in the object (and if required two copies of V and A) you can consolidate the two passes into one pass (and potentially remove ExternalForce[3] from the object). Remember to place the 2nd copy(s) such that they reside in a different cache line than the 1st copy.

The outermost loop would have a flag indicating which set is the current working set.
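A minimal sketch of this double-buffering idea, under simplifying assumptions that are not from the thread: one dimension instead of three, a hypothetical spring-like interaction with the two neighbouring objects only, and the two Position copies held in two separate arrays (which also keeps them in different cache lines). Every thread reads only the current copy and writes only the other copy, so the force calculation and the integration fit in one pass without a race.

```c
#include <assert.h>
#include <stddef.h>

/* Per-object state other than position (hypothetical layout). */
typedef struct {
    double vel;
    double mass;
} BodyState;

/* One time step: read pos_cur for every object, write pos_next. */
static void step(const double *pos_cur, double *pos_next,
                 BodyState *b, long n, double dt)
{
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i) {
        double here  = pos_cur[i];
        double left  = (i > 0)     ? pos_cur[i - 1] : here;
        double right = (i < n - 1) ? pos_cur[i + 1] : here;
        /* Assumed interaction: unit-stiffness spring to each neighbour. */
        double force = (left - here) + (right - here);
        b[i].vel   += (force / b[i].mass) * dt;
        pos_next[i] = here + b[i].vel * dt;
    }
}

/* Run `steps` time steps, flipping the working-set flag each step.
   Returns the index (0 or 1) of the buffer holding the newest positions. */
int simulate(double *pos[2], BodyState *b, long n, double dt, int steps)
{
    assert(pos != NULL && b != NULL);
    int cur = 0;  /* flag: which position set is the current working set */
    for (int s = 0; s < steps; ++s) {
        step(pos[cur], pos[1 - cur], b, n, dt);
        cur = 1 - cur;
    }
    return cur;
}
```

The cost is one extra copy of Position per object; ExternalForce no longer needs to be stored because it is consumed in the same pass that computes it.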

Depending on what is being done with your objects, it may be advantageous for cache use to change your processing sequence from

each thread taking a subset of all objects and then sequentially processing the objects in the subset

to

each thread taking a subset of all objects, then taking a smaller sub-subset and sequentially processing the objects in that sub-subset (where the sub-subset is sized to fit within L1 cache).
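The sub-subset idea can be sketched as a blocked loop. The two passes and their arithmetic are hypothetical placeholders for real work, and the block size is an assumption: 1024 doubles is 8 KB, chosen to sit comfortably inside a 32 KB L1 data cache. Each thread finishes all passes on one block while that block is still cache-resident before moving to the next.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed block size: 1024 doubles = 8 KB. Tune for the real element size. */
enum { BLOCK = 1024 };

/* Apply two passes to every element, one cache-sized block at a time. */
void process_blocked(double *data, long n)
{
    assert(data != NULL || n == 0);
    #pragma omp parallel for schedule(static)
    for (long start = 0; start < n; start += BLOCK) {
        long end = (start + BLOCK < n) ? start + BLOCK : n;

        /* Pass 1 brings the block into cache. */
        for (long i = start; i < end; ++i)
            data[i] = data[i] * 2.0;

        /* Pass 2 reuses the block while it is still in L1. */
        for (long i = start; i < end; ++i)
            data[i] = data[i] + 1.0;
    }
}
```

Running the two passes over the whole array instead would stream several gigabytes through the memory bus twice, which hurts more on a machine where 24 cores share that bus.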

Best regards,

Jim Dempsey
jim_dempsey@ameritech.net

Alain_D_Intel · ‎04-29-2009

Everything people have said is true, but let's be pragmatic, even without the code.

1) It doesn't seem to be a bandwidth bottleneck, as your CPUs are not working at 100% as before
==> let's have more threads by increasing OMP_NUM_THREADS to 48 (e.g.) and let's see the result on CPU activity
==> it should increase

2) I really don't like it when the performance metric is % of CPU usage ==> do you have another metric?
==> with 1) done, do you see timings going down?

3) If yes ==> OK, you win (try to find the best oversubscription with KMP_LIBRARY=turnaround (e.g.))
If no ==> we'll need to go further into the scalability figures of your application, starting with a description of the code
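A small sketch of measuring wall-clock time instead of CPU usage, as suggested in 2). The `kernel` function is a hypothetical stand-in for the real main loop, and the helper names are made up for illustration. `omp_get_wtime` and `omp_set_num_threads` are standard OpenMP runtime calls; the `clock()` fallback is only there so the sketch still builds without OpenMP.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Seconds from a monotonic-enough source for coarse timing. */
static double wall_seconds(void)
{
#ifdef _OPENMP
    return omp_get_wtime();
#else
    return (double)clock() / CLOCKS_PER_SEC;  /* serial build: CPU time */
#endif
}

/* Hypothetical stand-in for the application's main loop. */
double kernel(const double *data, long n)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum) schedule(static)
    for (long i = 0; i < n; ++i)
        sum += data[i] * data[i];
    return sum;
}

/* Run the kernel with a given thread count; return elapsed seconds. */
double time_kernel(const double *data, long n, int num_threads,
                   double *result)
{
    assert(result != NULL && num_threads > 0);
#ifdef _OPENMP
    omp_set_num_threads(num_threads);
#else
    (void)num_threads;
#endif
    double t0 = wall_seconds();
    *result = kernel(data, n);
    return wall_seconds() - t0;
}
```

Calling `time_kernel` with 8, 24 and 48 threads on the same data and comparing the elapsed times answers the "do timings go down?" question directly, independent of what the CPU usage graph shows.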

Hope this helps.