Intel® Fortran Compiler

Best hardware for OpenMP application

Reinaldo_Garcia
Beginner

We are considering purchasing a computer to run very large parallelized applications based on OpenMP. We are looking for the optimal configuration in terms of processor (e.g., Core i7 vs. Xeon), cache memory, number of processors, etc.

Any suggestion or experiences you can share would be highly appreciated.

Thanks,

R

TimP
Honored Contributor III

The practical distinction between Core i7 and Xeon is the number of CPU packages. Core i7 is excellent for supporting up to 4 threads (8, if your application uses HyperThreading effectively); many systems support 12GB of RAM routinely, or even 24GB. Dual-socket Xeon 55xx platforms provide 8 cores and can support 16 threads with HyperThreading, with 48GB of RAM not an unusual configuration.

Within a few months, corresponding Westmere models with 12 cores will be available. At that point, for an application to be considered large it might require a Nehalem-EX system with 4 CPU packages of 6 or 8 cores each and 128 or 256GB of RAM. For OpenMP to run well on the larger numbers of packages and cores, increased attention to memory locality issues is needed.

peterklaver
Beginner

"For OpenMP to run well on the larger numbers of packages and cores, increased attention to memory locality issues is needed."

I hear you there. Is it reasonable to assume that each of the above configurations has its own, unique memory locality issues? Is it further reasonable to assume that these issues can be addressed through judicious setting of environment variables? Or, asked differently, is the choice of CPU configuration not especially important with respect to the OpenMP implementation AS LONG AS the memory locality issues are appropriately addressed?

P.S. The new forum software still needs work; I am replying to tim18, but it says I'm quoting the OP (which I could edit manually, but that's not the point).
Reinaldo_Garcia
Beginner

Thanks so much for your helpful insight. I have a question regarding your remark about memory issues when using larger numbers of cores. Does this mean that the same code that runs well on a quad-core processor would need to be modified to run well on a larger system with multiple processors?

Thanks again,

R

TimP
Honored Contributor III

If a great deal of foresight has gone into the organization, or the application is simple enough, it's certainly possible that an OpenMP application designed for a quad-core CPU will scale automatically, even to four 8-core CPUs, with appropriate environment-variable settings (KMP_AFFINITY for Intel OpenMP).

On the other hand, an application which works well on a single multi-core CPU in spite of false sharing, or even latent race conditions, is likely to be a disaster already on a dual-CPU system.

In an example I discussed earlier, simple use of OpenMP schedule(guided) balanced the work between threads well only up to 8 threads, but it was possible to scale effectively to at least 24 cores by writing explicit load balancing that kept each chunk local to the same core (and memory bank).
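To make the scheduling point concrete, a minimal sketch (the subroutine, arrays, and loop body are placeholders, not the application above); thread placement itself is controlled outside the code, e.g. KMP_AFFINITY=compact or KMP_AFFINITY=scatter with the Intel OpenMP runtime:

! Placeholder loop illustrating the scheduling clause.  schedule(guided)
! balances uneven work automatically, but chunk assignment can change from
! one pass to the next; schedule(static) gives each thread the same chunk
! every time, which keeps data local once several packages are involved.
subroutine update(a, b, n)
  implicit none
  integer, intent(in)    :: n
  real(8), intent(inout) :: a(n)
  real(8), intent(in)    :: b(n)
  integer :: i
  !$omp parallel do schedule(guided)
  do i = 1, n
     a(i) = a(i) + b(i)
  end do
  !$omp end parallel do
end subroutine update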

Grant_H_Intel
Employee

Reinaldo,

Much depends on the characteristics of your applications as well when choosing a large system. Some application domains have very little locality and large data sets, so memory bandwidth is crucial. For these applications, Xeon processors with 3-channel DDR3 would be best to maximize memory bandwidth. Other applications have inherently high data locality, so for these the memory system is less important (2-channel DDR3 is fine), but the size of the on-chip caches (especially the largest level) should be maximized. Hope this helps a bit!

- Grant

Reinaldo_Garcia
Beginner

Our applications are computationally and memory-intensive finite element models. Typical runs take from several hours to several days on a Core i7-920 processor. Parallelizing the code with OpenMP has made a huge difference on these processors, but I was wondering how much better performance we could expect using 2 Xeon processors instead of one Core i7.

Thanks

R//G

TimP
Honored Contributor III
Results from http://topcrunch.org typically show Xeon 5560 (dual quad-core) platforms giving 70% more performance than Core i7. On some of these applications, the 6-core CPUs give an additional 25%. Note that most results on that site are for MPI applications, which may gain more than OpenMP from additional sockets.
jimdempseyatthecove
Honored Contributor III

Reinaldo,

If your FE code has memory bandwidth issues (as opposed to memory capacity issues), and if you wish to address this with NUMA node locality, then you must invest some time in partitioning the data such that parts are allocated in a distributed manner across the NUMA nodes (and processed by HT threads from within those respective nodes). Using a Structure-of-Arrays layout can help too, as it takes better advantage of the SSE capabilities of the processor.

E.g., if your code is currently set up with objects (nodes) containing property variables, say Vec3 :: pos, then for a Structure-of-Arrays layout you do not allocate objects (nodes); rather, you allocate arrays of your property variables (pos, vel, acc, force, ...).
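A minimal sketch of the two layouts in Fortran (the type and property names are placeholders taken from the example above):

! Array-of-structures: each node object carries its own properties.
type :: node_t
   real(8) :: pos(3), vel(3), acc(3), force(3)
end type node_t
type(node_t), allocatable :: nodes(:)            ! nodes(1:n)

! Structure-of-arrays: one contiguous array per property.  Stored
! component-major, pos(:,1) holds all x components with unit stride, which
! vectorizes (SSE) more readily, and each array can be sliced and placed
! per NUMA node.
type :: state_t
   real(8), allocatable :: pos(:,:), vel(:,:)    ! shape (n, 3)
   real(8), allocatable :: acc(:,:), force(:,:)
end type state_t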

However, on a NUMA platform with NUMA allocation, you would pre-slice the property variable arrays by the number of NUMA nodes (2, 4, 8, ...) and allocate those slice domains within those nodes. Then construct your processing loops such that each slice gets preferential (or exclusive) processing by threads within its NUMA node.

This requires more programming work, but once you have done it for one major loop, it becomes close to a cut-and-paste operation to set up your next major loop.

Some of my FE runs (Space Elevator simulations) would take weeks to complete.

Jim Dempsey

TimP
Honored Contributor III

Yes, NUMA locality is among the factors that require attention for OpenMP to be fully effective on Xeon 55xx or Opteron platforms.

If the data are laid out so as to promote SSE vectorization, and if it is possible to arrange the first touch (e.g., when arrays are initialized) to be done by an OpenMP parallel loop of the same structure as the working loops, so that each section of an array is always accessed from the same processor and sharing of cache lines between processors is minimized, that should take care of it. Needless to say, this may be difficult to arrange in an existing application which was not designed for locality.
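A minimal sketch of that first-touch pattern (the routine and array names are placeholders; it also assumes threads are pinned, e.g. via KMP_AFFINITY, so the same threads run on the same cores in both loops):

subroutine first_touch_then_compute(a, work, n)
  implicit none
  integer, intent(in)  :: n
  real(8), intent(out) :: a(n)
  real(8), intent(in)  :: work(n)
  integer :: i

  ! Initialization loop: with a first-touch memory policy, each page of a()
  ! is allocated on the NUMA node of the thread that first writes to it.
  !$omp parallel do schedule(static)
  do i = 1, n
     a(i) = 0.0d0
  end do
  !$omp end parallel do

  ! Working loop with the identical static iteration split: each thread
  ! keeps accessing the pages it first touched, so traffic stays node-local
  ! and cache lines are not shared across processors.
  !$omp parallel do schedule(static)
  do i = 1, n
     a(i) = a(i) + work(i)
  end do
  !$omp end parallel do
end subroutine first_touch_then_compute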
