Intel® Fortran Compiler

Best hardware for OpenMP application

Reinaldo_Garcia
Beginner

We are considering purchasing a computer to run very large parallelized applications based on OpenMP. We are looking for the optimal configuration in terms of processor (e.g., Core i7 vs. Xeon), cache memory, number of processors, etc.

Any suggestion or experiences you can share would be highly appreciated.

Thanks,

R

TimP
Honored Contributor III

The practical distinction between Core i7 and Xeon is the number of CPU packages. Core i7 is excellent for supporting up to 4 threads (8, if your application uses HyperThreading effectively); many systems support 12GB of RAM routinely, or even 24GB. Dual-socket Xeon 55xx platforms provide 8 cores and can support 16 threads with HyperThreading, with 48GB of RAM not an unusual configuration.

Within a few months, corresponding Westmere models with 12 cores will be available. At that point, for an application to be considered large it might require a Nehalem-EX system with 4 CPU packages of 6 or 8 cores each and 128 or 256GB of RAM. For OpenMP to run well on the larger numbers of packages and cores, increased attention to memory locality issues is needed.

peterklaver
Beginner

"For OpenMP to run well on the larger numbers of packages and cores, increased attention to memory locality issues is needed."

I hear you there. Is it reasonable to assume that each of the above configurations has its own, unique memory locality issues? Is it further reasonable to assume that these issues can be addressed through judicious setting of environment variables? Or, asked differently, is the choice of CPU configuration not especially important with respect to the OpenMP implementation AS LONG AS the memory locality issues are appropriately addressed?

P.S. The new forum software still needs work; I am replying to tim18, but it says I'm quoting the OP (which I could edit manually, but that's not the point).
Reinaldo_Garcia
Beginner

Thanks so much for your helpful insight. I have a question regarding your remark about memory issues when using larger numbers of cores. Does this mean that the same code that runs well on a quad-core processor would need to be modified to run well on a larger system with multiple processors?

Thanks again,

R

TimP
Honored Contributor III

If a great deal of foresight has gone into the organization, or the application is simple enough, it's certainly possible that an OpenMP application designed for a quad-core CPU will scale automatically, even to four 8-core CPUs, with appropriate environment-variable settings (KMP_AFFINITY for Intel OpenMP).

On the other hand, an application which works well on a single multi-core CPU in spite of false sharing, or even latent race conditions, is likely to be a disaster already on a dual-CPU system.

In an example I discussed earlier, simple use of OpenMP schedule(guided) balanced the work between threads well only up to 8 threads, but it was possible to scale effectively to at least 24 cores by writing explicit load balancing that kept each chunk local to the same core (and memory bank).
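To make the scheduling point concrete, a minimal sketch (the subroutine, arrays, and loop body are placeholders, not the application above); thread placement itself is controlled outside the code, e.g. KMP_AFFINITY=compact or KMP_AFFINITY=scatter with the Intel OpenMP runtime:

! Placeholder loop illustrating the scheduling clause.  schedule(guided)
! balances uneven work automatically, but chunk assignment can change from
! one pass to the next; schedule(static) gives each thread the same chunk
! every time, which keeps data local once several packages are involved.
subroutine update(a, b, n)
  implicit none
  integer, intent(in)    :: n
  real(8), intent(inout) :: a(n)
  real(8), intent(in)    :: b(n)
  integer :: i
  !$omp parallel do schedule(guided)
  do i = 1, n
     a(i) = a(i) + b(i)
  end do
  !$omp end parallel do
end subroutine update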

Grant_H_Intel
Employee

Reinaldo,

Much depends on the characteristics of your applications as well when choosing a large system. Some application domains have very little locality and large data sets, so memory bandwidth is crucial. For these applications, Xeon processors with 3-channel DDR3 would be best to maximize memory bandwidth. Other applications have inherently high data locality, so for these the memory system is less important (2-channel DDR3 is fine), but the size of the on-chip caches (especially the largest level) should be maximized. Hope this helps a bit!

- Grant

Reinaldo_Garcia
Beginner

Our applications are computationally and memory-intensive finite element models. Typical runs take from several hours to several days on a Core i7-920 processor. Parallelizing the code with OpenMP has made a huge difference on these processors, but I was wondering how much better performance we could expect using 2 Xeon processors instead of one Core i7.

Thanks

R//G

TimP
Honored Contributor III
Results from http://topcrunch.org typically show Xeon 5560 (dual quad-core) platforms giving 70% more performance than Core i7. On some of these applications, the 6-core CPUs give an additional 25%. Note that most results on that site are for MPI applications, which may gain more than OpenMP from additional sockets.
jimdempseyatthecove
Honored Contributor III

Reinaldo,

If your FE code has memory bandwidth issues (as opposed to memory capacity issues), and if you wish to address this with NUMA node locality, then you must invest some time in partitioning the data such that parts are allocated in a distributed manner across the NUMA nodes (and processed by HT threads from within those respective nodes). Using a Structure-of-Arrays layout can help too, as it takes better advantage of the SSE capabilities of the processor.

E.g., if your code is currently set up with objects (nodes) containing property variables, say Vec3 :: pos, then for a Structure-of-Arrays layout you do not allocate objects (nodes); rather, you allocate arrays of your property variables (pos, vel, acc, force, ...).
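A minimal sketch of the two layouts in Fortran (the type and property names are placeholders taken from the example above):

! Array-of-structures: each node object carries its own properties.
type :: node_t
   real(8) :: pos(3), vel(3), acc(3), force(3)
end type node_t
type(node_t), allocatable :: nodes(:)            ! nodes(1:n)

! Structure-of-arrays: one contiguous array per property.  Stored
! component-major, pos(:,1) holds all x components with unit stride, which
! vectorizes (SSE) more readily, and each array can be sliced and placed
! per NUMA node.
type :: state_t
   real(8), allocatable :: pos(:,:), vel(:,:)    ! shape (n, 3)
   real(8), allocatable :: acc(:,:), force(:,:)
end type state_t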

However, on a NUMA platform with NUMA allocation, you would pre-slice the property variable arrays by the number of NUMA nodes (2, 4, 8, ...) and allocate those slice domains within those nodes. Then construct your processing loops such that each slice gets preferential (or exclusive) processing by threads within its NUMA node.

This requires more programming work, but once you have done it for one major loop, it becomes close to a cut-and-paste operation to set up your next major loop.

Some of my FE runs (Space Elevator simulations) would take weeks to complete.

Jim Dempsey

TimP
Honored Contributor III

Yes, NUMA locality is among the factors that require attention for OpenMP to be fully effective on Xeon 55xx or Opteron platforms.

If the data are laid out so as to promote SSE vectorization, and if it is possible to arrange the first touch (e.g., when arrays are initialized) to be done by an OpenMP parallel loop of the same structure as the working loops, so that each section of an array is always accessed from the same processor and sharing of cache lines between processors is minimized, that should take care of it. Needless to say, this may be difficult to arrange in an existing application which was not designed for locality.
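A minimal sketch of that first-touch pattern (the routine and array names are placeholders; it also assumes threads are pinned, e.g. via KMP_AFFINITY, so the same threads run on the same cores in both loops):

subroutine first_touch_then_compute(a, work, n)
  implicit none
  integer, intent(in)  :: n
  real(8), intent(out) :: a(n)
  real(8), intent(in)  :: work(n)
  integer :: i

  ! Initialization loop: with a first-touch memory policy, each page of a()
  ! is allocated on the NUMA node of the thread that first writes to it.
  !$omp parallel do schedule(static)
  do i = 1, n
     a(i) = 0.0d0
  end do
  !$omp end parallel do

  ! Working loop with the identical static iteration split: each thread
  ! keeps accessing the pages it first touched, so traffic stays node-local
  ! and cache lines are not shared across processors.
  !$omp parallel do schedule(static)
  do i = 1, n
     a(i) = a(i) + work(i)
  end do
  !$omp end parallel do
end subroutine first_touch_then_compute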
