We are considering purchasing a computer to run very large parallelized applications based on OpenMP. We are looking for the optimal configuration in terms of processor (e.g. Core i7 vs. Xeon), cache memory, number of processors, etc.
Any suggestions or experiences you can share would be highly appreciated.
Thanks,
R
The practical distinction between Core i7 and Xeon here is the number of CPU packages. A single Core i7 is excellent for supporting up to 4 threads (8, if your application uses HyperThreading effectively); 12GB of RAM is routine, and 24GB is possible. A dual-socket Xeon 55xx platform provides 8 cores, and could support 16 threads with HyperThreading, with 48GB of RAM not an unusual configuration.
Within a few months, corresponding Westmere models will raise such platforms to 12 cores. At that point, for an application to be considered large, it might require a Nehalem-EX system with 4 CPU packages of 6 or 8 cores each and 128 or 256GB of RAM. For OpenMP to run well on these larger numbers of packages and cores, increased attention to memory locality issues is needed.
Thanks so much for your helpful insight. I have a question regarding your remark about memory issues when using larger numbers of cores: does it mean that the same code that runs fine on a quad-core processor would need to be modified to run on a larger system with multiple processors?
Thanks again,
R
If a great deal of foresight has gone into its organization, or the application is simple enough, it's certainly possible that an OpenMP application designed for a quad-core CPU will scale automatically, even to four 8-core CPUs, given suitable environment-variable settings (KMP_AFFINITY for the Intel OpenMP runtime).
On the other hand, an application which works well on a single multi-core CPU in spite of false sharing, or even latent race conditions, is likely to be a disaster already on a dual-CPU system.
In an example I discussed earlier, simple use of OpenMP schedule(guided) balanced the work between threads well only up to 8 threads, but it was possible to scale effectively to at least 24 cores by writing in explicit load balancing to keep each chunk local to the same core (and memory bank).
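To make the contrast concrete, here is a minimal sketch of the two approaches; the array name, function names, and loop body are purely illustrative, not taken from that example. The second version shows the kind of fixed-ownership chunking that, combined with pinned threads, keeps each block on one core and memory bank:

#include <omp.h>
#include <vector>

// (a) Let the runtime balance the work. Fine at modest thread counts, but
//     chunks can move between threads from pass to pass, so the data a thread
//     works on is not tied to one core or memory bank.
void pass_guided(std::vector<double>& u) {
    #pragma omp parallel for schedule(guided)
    for (long long i = 0; i < (long long)u.size(); ++i)
        u[i] *= 0.5;                               // placeholder for the real work
}

// (b) Fix the ownership: each thread always processes the same contiguous block,
//     so with pinned threads (e.g. KMP_AFFINITY=granularity=fine,compact for the
//     Intel OpenMP runtime) each block stays local to one core and memory bank.
//     If the work per element is uneven, adjusting these block boundaries is
//     where the hand-written load balancing comes in.
void pass_blocked(std::vector<double>& u) {
    const long long n = (long long)u.size();
    #pragma omp parallel
    {
        const long long t  = omp_get_thread_num();
        const long long nt = omp_get_num_threads();
        const long long lo = n * t / nt;
        const long long hi = n * (t + 1) / nt;
        for (long long i = lo; i < hi; ++i)
            u[i] *= 0.5;
    }
}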
Reinaldo,
Much depends on the characteristics of your applications as well when choosing a large system. Some application domains have very little locality and large data sets, so memory bandwidth is crucial. For these applications, Xeon processors with 3-channel DDR3 would be best to maximize memory bandwidth. Other applications have inherently high data locality, so for these the memory system is less important (2-channel DDR3 is fine), but the size of the on-chip caches (especially the last level) should be maximized. Hope this helps a bit!
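If you are not sure which category your code falls into, a STREAM-style triad over arrays much larger than the last-level cache gives a rough idea of the memory bandwidth a configuration can actually deliver; below is a minimal OpenMP sketch (array names and size are arbitrary, and on a multi-socket box the arrays would also need NUMA-aware first touch):

#include <omp.h>
#include <cstdio>
#include <vector>

int main() {
    const long long N = 1LL << 25;                 // 32M doubles, roughly 256 MB per array
    std::vector<double> a(N), b(N, 1.0), c(N, 2.0);
    const double t0 = omp_get_wtime();
    #pragma omp parallel for schedule(static)
    for (long long i = 0; i < N; ++i)
        a[i] = b[i] + 3.0 * c[i];                  // triad: 2 loads + 1 store per element
    const double t1 = omp_get_wtime();
    const double gbytes = 3.0 * N * sizeof(double) / 1e9;
    std::printf("triad: %.2f GB/s (check %.1f)\n", gbytes / (t1 - t0), a[N / 2]);
    return 0;
}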
- Grant
Our applications are computationally and memory intensive finite element models. Typical runs take from several hours to several days on an i7-920 processor. Parallelizing the code with OpenMP has made a huge difference on these processors, but I was wondering how much better performance we could expect using 2 Xeon processors instead of one i7.
Thanks
R//G
Reinaldo,
If your FE code has memory bandwidth issues (as opposed to memory capacity issues), and you wish to address this with NUMA node locality, then you must invest some time in partitioning the data such that the parts are allocated in a distributed manner across the NUMA nodes (and processed by the (HT) threads within those respective nodes). Using a Structure-of-Arrays layout can help too, as it takes better advantage of the SSE capabilities of the processor.
For example, if your code is currently set up with objects (nodes) containing property variables, say a Vec3 pos, then for a Structure-of-Arrays layout you do not allocate objects (nodes); rather, you allocate arrays of your property variables (pos, vel, acc, force, ...).
Further, on a NUMA platform with NUMA-aware allocation, you would pre-slice the property arrays by the number of NUMA nodes (2, 4, 8, ...) and allocate those slices within their respective nodes. Then construct your processing loops such that each slice gets preferential (or exclusive) processing by the threads within its NUMA node.
This requires more programming work, but once you have done it for one major loop, it becomes close to a cut-and-paste operation to set up your next major loop.
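As a rough sketch of what that can look like (all names here are illustrative; only pos and force are shown since vel, acc, etc. follow the same pattern; and it assumes libnuma is available, the total thread count is a multiple of the node count, and affinity is set so consecutive groups of threads sit on consecutive nodes):

// Build with something like: g++ -O2 -fopenmp fe_soa.cpp -lnuma
#include <omp.h>
#include <numa.h>                        // numa_alloc_onnode()
#include <cstddef>

const int NUM_NODES = 2;                 // NUMA nodes in the machine: 2, 4, 8, ...

struct Slice {                           // one NUMA node's share of the mesh
    std::size_t count;
    double *pos_x, *pos_y, *pos_z;
    double *force_x, *force_y, *force_z;
};

struct Mesh { Slice slice[NUM_NODES]; };

static double* alloc_on_node(std::size_t n, int node) {
    return static_cast<double*>(numa_alloc_onnode(n * sizeof(double), node));
}

// Pre-slice every property array and place each slice on its own node.
void allocate(Mesh& m, std::size_t nodes_per_slice) {
    for (int s = 0; s < NUM_NODES; ++s) {
        Slice& sl  = m.slice[s];
        sl.count   = nodes_per_slice;
        sl.pos_x   = alloc_on_node(nodes_per_slice, s);
        sl.pos_y   = alloc_on_node(nodes_per_slice, s);
        sl.pos_z   = alloc_on_node(nodes_per_slice, s);
        sl.force_x = alloc_on_node(nodes_per_slice, s);
        sl.force_y = alloc_on_node(nodes_per_slice, s);
        sl.force_z = alloc_on_node(nodes_per_slice, s);
    }
}

// Each major loop is then written so the threads of node s touch only slice s.
void zero_forces(Mesh& m) {
    #pragma omp parallel
    {
        const int tid      = omp_get_thread_num();
        const int per_node = omp_get_num_threads() / NUM_NODES;
        Slice& sl = m.slice[tid / per_node];               // this thread's node
        const std::size_t local = tid % per_node;
        const std::size_t lo = sl.count *  local      / per_node;
        const std::size_t hi = sl.count * (local + 1) / per_node;
        for (std::size_t i = lo; i < hi; ++i)
            sl.force_x[i] = sl.force_y[i] = sl.force_z[i] = 0.0;
    }
}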
Some of my FE runs (Space Elevator simulations) would take weeks to complete.
Jim Dempsey
Yes, NUMA locality is among the factors which require attention for OpenMP to give full effectiveness on Xeon 55xx or Opteron platforms.
If the data are laid out so as to promote SSE vectorization, and if the first touch (e.g. when the arrays are initialized) can be arranged to be done by an OpenMP parallel loop with the same structure as the working loops, so that each section of an array is always accessed from the same processor and sharing of cache lines between processors is minimized, that should take care of it. Needless to say, this may be difficult to arrange in an existing application which was not designed for locality.
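A minimal sketch of that pattern, with illustrative names (the key point is that the initialization loop and the working loops share the same bounds and schedule(static) partition):

#include <omp.h>

double* make_field(long long n) {
    // new double[] leaves the values uninitialized, so no page is touched here;
    // the parallel loop below performs the first touch.
    double* f = new double[n];
    #pragma omp parallel for schedule(static)
    for (long long i = 0; i < n; ++i)
        f[i] = 0.0;                       // first touch, in the working partition
    return f;
}

void update(double* f, const double* g, long long n) {
    // Same bounds and schedule(static): the thread-to-element mapping matches the
    // initialization, so each section of the arrays is always accessed from the
    // same processor and cache-line sharing between processors stays minimal.
    #pragma omp parallel for schedule(static)
    for (long long i = 0; i < n; ++i)
        f[i] += 0.5 * g[i];
}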
