Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Theoretical Maximum Throughput on Xeon 5500 Nehalem

gallomimia
Beginner
Hi guys. Lots of great people on this forum, so I'm certain someone can help me answer this question. What I'm not sure of is if it has been discussed before, or if I'm even in the right forum.

Obviously my application will be multithreaded to take advantage of an 8 Core system. It will be coded in-house, and take advantage of every bit of processing power the Nehalem can muster. I intend to run this application on a single rackmount server.

My question will help me find the answer to "how much RAM is enough?"

I wish to know how much data this rig could chew up per second. A 2.93GHz clockspeed is the candidate, but let's keep it scalable.

My formula is as follows:

Clockspeed * number of cores * number of CPUs * number of instructions per cycle * size of operand * number of operands per operation

Add to this the SIMD vector unit, which I hope to use as much as possible.

Clockspeed * number of cores * number of CPUs * number of instructions per cycle * size of vector * number of operands per operation

In nearly all cases the operands will be two.
The vector is 256 bits I believe. Please correct me if I'm wrong.
Clockspeed, number of cores, and number of CPUs are all easily answered. 2 930 000 000 * 4 * 2
The instructions per cycle is throwing me for a loop. With Hyperthreading is it exactly two? What factors can affect this number?
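
The two formulas above can be written as one small function. This is only a sketch: the function and parameter names are hypothetical, and instructions per cycle is set to a placeholder of 1 because that value is the open question here. The result is a ceiling on operand bytes touched per second, not a measured rate.

```python
def naive_operand_bytes_per_second(clock_hz, cores_per_cpu, cpus,
                                   instr_per_cycle, operand_bits,
                                   operands_per_instr=2):
    """Upper bound on operand bytes consumed per second, per the formula above."""
    operand_bytes = operand_bits // 8
    return (clock_hz * cores_per_cpu * cpus * instr_per_cycle
            * operand_bytes * operands_per_instr)

# 2.93 GHz, 4 cores per CPU, 2 CPUs, 64-bit scalar operands, 2 operands each.
# instr_per_cycle=1 is a placeholder, NOT a Nehalem figure.
scalar_rate = naive_operand_bytes_per_second(2_930_000_000, 4, 2, 1, 64)
print(f"{scalar_rate / 1e9:.2f} GB/s")  # 375.04 GB/s
```

For the SIMD case, pass the vector width as `operand_bits` and the SIMD issue rate as `instr_per_cycle`.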

My final question is, can the SIMD vector unit be engaged in the same cycle as the hyperthreading integer and/or floating point units? How many SIMD instructions per cycle?

All of this information will help me get a rough estimate of how much data the processors in this 8 core box can run through. Dual CPU Nehalems today can handle an awful lot of memory, with the memory modules alone costing 10 thousand dollars. I need to know if it's required.

Thanks in advance for any information you can provide to help me. I'll be certain to use the forum's points system if your reply warrants such.
2 Replies
TimP
Honored Contributor III
Quoting - gallomimia
The vector is 256 bits I believe.
The instructions per cycle is throwing me for a loop. With Hyperthreading is it exactly two? What factors can affect this number?

My final question is, can the SIMD vector unit be engaged in the same cycle as the hyperthreading integer and/or floating point units? How many SIMD instructions per cycle?

All of this information will help me get a rough estimate of how much data the processors in this 8 core box can run through. Dual CPU Nehalems today can handle an awful lot of memory, with the memory modules alone costing 10 thousand dollars.
The first Xeon with 256-bit registers is "Sandy Bridge"; no prototypes exist yet, only emulators. On Nehalem the SSE vector registers are 128 bits wide.
I'll start with peak SIMD arithmetic instructions per cycle. The number is 2, provided they use different execution units (typically one multiply and one add) and enough independent operations can be found to keep the pipelines full. The multiply pipeline would accommodate at least 5 operations in flight. HyperThreading doesn't change these numbers; ideally, it might help keep the pipelines full.
The SIMD units are the floating point units. A scalar floating point operation uses the same unit as the corresponding parallel operation. This hasn't changed since the first SSE processors. The scalar integer arithmetic units are separate and run in parallel with SIMD. The advertised maximum issue width of the CPU is 5 instructions, the same as Core 2. Of course, that number has little practical use: code that attempts to issue branches on every cycle, for example, would never run at satisfactory speed for a significant number of cycles.
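
Those figures translate into a peak arithmetic rate. A rough sketch, assuming 128-bit SSE registers, two SIMD arithmetic instructions per cycle per core as described above, and the 2.93 GHz dual quad-core configuration from the question. The function name is hypothetical, and the result is a ceiling that real code rarely approaches.

```python
def peak_simd_flops(clock_hz, total_cores, element_bits,
                    simd_instr_per_cycle=2, vector_bits=128):
    """Peak floating point operations per second across all cores."""
    lanes = vector_bits // element_bits  # elements processed per SIMD instruction
    return clock_hz * total_cores * simd_instr_per_cycle * lanes

dp = peak_simd_flops(2_930_000_000, 8, 64)  # double precision: 2 lanes
sp = peak_simd_flops(2_930_000_000, 8, 32)  # single precision: 4 lanes
print(f"double: {dp / 1e9:.2f} GFLOP/s, single: {sp / 1e9:.2f} GFLOP/s")
# double: 93.76 GFLOP/s, single: 187.52 GFLOP/s
```

HyperThreading does not appear as a factor because, as noted, it does not raise the per-core peak.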
On models such as the one you mention, the core speed is significantly higher than the uncore and QPI speed, so even the last level cache can't deliver operands at the rate the CPU consumes them. Even under ideal circumstances, a data movement rate of two 128-bit operands per cycle is possible only for data resident in the DCU (1st level cache), and a DTLB miss invalidates both 1st and 2nd level cache.
Memory-intensive applications are usually limited by memory bandwidth. The peak data transfer rate prominently advertised for each CPU is achieved only with one DIMM per channel and with NUMA (Non-Uniform Memory Access) optimization. As you hint, the price of 4GB DIMMs calls for serious consideration of the financial trade-off. One OEM initially declined to show the price of a 12x2GB DIMM configuration in its catalog, although that is one of the more frequently required choices.
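
To put the gap in numbers, the sketch below compares the operand rate the cores could consume from 1st level cache with the peak DRAM transfer rate. The memory configuration is an assumption, not something stated in this thread: three DDR3-1333 channels per socket at 8 bytes per transfer, one DIMM per channel. Check the datasheet for the actual part. Function names are hypothetical.

```python
def l1_operand_demand(clock_hz, total_cores, operands_per_cycle=2,
                      operand_bits=128):
    """Bytes per second the cores could consume from 1st level cache."""
    return clock_hz * total_cores * operands_per_cycle * (operand_bits // 8)

def peak_dram_bandwidth(sockets, channels_per_socket, transfers_per_s,
                        bytes_per_transfer=8):
    """Nominal peak DRAM bytes per second, summed over all sockets."""
    return sockets * channels_per_socket * transfers_per_s * bytes_per_transfer

demand = l1_operand_demand(2_930_000_000, 8)
# ASSUMPTION: 2 sockets x 3 channels of DDR3-1333 (about 1333 MT/s each).
supply = peak_dram_bandwidth(2, 3, 1_333_000_000)
ratio = demand / supply
print(f"demand {demand / 1e9:.1f} GB/s, supply {supply / 1e9:.1f} GB/s, "
      f"ratio {ratio:.1f}x")
```

Under these assumptions the cores can consume operands roughly an order of magnitude faster than DRAM can supply them, so the naive clock-based formula overstates how much data the box can stream from RAM.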
Most existing applications run well with about the same amount of RAM that previous CPUs required. Because of NUMA, full performance requires each CPU to have enough local RAM, so uneven usage might require more total RAM on a NUMA system than on a non-NUMA one.
The current CPU hasn't changed much from the predecessor (Harpertown); the major difference is in the memory and cache organization.
Solid State Disks are recommended for certain applications with high memory requirements. A single solid state disk delivers nearly the bandwidth of a RAID 0 array of 4 rotating disks, although the capacity is far less. I ran a benchmark recently in which 36GB of DDR3-800 plus a single SSD gave good performance for a type of job that previously required 96GB of RAM (substituting a traditional out-of-core solver for an in-core solver).
gaston-hillar
Valued Contributor I
Hi gallomimia,

tim18 has been very clear. Thus, I'm just adding a simple point of view.
Given your goal of finding out "how much RAM is enough?", I think you should consider the "sustainable theoretical maximum throughput".
Since the theoretical maximum throughput won't be available all the time, the "sustainable theoretical maximum throughput" will give you a more realistic figure.
Consider the specifications from hard drive or SSD manufacturers. They quote the theoretical maximum throughput from the buffer and, separately, the sustainable maximum throughput.
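
The same distinction matters when sizing RAM: the time to work through a data set depends on the sustained rate, not the peak. A tiny sketch; the function name is hypothetical and both rates below are made-up placeholders, not measurements of any Nehalem system.

```python
def seconds_to_stream(data_bytes, bytes_per_second):
    """Time to pass once over a data set at a given throughput."""
    return data_bytes / bytes_per_second

data_bytes = 96 * 10**9          # a 96 GB data set
peak_rate = 60 * 10**9           # PLACEHOLDER peak rate, bytes/s
sustained_rate = 30 * 10**9      # PLACEHOLDER sustained rate, bytes/s

print(f"at peak:      {seconds_to_stream(data_bytes, peak_rate):.1f} s")
print(f"at sustained: {seconds_to_stream(data_bytes, sustained_rate):.1f} s")
```

Replacing the placeholders with measured sustained figures gives a more honest estimate of how fast the application can actually consume what is held in memory.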

Again, this is not an answer to your post. Just an opinion. :)