"Software can maximize memory performance by not exceeding the issue or buffering limitations of the machine. In the Intel Core microarchitecture, only 20 stores and 32 loads may be in flight at once. In Intel microarchitecture code name Nehalem, there are 32 store buffers and 48 load buffers. Since only one load can issue per cycle, algorithms which operate on two arrays are constrained to one operation every other cycle unless you use programming tricks to reduce the amount of memory usage."
Also see http://www.realworldtech.com/page.cfm?ArticleID=RWT040208182719&p=7 for a comparison of Core2 and Nehalem.
If you want to see the actual number of outstanding misses per clocktick on Core2, you can use VTune to collect the events below. First, the memory read latency (number of clockticks per miss):
Read Latency = BUS_REQUEST_OUTSTANDING.SELF / (BUS_TRANS_BRD.SELF-BUS_TRANS_IFETCH.SELF)
and the number of outstanding read misses per clocktick:
Read Misses/clocktick = BUS_REQUEST_OUTSTANDING.SELF / CPU_CLK_UNHALTED.CORE
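As a rough sketch of how the two ratios above could be computed once you have exported the raw event counts from VTune: the function names and the sample counts below are hypothetical and only illustrate the arithmetic; they are not measurements from any real run.

```python
def read_latency(bus_request_outstanding, bus_trans_brd, bus_trans_ifetch):
    """Average clockticks per data read miss.

    Arguments are raw counts of BUS_REQUEST_OUTSTANDING.SELF,
    BUS_TRANS_BRD.SELF and BUS_TRANS_IFETCH.SELF.
    """
    # Burst reads minus instruction fetches leaves the data reads.
    data_reads = bus_trans_brd - bus_trans_ifetch
    if data_reads <= 0:
        raise ValueError("no data read transactions were counted")
    return bus_request_outstanding / data_reads


def read_misses_per_clocktick(bus_request_outstanding, cpu_clk_unhalted_core):
    """Average number of outstanding read misses in any given clocktick.

    Arguments are raw counts of BUS_REQUEST_OUTSTANDING.SELF and
    CPU_CLK_UNHALTED.CORE.
    """
    if cpu_clk_unhalted_core <= 0:
        raise ValueError("no unhalted core clockticks were counted")
    return bus_request_outstanding / cpu_clk_unhalted_core


if __name__ == "__main__":
    # Hypothetical sample counts, chosen only to make the arithmetic easy.
    outstanding = 6_000_000   # BUS_REQUEST_OUTSTANDING.SELF
    brd = 50_000              # BUS_TRANS_BRD.SELF
    ifetch = 10_000           # BUS_TRANS_IFETCH.SELF
    clk = 2_000_000           # CPU_CLK_UNHALTED.CORE

    print("Read latency (clockticks per miss):",
          read_latency(outstanding, brd, ifetch))
    print("Outstanding read misses per clocktick:",
          read_misses_per_clocktick(outstanding, clk))
```

If the second number stays well below the load buffer count quoted above, the buffering limit is probably not what is holding the code back.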