memory reads in my program. The data may or
may not be in the cache memories.
In a typical Intel system, say a Core 2 Duo, how many
reads can be done out of order?
To be more specific, I need to know the typical window size
within which out-of-order loading of data is possible.
Any reference material on this would be of great help.
The max possible outstanding loads and stores can be found in the Optimization Guide.
Refer to Intel 64 and IA-32 Architectures Optimization Reference Manual, http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimizati... section 3.6.1:
"Software can maximize memory performance by not exceeding the issue or buffering limitations of the machine. In the Intel Core microarchitecture, only 20 stores and 32 loads may be in flight at once. In Intel microarchitecture code name Nehalem, there are 32 store buffers and 48 load buffers. Since only one load can issue per cycle, algorithms which operate on two arrays are constrained to one operation every other cycle unless you use programming tricks to reduce the amount of memory usage."
Also see http://www.realworldtech.com/page.cfm?ArticleID=RWT040208182719&p=7 for Core2 and Nehalem comparison.
If you want to see the actual number of outstanding misses per clocktick on Core2, you can use VTune to collect the memory read latency (number of clockticks per miss):
Read Latency = BUS_REQUEST_OUTSTANDING.SELF / (BUS_TRANS_BRD.SELF-BUS_TRANS_IFETCH.SELF)
and the number of outstanding read misses per clocktick:
Read Misses/clocktick = BUS_REQUEST_OUTSTANDING.SELF / CPU_CLK_UNHALTED.CORE
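To make the arithmetic concrete, here is a minimal Python sketch of the two formulas above. The function names and the sample counter values are my own, made up for illustration; they are not VTune output or part of any VTune API. You would plug in the event counts you actually collected.

```python
def read_latency_clocks(bus_request_outstanding, bus_trans_brd, bus_trans_ifetch):
    """Read Latency = BUS_REQUEST_OUTSTANDING.SELF /
                      (BUS_TRANS_BRD.SELF - BUS_TRANS_IFETCH.SELF)"""
    # Subtract instruction fetches so only data reads remain in the divisor.
    data_reads = bus_trans_brd - bus_trans_ifetch
    if data_reads <= 0:
        raise ValueError("no data read transactions were counted")
    return bus_request_outstanding / data_reads


def read_misses_per_clock(bus_request_outstanding, cpu_clk_unhalted_core):
    """Read Misses/clocktick = BUS_REQUEST_OUTSTANDING.SELF / CPU_CLK_UNHALTED.CORE"""
    if cpu_clk_unhalted_core <= 0:
        raise ValueError("no unhalted core clockticks were counted")
    return bus_request_outstanding / cpu_clk_unhalted_core


# Hypothetical counter values, for illustration only.
print(read_latency_clocks(6_000_000, 35_000, 5_000))  # 200.0 clockticks per read
print(read_misses_per_clock(6_000_000, 3_000_000))    # 2.0 reads outstanding per clocktick
```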
I forgot to mention that the 'Read latency' in my previous reply will reflect the latency of each read.
How does the latency reported by the counters compare to the latency reported by a latency program (a load to use, linked list, pretty standard latency test)?
This counter latency will (unless prefetchers are disabled in the BIOS) probably be greater than the latency the program reports.
The actual latency reported depends on the stride. For instance, if a stride is 64 bytes and the prefetchers are enabled, the prefetchers will request the data before the load instruction takes place. This reduces the effective latency seen by the latency program.
Here is a table of results for Sandy Bridge using my own latency tester program and the uncore counters UNC_IMPH_CBO_TRK_OCCUPANCY.ALL and UNC_IMPH_CBO_TRK_REQUESTS.ALL, with a stride of 64 bytes and a 40 MB array. The effective latency is cntr_latency/load_outstanding.
Prog_latency (ns)                 81.65    8.109
cntr_latency (ns)                 70.41    49.49
Load_outstanding (loads/cycle)    0.868    6.137
Effective_latency (ns)            81.11    8.064
So, just to recap, you can compute the effective latency (the latency the latency tester program actually sees) from the counter latency and the loads outstanding per clock.
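As a quick check of that relationship, here is a small Python sketch that recomputes the effective latency from the counter latency and loads outstanding in the table. The function name is my own, for illustration only. The recomputed values differ slightly from the table in the last digit, presumably because the table entries are already rounded.

```python
def effective_latency_ns(cntr_latency_ns, load_outstanding):
    """Effective latency = cntr_latency / load_outstanding."""
    if load_outstanding <= 0:
        raise ValueError("load_outstanding must be positive")
    return cntr_latency_ns / load_outstanding


# (cntr_latency in ns, Load_outstanding) pairs taken from the table above.
for cntr_latency, outstanding in [(70.41, 0.868), (49.49, 6.137)]:
    print(f"{effective_latency_ns(cntr_latency, outstanding):.2f} ns")
```

Both results land close to the Prog_latency values the latency tester itself reported (81.65 ns and 8.109 ns).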