Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

write only memory bandwidth

zachary_w_
Beginner

Hello, I've got a W3670 CPU with 6GB of DDR3-1066. The Intel ARK page for this CPU says the memory bandwidth is 25.6 GB/s. Should I be able to reach this rate doing writes only, or can it only be reached with simultaneous reads and writes?

I have a small test program that measures memory bandwidth by timing memset on many large buffers from many threads. This test can reach approx. 13 GB/s. If I change to memcpy, the read+write rate becomes approx. 18 GB/s.
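For illustration, here is a minimal sketch of this kind of measurement -- per-thread buffers, repeated memset, aggregate GB/s -- assuming OpenMP for the threading (the actual test program, attached later in the thread, may be structured differently):

/* Threaded write-bandwidth sketch (OpenMP assumed; hypothetical, not the attached program). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <omp.h>

#define BUF_BYTES (128UL * 1024 * 1024)   /* 128 MiB per thread */
#define REPS      160

int main(void)
{
    int nthreads = omp_get_max_threads();
    char **bufs = malloc(nthreads * sizeof(char *));
    for (int i = 0; i < nthreads; i++)
        bufs[i] = malloc(BUF_BYTES);

    /* Touch every buffer once so page faults are not included in the timing. */
    #pragma omp parallel
    memset(bufs[omp_get_thread_num()], 1, BUF_BYTES);

    double t0 = omp_get_wtime();
    #pragma omp parallel
    for (int r = 0; r < REPS; r++)                 /* each thread writes its own buffer */
        memset(bufs[omp_get_thread_num()], 0, BUF_BYTES);
    double t1 = omp_get_wtime();

    double bytes = (double)nthreads * REPS * BUF_BYTES;
    printf("wrote %.3e bytes in %.3f s: %.2f GB/s\n",
           bytes, t1 - t0, bytes / (t1 - t0) / 1e9);
    return 0;
}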

There is a similar result to mine on this page http://www.oempcworld.com/support/Memory_Architecture.html using the Everest test program.

thanks

zach

TimP
Honored Contributor III
memcpy() should switch automatically to non-temporal stores when it sees sufficiently long strings, so that you don't both read "for ownership" and write each cache line. I can't read between the lines to know where you got your memcpy(). Typical recent OS libraries are optimized for the AMD platforms, which performed 64-bit non-temporal writes, so there may be further performance to be gained on a platform that supports 128-bit non-temporal writes. Intel compilers would substitute a memcpy() from their own library unless you prevent that substitution. To reach peak achievable write bandwidth, you might use non-temporal intrinsics intended for Intel platforms, or try an Intel compiler.
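For illustration, here is a minimal sketch of a copy loop using the 128-bit non-temporal store intrinsic, so the destination lines are written without being read for ownership. This is only a hand-written illustration of the technique described here, not the code path any particular memcpy() implementation takes; it assumes 16-byte-aligned pointers and a length that is a multiple of 16:

#include <emmintrin.h>   /* SSE2: _mm_load_si128, _mm_stream_si128 */
#include <stddef.h>

/* Copy using 128-bit non-temporal (streaming) stores. */
void copy_nt(void *dst, const void *src, size_t n)
{
    __m128i       *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < n / 16; i++)
        _mm_stream_si128(&d[i], _mm_load_si128(&s[i]));
    _mm_sfence();   /* order the streaming stores before subsequent operations */
}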
zachary_w_
Beginner
Hi Tim, sorry I didn't make my question clear. What is the theoretical write-only memory bandwidth? The rate I reach with memset is about half the number listed on ARK. The Wikipedia page for QPI says send and receive are simultaneous and that the total bandwidth is calculated by adding the send and receive numbers together. I wonder if the memory controller has a similar property? http://en.wikipedia.org/wiki/Intel_QuickPath_Interconnect#QuickPath_Interconnect_frequency_specifications thanks
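(For reference, the ARK figure appears to be just the sum of the three DDR3-1066 channels' peak transfer rates -- this arithmetic is mine, not something stated in the thread:

\[
3~\text{channels} \times 1066.67~\text{MT/s} \times 8~\text{B/transfer} \approx 25.6~\text{GB/s},
\]

i.e. there is no read-plus-write doubling built into that number.)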
TimP
Honored Contributor III
Interesting; I hadn't heard of that Wikipedia article. Another way of putting it would be that the peak rated bandwidth is achieved by sending and receiving simultaneously, taking credit for both, so it wouldn't apply to memset(). You still didn't say which memset or memcpy you are testing. I would expect glibc or MSVC memset to be implemented similarly to memcpy, and probably not fully optimized for recent CPUs. Again, if you aren't using the versions from the Intel compiler library, you might check what you can do with 128-bit non-temporal intrinsics, including testing with some loop unrolling and threading.
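For illustration, a sketch of the kind of experiment suggested here: a memset-style fill built from 128-bit streaming stores with 4x unrolling (one 64-byte cache line per iteration). It assumes a 16-byte-aligned buffer and a length that is a multiple of 64; threading could be layered on top as in the earlier sketch:

#include <emmintrin.h>
#include <stddef.h>

/* Fill with 128-bit streaming stores, unrolled to one cache line per iteration. */
void fill_nt(void *p, int byte, size_t n)
{
    __m128i v = _mm_set1_epi8((char)byte);
    __m128i *d = (__m128i *)p;
    for (size_t i = 0; i < n / 16; i += 4) {
        _mm_stream_si128(&d[i + 0], v);
        _mm_stream_si128(&d[i + 1], v);
        _mm_stream_si128(&d[i + 2], v);
        _mm_stream_si128(&d[i + 3], v);
    }
    _mm_sfence();
}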
McCalpinJohn
Honored Contributor III
It is important to be very clear about exactly what is being counted as "bandwidth". See http://www.cs.virginia.edu/stream/ref.html/#counting for some notes. In this case QPI is not relevant -- the DDR3 DRAM buses can drive full bandwidth in either direction, but only in one direction at a time.

Reads are faster than writes (streaming stores) because the Intel hardware prefetchers work really well on reads -- they move data into the L3 or L2 caches so that the processor's L1 cache misses get serviced faster. There is no analog for stores -- there is no way to "prefetch" a streaming store -- you have to wait until the store happens and then send it to memory. So reads get more effective concurrency than streaming stores.

When running a single thread, you will probably find that a simple copy kernel is faster *without* streaming stores, even though without streaming stores the processor has to load both the source and the destination into the cache. In this case, the sequence of store misses *is* prefetched by the hardware prefetchers, so both the reads and the writes get the extra concurrency provided by the L2 prefetchers. The overall bandwidth utilization when using a single thread is low enough that the extra read traffic does not result in a net slowdown. The STREAM Copy kernel is about 25% faster on my Westmere EP (Xeon 5680) processors without streaming stores when using a single thread.

With multiple threads the bandwidth gets high enough that the extra reads get in the way, and the STREAM Copy kernel is much faster with streaming stores (since they eliminate 1/3 of the DRAM traffic). Performance when using multiple threads is typically limited by very complex issues related to DRAM bank conflicts and DRAM bus stalls on read/write, write/read, and rank-to-rank read transitions. The memory controller tries to reorder accesses to reduce these conflicts, but as DRAM gets faster, these stalls are getting more important and more difficult to reduce.
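To make the "eliminate 1/3 of the DRAM traffic" figure concrete (restating the reasoning above): copying N bytes without streaming stores moves 3N bytes through DRAM, while copying with streaming stores moves only 2N:

\[
\underbrace{N}_{\text{read src}} + \underbrace{N}_{\text{RFO dst}} + \underbrace{N}_{\text{write back dst}} = 3N
\qquad\text{vs.}\qquad
\underbrace{N}_{\text{read src}} + \underbrace{N}_{\text{stream dst}} = 2N,
\]

so the streaming-store version removes (3N - 2N)/3N = 1/3 of the traffic for the Copy kernel.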
zachary_w_
Beginner
Thanks Tim, John. Sorry for the slow reply, Monday was a public holiday.

John: I have to think more about what you've said and put more effort into understanding "DRAM bank conflicts and DRAM bus stalls".

The test program I used is attached. If you spot an error please tell me! I saw no significant difference between the Intel and VS2010 compilers, and not much difference between the memset library function and writing my own loops, even with SSE intrinsics. Note the printf statements for the memcpy and increment tests are not counting reads and writes separately (I think this is "bcopy" style according to the STREAM FAQ), so in the output below, double those for STREAM-style numbers.

For completeness, the output for a run on my machine (vs2010 x64 release build on x64 win7):

Running with 6 threads, each thread operating on a buffer of size 134217728 bytes 160 times.
Memset 1.288e+011 bytes in 9.832e+000 seconds, rate 1.310e+010 Bytes/second.
Zero'd 3.221e+010 ULs 1.288e+011 bytes in 9.780e+000 seconds, rate 1.317e+010 Bytes/second.
Zero'd 1.611e+010 ULLs 1.288e+011 bytes in 9.764e+000 seconds, rate 1.320e+010 Bytes/second.
Incremented 1.288e+011 bytes in 1.461e+001 seconds, rate 8.822e+009 Bytes/second.
Incremented 3.221e+010 ULs 1.288e+011 bytes in 1.412e+001 seconds, rate 9.124e+009 Bytes/second.
Incremented 1.611e+010 ULLs 1.288e+011 bytes in 1.404e+001 seconds, rate 9.174e+009 Bytes/second.
Memcpy'd 1.288e+011 bytes in 1.372e+001 seconds, rate 9.389e+009 Bytes/second.
Zero'd using SSE 1.288e+011 bytes in 9.751e+000 seconds, rate 1.321e+010 Bytes/second.
Memcpy using SSE 1.288e+011 bytes in 1.372e+001 seconds, rate 9.390e+009 Bytes/second.
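(Doubling the memcpy and increment rates as described gives the STREAM-style totals:

\[
2 \times 9.39~\text{GB/s} \approx 18.8~\text{GB/s (memcpy)}, \qquad
2 \times 9.17~\text{GB/s} \approx 18.3~\text{GB/s (increment)},
\]

consistent with the ~18 GB/s read+write rate quoted in the original question.)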
McCalpinJohn
Honored Contributor III
Hi Zachary,

These results seem pretty good -- consistent with the STREAM benchmark numbers I get and with the read-only and write-only kernels that I have tested. My Xeon 5600 system has DDR3/1333 DRAM, so some of the numbers are higher, but by less than the ratio of DRAM speed. (This is expected -- DRAM overheads are mostly fixed-time, independent of DRAM speed, so overall efficiency drops as the frequency is increased.)

For example, I get 15-16 GB/s for zeroing (or filling) memory using streaming stores, which is very close to the 13.1-13.2 GB/s you got. Your "increment" kernel is essentially the same as the STREAM "Scale" kernel (when compiled with streaming stores), for which I get 20.7 GB/s, compared to the 17.6-18.3 GB/s that you get (counting reads + writes). I get the same performance on STREAM "Copy" and "Scale" once I figured out how to tell the compiler not to recognize the "Copy" kernel and replace it with a library call.

For data in DRAM, there should usually be negligible difference between packed and scalar SSE on this platform, since performance is limited by the number of outstanding cache misses, the effectiveness of the L2 prefetch engines, and the memory controller's DRAM scheduling -- none of which are changed by vectorization. Sometimes I get the best results using fewer cores than are available on the chip, but the differences are only significant when the kernels have a lot more data streams than these simple examples.
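(A quick check of the "by less than the ratio of DRAM speed" point, using rough midpoints of the ranges quoted above -- my arithmetic:

\[
\frac{1333}{1066} \approx 1.25, \qquad
\frac{15.5~\text{GB/s}}{13.2~\text{GB/s}} \approx 1.17, \qquad
\frac{20.7~\text{GB/s}}{18.0~\text{GB/s}} \approx 1.15 .
\])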
zachary_w_
Beginner
Thanks for checking this out John, I think it's clear that the achievable write-only bandwidth is about half the theoretical max. I was hoping there would be a straightforward explanation but there doesn't seem to be one. Another bandwidth measurement tool http://zsmith.co/bandwidth.html shows similar results. In the commentary section he says "7. Main memory is slower to write than to read. This is just the nature of DRAM. It takes time to charge or discharge the capacitor that is in each DRAM memory cell whereas reading it is much faster." I'm not convinced that's the whole story. Anyway I don't have anything more to add. Thanks again for your help.
McCalpinJohn
Honored Contributor III
Well, I spent most of the week running tests with all of the available performance counters and modeling the performance in a bunch of different ways, and I have not found any clear indication of where the bottleneck might be....

The single thread streaming store performance looks to be limited by Line Fill Buffer occupancy. Each core has 10 Line Fill Buffers. Sometimes one is reserved for special situations to avoid deadlock scenarios, so I will assume 9 are available. If I further assume that servicing a streaming store requires that the buffer be occupied for about the same amount of time as a non-prefetchable read (~67 ns), then I would expect a concurrency-limited bandwidth of 9 buffers * 64 Bytes/buffer / 67 ns = 8.6 GB/s -- not too far from the 7.8 GB/s that I measure.

Open page performance with a single thread is good, with about 41 Write CAS operations (cache line writes) for every page open (ACTIVATE) operation. For 4KiB pages, perfect behavior would be 64 Write CAS operations for each page open (assuming that consecutive 4KiB virtual pages are mapped to random memory banks).

As you add cores, one thing that happens is that you start running into more memory bank conflicts, as the six store streams compete for the open banks on the DRAMs. The page conflict rate goes up by a factor of 200 on my Westmere EP system when running six threads (compared to one thread). These page conflicts cause DRAM pages to be closed and re-opened, resulting in a 7x increase in the number of page open operations (DRAM ACTIVATE commands). The open page performance is not yet pathological, however, with an average of almost 6 Write CAS operations per page open (ACTIVATE).

I tried lots of modeling approaches, and came up with only one that gives results with the right order of magnitude. If I take the time required for 6 cores to transfer all the data and then subtract off the time that would be required at full bandwidth, I am left with a "stall time" for the run. Dividing that "stall time" by the number of page open (ACTIVATE) operations gives 53 ns of "stall time" for every page open. This is very close to the specification of 55 ns for the bank cycle time for the DRAMs in the machine.

I am surprised to see the full bank cycle time "visible" for every ACTIVATE command, though >70% of the ACTIVATE commands are directly associated with page conflicts, and those tend to have the most exposed latency. Even with lots of ACTIVATE commands directly associated with bank conflicts, I would have expected concurrency across the two DRAM ranks to hide more of the bank cycle time latency.... I think it is time to declare "defeat" on this analysis and move on....
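The concurrency-limited estimate in the first paragraph is essentially a Little's Law calculation on the Line Fill Buffers (the "Little's Law" framing is an interpretation; the numbers are the ones given above):

\[
BW_{\max} \approx \frac{\text{buffers} \times \text{bytes per line}}{\text{buffer occupancy}}
= \frac{9 \times 64~\text{B}}{67~\text{ns}} \approx 8.6~\text{GB/s}.
\]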
zachary_w_
Beginner
Great work, Dr. Bandwidth! I have some newly acquired understanding of the DRAM activity, but can only follow along with uncertainty... I can see what I think are the relevant counters around page 54 of http://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf but haven't used them. I'm hoping Intel PCM can do it. What tool do you use to get this information?
McCalpinJohn
Honored Contributor III
To read the uncore performance counters, I use a combination of scripts and inline calls to read and write the corresponding MSRs directly. The uncore performance counter MSRs and events are described in the "Intel Architecture Software Developer's Manual, Volume 3", in the chapters on performance monitoring, performance counter events, and model-specific registers. (Those are chapters 18, 19, and 34 in the version of the document that I am working from -- document 325384-042.)

The tools I use are specific to Linux (though they should work on most Linux systems). It looks like the Intel PCM utility can access all of these same counters, but I have not spent much time with it. Most Linux systems support a device driver interface to read/write MSRs; the device driver is typically set up so that only the root user can access it. The tools I use in the scripts are "rdmsr" and "wrmsr" from the "msr-tools-1.2" package that is available for most Linux systems. For the inline code, I just copied the relevant pieces from "rdmsr.c", particularly the "open" and "pread" statements.

At this point my methodology is to program the uncore counter events using a script external to the program, then launch the program and have it read the counters before and after the section of interest. In my test program I leave two file descriptors for the /dev/cpu/*/msr devices open -- one for a core on chip 0 and one for a core on chip 1. Then, for each of those file descriptors, I read MSR 0x394 to get the free-running uncore clock counter (for that chip) and MSRs 0x3B0 through 0x3B7 to get the counts from the 8 programmable uncore performance counters (for that chip).

The script is set up to program several different sets of events and run the program under test once for each set of events. For this analysis I used six sets of events (not all of which are actually useful), but for reference they are listed below. In each case, the performance counter event select starts with 0x0040, which simply sets the "enable" bit for the counter, and each event select register ends with the Mask and the Event number.
For example, $WRMSR 0x3C0 0x00400429 programs MSR 0x3C0 to enable the counter, set the Mask to 04, and set the Event to 29.

Here is the full list that I used:

# Enable uncore counters -- no harm if this is repeated
$WRMSR 0x391 0x00000001000000ff   # MSR_UNCORE_PERF_GLOBAL_CTRL: bit 32=enable fixed ctr, bits 7:0=enable ctrs 0-7
$WRMSR 0x395 0x0000000000000001   # MSR_UNCORE_FIXED_CTR_CTRL: bit 0=enable uncore fixed function counter

if [ $1 == "1" ]; then          # set 1
    $WRMSR 0x3C0 0x00400128     # Cycles all entries in high priority queue of chan 0 are occupied with isoc READ reqs
    $WRMSR 0x3C1 0x00400228     # Cycles all entries in high priority queue of chan 1 are occupied with isoc READ reqs
    $WRMSR 0x3C2 0x00400428     # Cycles all entries in high priority queue of chan 2 are occupied with isoc READ reqs
    $WRMSR 0x3C3 0x00400828     # Cycles all entries in high priority queue of chan 0 are occupied with isoc WRITE reqs
    $WRMSR 0x3C4 0x00401028     # Cycles all entries in high priority queue of chan 1 are occupied with isoc WRITE reqs
    $WRMSR 0x3C5 0x00402028     # Cycles all entries in high priority queue of chan 2 are occupied with isoc WRITE reqs
    $WRMSR 0x3C6 0x00400129     # Cycles where channel 0 has at least one READ request pending
    $WRMSR 0x3C7 0x00400229     # Cycles where channel 1 has at least one READ request pending
elif [ $1 == "2" ]; then        # set 2
    $WRMSR 0x3C0 0x00400429     # Cycles where channel 2 has at least one READ request pending
    $WRMSR 0x3C1 0x00400829     # Cycles where channel 0 has at least one WRITE request pending
    $WRMSR 0x3C2 0x00401029     # Cycles where channel 1 has at least one WRITE request pending
    $WRMSR 0x3C3 0x00402029     # Cycles where channel 2 has at least one WRITE request pending
    $WRMSR 0x3C4 0x0040012F     # FULL cache line writes to channel 0
    $WRMSR 0x3C5 0x0040022F     # FULL cache line writes to channel 1
    $WRMSR 0x3C6 0x0040042F     # FULL cache line writes to channel 2
    $WRMSR 0x3C7 0x0040082F     # PARTIAL cache line writes to channel 0
elif [ $1 == "3" ]; then        # set 3
    $WRMSR 0x3C0 0x0040102F     # PARTIAL cache line writes to channel 1
    $WRMSR 0x3C1 0x0040202F     # PARTIAL cache line writes to channel 2
    $WRMSR 0x3C2 0x00400160     # DRAM Page Open (ACTIVATE) commands on channel 0
    $WRMSR 0x3C3 0x00400260     # DRAM Page Open (ACTIVATE) commands on channel 1
    $WRMSR 0x3C4 0x00400460     # DRAM Page Open (ACTIVATE) commands on channel 2
    $WRMSR 0x3C5 0x00400161     # DRAM Page Close due to idle timer timeout on channel 0
    $WRMSR 0x3C6 0x00400261     # DRAM Page Close due to idle timer timeout on channel 1
    $WRMSR 0x3C7 0x00400461     # DRAM Page Close due to idle timer timeout on channel 2
elif [ $1 == "4" ]; then        # set 4
    $WRMSR 0x3C0 0x00400162     # DRAM Page Close due to bank conflict on channel 0
    $WRMSR 0x3C1 0x00400262     # DRAM Page Close due to bank conflict on channel 1
    $WRMSR 0x3C2 0x00400462     # DRAM Page Close due to bank conflict on channel 2
    $WRMSR 0x3C3 0x00400163     # READ CAS operations (without autoprecharge) on channel 0
    $WRMSR 0x3C4 0x00400463     # READ CAS operations (without autoprecharge) on channel 1
    $WRMSR 0x3C5 0x00401063     # READ CAS operations (without autoprecharge) on channel 2
    $WRMSR 0x3C6 0x00400164     # WRITE CAS operations (without autoprecharge) on channel 0
    $WRMSR 0x3C7 0x00400464     # WRITE CAS operations (without autoprecharge) on channel 1
elif [ $1 == "5" ]; then        # set 5
    $WRMSR 0x3C0 0x00401064     # WRITE CAS operations (without autoprecharge) on channel 2
    $WRMSR 0x3C1 0x00400165     # DRAM REFRESH operations on channel 0
    $WRMSR 0x3C2 0x00400265     # DRAM REFRESH operations on channel 1
    $WRMSR 0x3C3 0x00400465     # DRAM REFRESH operations on channel 2
    $WRMSR 0x3C4 0x00400166     # DRAM PRECHARGE ALL operations on channel 0
    $WRMSR 0x3C5 0x00400266     # DRAM PRECHARGE ALL operations on channel 1
    $WRMSR 0x3C6 0x00400466     # DRAM PRECHARGE ALL operations on channel 2
    $WRMSR 0x3C7 0x00400100     # Cycles uncore global queue read tracker is full
elif [ $1 == "6" ]; then        # set 6
    $WRMSR 0x3C0 0x00400200     # Cycles uncore global queue write tracker is full
    $WRMSR 0x3C1 0x00400400     # Cycles uncore global queue peer probe tracker is full
    $WRMSR 0x3C2 0x00400101     # Cycles uncore global queue read tracker is not empty
    $WRMSR 0x3C3 0x00400201     # Cycles uncore global queue write tracker is not empty
    $WRMSR 0x3C4 0x00400401     # Cycles uncore global queue peer probe tracker is not empty
    $WRMSR 0x3C5 0x00400167     # Cycles DRAM was throttled due to DRAM over-temperature
    $WRMSR 0x3C6 0x00400285     # Uncore cycles with at least one core unhalted and all L3 ways enabled
    $WRMSR 0x3C7 0x00400184     # Uncore cycles with core 0 operating in Turbo mode
else
    echo "Error -- event set $1 not defined"
    exit
fi
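For reference, a minimal sketch of the in-line MSR reads described above (mirroring the open/pread approach from rdmsr.c; error handling is trimmed, it must run as root with the Linux msr driver loaded, and the MSR numbers 0x394 and 0x3B0-0x3B7 are the ones given in the post):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Read one MSR on a given logical CPU via /dev/cpu/<cpu>/msr.
 * The MSR number is passed as the file offset to pread(). */
static uint64_t rdmsr_on_cpu(int cpu, uint32_t msr)
{
    char path[64];
    uint64_t value = 0;
    snprintf(path, sizeof(path), "/dev/cpu/%d/msr", cpu);
    int fd = open(path, O_RDONLY);
    if (fd >= 0) {
        pread(fd, &value, sizeof(value), msr);
        close(fd);
    }
    return value;
}

int main(void)
{
    /* Example: uncore fixed clock counter and the 8 programmable uncore
     * counters, read from a core on the first chip (logical CPU 0 assumed). */
    printf("UNC_CLK  = %llu\n", (unsigned long long)rdmsr_on_cpu(0, 0x394));
    for (int i = 0; i < 8; i++)
        printf("UNC_CTR%d = %llu\n", i,
               (unsigned long long)rdmsr_on_cpu(0, 0x3B0 + i));
    return 0;
}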