I am trying to measure bandwidth on a HSW i7-4770K desktop machine. Since this is not a Xeon part, pcm-memory.x cannot be used to report bandwidth and I had to resort to using pcm.x. As previously noted, pcm.x only reports # bytes read and written to memory for each sampling interval. However, when I try to convert that to memory bandwidth, I am faced with a bunch of issues.
1) I'd given a sampling frequency of 1ms when running pcm.x for bandwidth plots. However, I notice that not every sample taken by pcm.x is at a 1ms interval - the software thread goes to sleep every ms, but if I look at # ACYC, some samples are 2ms apart, and others are 5ms apart. Why is this?
2) Also, the ACYC that comes as part of the system-level output in the csv, is that the # cycles summed across all threads? If my system has 8 threads, can I safely divide that number by 8 to get a wall-clock # cycles, or is the math there more complex? I ask this because otherwise, this number seems too big!
3) What is the best way to compute memory bandwidth from the output that pcm.x reports? Can I use the formula memory_bandwidth = (read_bytes + write_bytes) / (AFREQ * (1/nominal_frequency) * ACYC * (100/time_in_C0))?
4) I am also having problems with repeatability. When I run the exact same program 5 times back-to-back, most runs give similar bandwidth charts (based on assumptions #2 and #3 above), but at least one run throws a curve-ball and shows something else! Is there any consensus on best practices to measure a dependable number? How does VTune account for such variability when measuring memory bandwidth with PCM, for example?
Thanks in advance for all your help!
I will try to answer your questions. See also this blog.
1) PCM is hitting the limits of the user-mode Windows Sleep API call. Windows does not guarantee the delay with millisecond precision.
2) At the system level, ACYC is summed across all logical cores on the system. Since ACYC counts only while a core is active, you cannot use it alone to calculate wall-clock time. Please use the TIME metric for this purpose.
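3) Please try (read_bytes + write_bytes) / (TIME / nominal_frequency). TIME is reported in ticks, so TIME / nominal_frequency is the duration of the sample in seconds.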
4) Please try the new formulas and larger sampling periods, which are less sensitive to the Sleep jitter described in 1).
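
As a rough illustration of 3) and 4), here is a minimal Python sketch. It assumes you have already parsed the per-sample read bytes, write bytes and TIME (in ticks) out of the pcm.x CSV yourself. The function names and the tuple layout are hypothetical and not part of PCM, and the 3.5 GHz nominal frequency is the i7-4770K base clock, so adjust it for your part.

```python
# Hypothetical helpers: not part of PCM. Inputs are assumed to be values
# you have already extracted from the pcm.x CSV output.

def bandwidth_bytes_per_sec(read_bytes, write_bytes, time_ticks, nominal_hz):
    """Bandwidth for one sample: bytes moved divided by elapsed wall-clock time."""
    if time_ticks <= 0 or nominal_hz <= 0:
        raise ValueError("time_ticks and nominal_hz must be positive")
    # TIME is assumed to be in ticks at the nominal frequency,
    # so ticks / Hz gives seconds.
    elapsed_s = time_ticks / nominal_hz
    return (read_bytes + write_bytes) / elapsed_s


def aggregate_bandwidth(samples, nominal_hz):
    """Merge several (read_bytes, write_bytes, time_ticks) samples into one window.

    Summing bytes and ticks before dividing weights each sample by its real
    duration, so samples that ended up 2 ms or 5 ms long are handled correctly.
    """
    total_read = sum(r for r, _, _ in samples)
    total_write = sum(w for _, w, _ in samples)
    total_ticks = sum(t for _, _, t in samples)
    return bandwidth_bytes_per_sec(total_read, total_write, total_ticks, nominal_hz)


if __name__ == "__main__":
    NOMINAL_HZ = 3.5e9  # assumed nominal frequency of the i7-4770K
    # Made-up sample values, for illustration only.
    samples = [
        (2_000_000, 1_500_000, 3_500_000),
        (4_000_000, 3_000_000, 7_000_000),
    ]
    gb_per_s = aggregate_bandwidth(samples, NOMINAL_HZ) / 1e9
    print(f"{gb_per_s:.2f} GB/s")
```

Aggregating over a longer window this way should also smooth out some of the run-to-run variation, since no single irregular sample dominates the result.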