Intel® Moderncode for Parallel Architectures

PCM reporting lower than expected memory read counts

Patrick_L_ — Tue, 05 May 2015 17:43:17 GMT

I have a piece of code on which I'm running PCM (Performance Counter Monitor). It is essentially the following:

uint64_t *a, *b;
a = new uint64_t[LEN];
b = new uint64_t[LEN];
for (int i = 0; i < LEN; i++) a[i] = b[i];  // copy b into a, element by element

With LEN set to 402,653,184 (384 Mi), PCM reports 0.72 GB under READ and 6.30 GB under WRITE. Given that each array is 3 GiB, I would expect both arrays to be read (since the processor uses write-allocate), giving a READ count of about 6 GiB. I would expect only array "a" to be written back, giving a WRITE count of about 3 GiB.
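For reference, the arithmetic behind those expectations can be written out as a small sketch. The names kLen, array_bytes, expected_read_bytes and expected_write_bytes are made up for illustration, and the model assumes a plain write-allocate cache with no streaming stores and no OS overhead.

```cpp
#include <cstdint>

// Hypothetical names, for illustration only.
constexpr std::uint64_t kLen = 402653184ULL;  // 384 Mi elements, as in the post

// Size of one array of 64-bit elements, in bytes.
constexpr std::uint64_t array_bytes(std::uint64_t len) {
    return len * sizeof(std::uint64_t);
}

// Assumed model: with write-allocate, the source array is read and each
// destination cache line is also read before it is overwritten.
constexpr std::uint64_t expected_read_bytes(std::uint64_t len) {
    return 2 * array_bytes(len);
}

// Assumed model: only the destination array is written back to memory.
constexpr std::uint64_t expected_write_bytes(std::uint64_t len) {
    return array_bytes(len);
}

// Convert a byte count to GiB (2^30 bytes).
constexpr double to_gib(std::uint64_t bytes) {
    return static_cast<double>(bytes) / static_cast<double>(1ULL << 30);
}
```

With kLen this gives 3 GiB per array, 6 GiB of expected reads and 3 GiB of expected writes.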

Does anyone know why the read count is so low, and the write count is higher than expected?

Processor is Intel Core i7 940 (Nehalem).

Any help is appreciated.

Patrick

McCalpinJohn — Tue, 05 May 2015 20:45:04 GMT

Make sure that the arrays are instantiated before you use them.

Linux handles un-instantiated addresses differently than other Unix systems. If you read an address that has not previously been written to, the OS will map the address to a "zero page" that is filled with zeros. That "zero page" will be cache-contained, so the reads of b will mostly come from the cache.

The writes to a will force those pages to be instantiated, and eventually written back from the cache. However, the code that instantiates the pages is complex and obscure, and it is very hard to understand the performance counts obtained from that code path. It is certainly possible that the OS writes the pages of a to memory (e.g. zeroing each page using streaming stores) before they are read back in by the copy loop -- this would account for the doubled write traffic.

To minimize the confusion due to complex and obscure OS code, I recommend including an "initialization loop" that fills both arrays with something, then a repeated "copy loop" that copies the arrays back and forth. I sometimes set up two versions of the code -- one that copies the arrays 20 times and one that copies the arrays 10 times. Taking the difference between the counts should remove most of the confusing overhead and leave you with the memory traffic associated with the 10 extra array copies.
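A minimal sketch of that structure might look like the following. The function name run_copies and its parameters are made up for illustration, and std::vector is used instead of new[] for brevity; this is one possible reading of the suggestion, not the code used by anyone in this thread.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: instantiates both arrays, then copies them back and
// forth `copies` times. Returns a checksum of a so the work is observable.
std::uint64_t run_copies(std::size_t len, int copies) {
    // std::vector zero-fills its elements, which already touches every page;
    // the explicit loop below is kept to mirror the "initialization loop".
    std::vector<std::uint64_t> a(len), b(len);

    // Initialization loop: write every element of both arrays so that all
    // pages are instantiated before any measured copy takes place.
    for (std::size_t i = 0; i < len; ++i) {
        a[i] = 1;
        b[i] = i;
    }

    // Copy loop: even iterations copy b into a, odd iterations copy a into b.
    for (int k = 0; k < copies; ++k) {
        std::uint64_t* dst = (k % 2 == 0) ? a.data() : b.data();
        const std::uint64_t* src = (k % 2 == 0) ? b.data() : a.data();
        for (std::size_t i = 0; i < len; ++i) {
            dst[i] = src[i];
        }
    }

    // Checksum over a, so the copies cannot be discarded as dead code.
    std::uint64_t sum = 0;
    for (std::size_t i = 0; i < len; ++i) {
        sum += a[i];
    }
    return sum;
}
```

One way to use it: build a small program that calls run_copies(LEN, 10), build a second that calls run_copies(LEN, 20), run each under PCM, and subtract the READ and WRITE totals. The difference should correspond to 10 array copies. Note that an optimizing compiler may replace the inner loop with a library copy routine, which could use streaming stores and change the observed traffic, so it is worth checking the generated code.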

Patrick_L_ — Tue, 05 May 2015 21:56:32 GMT

Dr. McCalpin,

Thank you for the guidance. This is very helpful. I will plan to revisit my code snippet with the modifications you've suggested.

Regards,
Patrick L.