Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

PCM reporting lower than expected memory read counts

Patrick_L_
Beginner

I have a piece of code on which I'm running PCM (Performance Counter Monitor). It is essentially the following:

uint64_t *a, *b;
a = new uint64_t[LEN];
b = new uint64_t[LEN];
for (int i = 0; i < LEN; i++) a[i] = b[i];

With LEN set to 402,653,184 (384 Mi elements), PCM reports 0.72 GB under READ and 6.30 GB under WRITE. Given that each array is 3 GiB, I would expect both arrays to be read (since the processor uses write-allocate), giving a READ of about 6 GiB, and array "a" to be written back, giving a WRITE of about 3 GiB.
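
For reference, here is the back-of-the-envelope arithmetic behind those expectations, as a standalone check (the numbers are just LEN times 8 bytes; nothing here comes from PCM itself):

#include <cstdint>
#include <cstdio>

int main() {
    const uint64_t LEN = 402653184ULL;                               // 384 Mi elements
    const double GiB = 1024.0 * 1024.0 * 1024.0;
    const double array_bytes = static_cast<double>(LEN * sizeof(uint64_t));  // 3 GiB per array
    // For a[i] = b[i] with a write-allocate cache:
    //   reads  : b streamed in, plus a read for ownership -> 2 arrays
    //   writes : dirty lines of a written back to memory  -> 1 array
    printf("per array      : %.2f GiB\n", array_bytes / GiB);
    printf("expected read  : %.2f GiB\n", 2.0 * array_bytes / GiB);
    printf("expected write : %.2f GiB\n", 1.0 * array_bytes / GiB);
    return 0;
}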

Does anyone know why the read count is so low, and the write count is higher than expected?

Processor is Intel Core i7 940 (Nehalem).

Any help is appreciated.

Patrick

2 Replies
McCalpinJohn
Honored Contributor III

Make sure that the arrays are instantiated before you use them.  

Linux handles un-instantiated addresses differently from other Unix systems. If you read an address that has not previously been written to, the OS will map that address to a "zero page" filled with zeros. That "zero page" will be cache-contained, so the reads of b will mostly come from the cache.

The writes to a will force those pages to be instantiated and eventually written back from the cache. However, the kernel code that instantiates the pages is complex and obscure, and it is very hard to interpret the performance counts obtained from that code path. It is certainly possible that the kernel writes the pages of a to memory (e.g., zeroing them with streaming stores) before your stores cause them to be read back in, which would account for the doubled write traffic.

To minimize the confusion due to complex and obscure OS code, I recommend including an "initialization loop" that fills both arrays with something, followed by a repeated "copy loop" that copies the arrays back and forth. I sometimes set up two versions of the code: one that copies the arrays 20 times and one that copies them 10 times. Taking the difference between the two sets of counts should remove most of the confusing overhead and leave you with the memory traffic associated with the 10 extra array copies.
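
Something along these lines would do it (a rough sketch only; the repeat count comes from the command line, and LEN and the initialization values are placeholders):

#include <cstdint>
#include <cstdlib>

int main(int argc, char **argv) {
    const uint64_t LEN = 402653184ULL;     // 384 Mi elements, 3 GiB per array
    // Repeat count from the command line: run once with 10 and once with 20,
    // then subtract the two PCM reports to isolate the extra 10 copies.
    const int repeats = (argc > 1) ? atoi(argv[1]) : 10;

    uint64_t *a = new uint64_t[LEN];
    uint64_t *b = new uint64_t[LEN];

    // Initialization loop: write every element so all pages are instantiated
    // before the copies you actually want to measure.
    for (uint64_t i = 0; i < LEN; i++) { a[i] = i; b[i] = 2 * i; }

    // Copy loop: move the data back and forth between the two arrays.
    for (int r = 0; r < repeats; r++) {
        if (r % 2 == 0)
            for (uint64_t i = 0; i < LEN; i++) a[i] = b[i];
        else
            for (uint64_t i = 0; i < LEN; i++) b[i] = a[i];
    }

    // Touch the results so the compiler cannot discard the copies.
    volatile uint64_t sink = a[0] + b[LEN - 1];
    (void)sink;

    delete[] a;
    delete[] b;
    return 0;
}

Build it once, then run it under PCM with the argument set to 10 and again with 20; the difference between the two sets of counts should correspond to the 10 extra array copies.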

Patrick_L_
Beginner

Dr. McCalpin,

Thank you for the guidance; this is very helpful. I will revisit my code with the modifications you've suggested.

Regards,
Patrick L.
