L1D Hw Pref activity in SB

perfwise · ‎10-19-2011

Hi,

I'm trying to measure, in some detail, how my SB is hardware prefetching from the L1D. The documentation for SB is somewhat lacking compared to the detail available for NH. NH had the following documentation for HW prefetches from the L1D:

http://software.intel.com/sites/products/documentation/hpc/amplifierxe/en-us/lin/ug_docs/reference/index.htm#snb/events/about_front_end_performance_tuning_events.html

while on SB it only uses 1 of the unit masks (2) for misses. I'm also measuring unit masks 1 and 4, which appear to work.

Can someone confirm that measuring unit mask 2 for PMC 0x4E measures ALL hardware prefetch misses to the L1D? If there's any more detail that can be provided as to whether unit mask 0x01 and 0x04 work and what they measure that's appreciated as well.
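For anyone reproducing this measurement, a raw core PMU event is normally specified by packing the event select into bits 0-7 and the unit mask into bits 8-15, which matches the low bits of IA32_PERFEVTSELx. The sketch below builds that value for event 0x4E; the helper names are hypothetical, and it says nothing about whether unit masks 0x01 and 0x04 count anything meaningful on SB, which is the open question here.

```c
#include <stdint.h>

/* Hypothetical helper: pack an event select and a unit mask into the
 * low 16 bits of a raw event code (event in bits 0-7, umask in 8-15). */
uint64_t raw_event_config(uint8_t event_select, uint8_t umask)
{
    return ((uint64_t)umask << 8) | (uint64_t)event_select;
}

/* Event 0x4E with unit mask 0x02, the one mask documented for SB. */
uint64_t hw_pre_req_dl1_miss(void)
{
    return raw_event_config(0x4E, 0x02);
}
```

Assuming Linux perf is the collection tool, the resulting value would be passed as a raw event, for example `perf stat -e r024e`, with `r014e` and `r044e` for the two undocumented masks.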

Thanks

perfwise

Patrick_F_Intel1 · ‎10-24-2011

Hello Perfwise,
Usually when an event is dropped from a chip the reason is that the event doesn't count correctly (for some case) or was not verified to work or no longer applies.
I'll check on the reason for these 2 masks.
Pat

SergeyKostrov · ‎11-26-2011

Hi Patrick,

What is the best tool to measure statistics for all cache lines on Intel's CPUs?

Best regards,
Sergey

Patrick_F_Intel1 · ‎11-26-2011

Hello Sergey,
Usually we would use VTune Amplifier (www.intel.com/software/products/vtune/) to measure cache statistics.
Depending on what you want to measure, other tools (such as linux perf) may provide the info you need.
What specifically do you want to measure?
Pat

SergeyKostrov · ‎11-28-2011

Hi Patrick,

>>...What specifically do you want to measure?

I've added support for software prefetch in a couple of algorithms with the 'prefetcht1' instruction. These algorithms are working in a real-time environment and I'd like to confirm that a performance increase from ~0.5% to ~1.5% is not "noise".

What do you think?

Best regards,
Sergey

SergeyKostrov · ‎11-28-2011

You could also review our discussion about '_mm_prefetch' at:

http://software.intel.com/en-us/forums/showthread.php?t=46284

Best regards,
Sergey

perfwise · ‎11-29-2011

Patrick,

One question. On HW_PRE_REQ.DL1_MISS:

http://software.intel.com/sites/products/documentation/hpc/amplifierxe/en-us/lin/ug_docs/reference/index.htm#snb/events/about_front_end_performance_tuning_events.html

for this event, can you verify that it increments if a HW PREF request hit upon an LFB entry previously allocated by a DEMAND REQ load/store or a previous HW PREF?

I measure the allocations into the L1D, but I want to get a good idea of what % of those allocations are associated with HW PREF activity.

Thanks..

perfwise

Patrick_F_Intel1 · ‎11-29-2011

Hello Sergey,
For a gain of 0.5% to 1.5%, if this is within the measurement noise, I worry whether the additional code complexity is worth the extra performance.
While one can use events to try and measure the effectiveness of the prefetch operation, the main thing I would look at is:
Does the code run faster with the prefetch operation than without the prefetch?
If this is in the noise then I wouldn't put it in the code.
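One rough way to decide whether a gain of this size is "in the noise" is to time each variant several times and compare the gap between the mean run times with the run-to-run spread. The sketch below is only an illustration under stated assumptions: the function names, the choice of a simple standard-error test, and the threshold `k` are all hypothetical, and a real-time workload may need a more careful statistical treatment.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n timing samples. */
double sample_mean(const double *x, size_t n)
{
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += x[i];
    return sum / (double)n;
}

/* Sample variance (n - 1 in the denominator). */
double sample_variance(const double *x, size_t n)
{
    assert(n > 1);
    double m = sample_mean(x, n);
    double ss = 0.0;
    for (size_t i = 0; i < n; ++i)
        ss += (x[i] - m) * (x[i] - m);
    return ss / (double)(n - 1);
}

/* Returns 1 when the prefetch variant is faster on average AND the gap
 * between the means exceeds k standard errors of that gap, else 0.
 * Squared quantities are compared so no square root is needed. */
int speedup_exceeds_noise(const double *base, size_t nb,
                          const double *pref, size_t np, double k)
{
    double diff = sample_mean(base, nb) - sample_mean(pref, np);
    double se2 = sample_variance(base, nb) / (double)nb
               + sample_variance(pref, np) / (double)np;
    return diff > 0.0 && diff * diff > k * k * se2;
}
```

With only a handful of runs per variant, a 1% difference can easily fail this kind of check, so collecting more samples is usually the first step before reaching for hardware counters.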

a little random thought here...
What is that saying about optimizing code?... from http://en.wikipedia.org/wiki/Program_optimization
"The First Rule of Program Optimization: Don't do it. The Second Rule of Program Optimization (for experts only!): Don't do it yet." - Michael A. Jackson
I know it sounds like something the guy on a performance optimization forum ought not be saying but the idea is that if the optimization doesn't pay off big enough and it increases code complexity then it should be avoided.

One can ask whether the prefetch is doing what you want. I would probably start by measuring the memory bandwidth of the 2 cases. I would verify that whatever bw counter I use actually counts the memory fetched using prefetcht1. I'll probably run these tests tomorrow and get back to you.
Pat

SergeyKostrov · ‎11-30-2011

>>...the additional code complexity is worth the extra performance...

Code complexity? This is what it looks like at the "core":

...

template < class T > inline _RTdeclspec_naked RTvoid HrtDataPrefetchT0( T *ptAddress )
{
    _asm
    {
        prefetcht0 [ ptAddress ]
        ret
    }
};

#undef CrtDataPrefetch
#define CrtDataPrefetch HrtDataPrefetchT0
...

and, there are a couple of places ( just a couple! ) where I used it like:

...
CrtDataPrefetch( ptC );
CrtDataPrefetch( ptB );
CrtDataPrefetch( ptA );
...

Nothing else.
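For comparison, the same kind of hint can be written with the `_mm_prefetch` intrinsic mentioned earlier in the thread, which lets the compiler emit the prefetch inline instead of going through an assembly wrapper. This is a minimal sketch, not the code used in the project above: `PREFETCH_T0`, `PREFETCH_DISTANCE`, and `sum_with_prefetch` are hypothetical names, and the distance of 16 `int` elements (one 64-byte line ahead, assuming 4-byte `int`) is an untuned guess.

```c
#include <stddef.h>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <xmmintrin.h>
/* T0 hint: request the line into all cache levels. */
#define PREFETCH_T0(p) _mm_prefetch((const char *)(p), _MM_HINT_T0)
#else
/* No-op fallback so the sketch still compiles on non-x86 targets. */
#define PREFETCH_T0(p) ((void)(p))
#endif

/* Hypothetical tuning value: how many elements ahead to prefetch. */
#define PREFETCH_DISTANCE 16

/* Sum an array while hinting the element PREFETCH_DISTANCE ahead. */
long sum_with_prefetch(const int *data, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; ++i) {
        if (i + PREFETCH_DISTANCE < n)
            PREFETCH_T0(&data[i + PREFETCH_DISTANCE]);
        total += data[i];
    }
    return total;
}
```

For a simple sequential walk like this one the hardware prefetchers will usually keep up on their own, so any benefit from the software hint has to be measured rather than assumed.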

>>...Does the code run faster with the prefetch operation than without the prefetch?..

As I said, the performance increase I measured without VTune is from 0.5% to 1.5% and, of course, it has to be confirmed. Unfortunately, I don't have VTune, but I could create a generic test case for a Windows platform, and I wonder if you would be able to compile and evaluate it to get some numbers?

>>...The First Rule of Program Optimization: Don't do it...

Sorry about this, but I don't take that statement seriously and I consider it a "joke".

>>...bw counter...

What is it?

>>...I'll probably run these tests tomorrow and get back to you...

Excellent! Thank you in advance!

Best regards,
Sergey

SergeyKostrov · ‎12-05-2011

>>...I'll probably run these tests tomorrow and get back to you...

Hi Patrick,

Did you have a chance to run the tests?

Best regards,
Sergey

Patrick_F_Intel1 · ‎12-08-2011

Hello Sergey,
Sorry for the delay in responding. I have a lot of things to wrap up before the end of the year.
The information below is for Sandy Bridge processors.

You can use uncore event UNC_CBO_CACHE_LOOKUP.READ_I (uncore event code 0x34, umask 0x18) to measure memory bandwidth due to prefetchnta, prefetcht* or any reads which miss the LLC.

char big_array[big_number];
char a[1] = {0};
for(i=0; i < big_number; i+= 64) { a[0] += big_array[i]; } // one read per 64-byte line, so we are just fetching lines from memory.

Uncore event UNC_IMPH_CBO_TRK_REQUEST.WRITES (uncore event code 0x81, umask 0x20) counts memory read-for-ownership (RFO) operations.
RFO operations are like below.

char big_array[big_number];
char a[1] = {1};
for(i=0; i < big_number; i+= 64) { big_array[i] = a[0]; } // one write per 64-byte line
Before a line can be written to, it is first brought into cache with a 'read for ownership' operation.

Uncore event UNC_IMPH_CBO_TRK_REQUEST.EVICTIONS (uncore event code 0x81, umask 0x80) counts memory writeback operations. Once the above RFO loop fills up the LLC with dirty cache lines, subsequent RFOs cause dirty lines to be evicted from the LLC.
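Once the three counts are collected over a timed interval, they can be turned into a bandwidth figure. The sketch below assumes that each counted read, RFO, and writeback corresponds to exactly one 64-byte cache line transfer; that assumption, along with the function and parameter names, is illustrative, and reading the uncore counters themselves is not shown.

```c
#include <assert.h>
#include <stdint.h>

/* Cache line size assumed for the conversion. */
#define CACHE_LINE_BYTES 64u

/* Convert a count of cache line transfers into bytes. */
uint64_t lines_to_bytes(uint64_t line_count)
{
    return line_count * CACHE_LINE_BYTES;
}

/* Estimated memory traffic in bytes per second, given counts of
 * LLC-missing reads, RFOs, and writebacks over `seconds` of run time. */
double memory_bandwidth_bytes_per_sec(uint64_t reads, uint64_t rfos,
                                      uint64_t writebacks, double seconds)
{
    assert(seconds > 0.0);
    uint64_t bytes = lines_to_bytes(reads + rfos + writebacks);
    return (double)bytes / seconds;
}
```

Comparing this figure for the runs with and without the prefetch instruction shows whether the prefetches are actually pulling extra lines from memory.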

Hope this helps.
Pat

SergeyKostrov · ‎12-09-2011

Thank you, Patrick!