Usually when an event is dropped from a chip, the reason is that the event doesn't count correctly (in some cases), was not verified to work, or no longer applies.
I'll check on the reason for these 2 masks.
Pat
What is the best tool to measure statistics for all cache lines on Intel's CPUs?
Best regards,
Sergey
Usually we would use VTune Amplifier (www.intel.com/software/products/vtune/) to measure cache statistics.
Depending on what you want to measure, other tools (such as linux perf) may provide the info you need.
What specifically do you want to measure?
Pat
>>...What specifically do you want to measure?
I've added support for a software prefetch in a couple of algorithms using the 'prefetcht1' instruction. These algorithms work in a real-time environment, and I'd like to confirm that a performance increase of ~0.5% to ~1.5% is not just "noise".
What do you think?
Best regards,
Sergey
http://software.intel.com/en-us/forums/showthread.php?t=46284
Best regards,
Sergey
Hello Sergey,
For variances of 0.5% to 1.5%, and if this is within the measurement noise, I worry whether the additional code complexity is worth the extra performance.
While one can use events to try and measure the effectiveness of the prefetch operation, the main thing I would look at is:
Does the code run faster with the prefetch operation than without the prefetch?
If this is in the noise then I wouldn't put it in the code.
a little random thought here...
What is that saying about optimizing code?... from http://en.wikipedia.org/wiki/Program_optimization
"The First Rule of Program Optimization: Don't do it. The Second Rule of Program Optimization (for experts only!): Don't do it yet." - Michael A. Jackson
I know it sounds like something a guy on a performance optimization forum ought not to be saying, but the idea is that if the optimization doesn't pay off big enough and it increases code complexity, then it should be avoided.
One can ask whether the prefetch is doing what you want. I would probably start by measuring the memory bandwidth of the 2 cases. I would verify that whatever bw counter I use actually counts the memory fetched using prefetcht1. I'll probably run these tests tomorrow and get back to you.
Pat
Code complexity? This is how it looks at the "core":
...
template < class T > inline _RTdeclspec_naked RTvoid HrtDataPrefetchT0( T *ptAddress )
{
    _asm
    {
        prefetcht0 [ ptAddress ]
        ret
    }
}
#undef CrtDataPrefetch
#define CrtDataPrefetch HrtDataPrefetchT0
...
and there are a couple of places ( just a couple! ) where I used it like:
...
CrtDataPrefetch( ptC );
CrtDataPrefetch( ptB );
CrtDataPrefetch( ptA );
...
Nothing else.
>>...Does the code run faster with the prefetch operation than without the prefetch?..
As I said, the performance increase I measured without VTune is from 0.5% to 1.5% and, of course, it has to be confirmed. Unfortunately, I don't have VTune, but I could create a generic test case for a Windows platform, and I wonder if you would be able to compile and evaluate it to get some numbers?
>>...The First Rule of Program Optimization: Don't do it...
Sorry about this, but I don't take that statement seriously; I consider it a "joke".
>>...bw counter...
What is it?
>>...I'll probably run these tests tomorrow and get back to you...
Excellent! Thank you in advance!
Best regards,
Sergey
Hi Patrick,
Did you have a chance to run the tests?
Best regards,
Sergey
Sorry for the delay in responding. I have a lot of things to wrap up before the end of the year.
Below is for Sandy Bridge processors.
You can use uncore event UNC_CBO_CACHE_LOOKUP.READ_I (uncore event code 0x34, umask 0x18) to measure memory bandwidth due to prefetchnta, prefetcht* or any reads which miss the LLC.
char big_array[big_number];
char a[1];
for (i = 0; i < big_number; i += 64) { a[0] += big_array[i]; } // so we are just fetching lines from memory
Uncore event UNC_IMPH_CBO_TRK_REQUEST.WRITES (uncore event code 0x81, umask 0x20) counts memory read-for-ownership (RFO) operations.
RFO operations look like the loop below.
char big_array[big_number];
char a[1] = { 1 };
for (i = 0; i < big_number; i += 64) { big_array[i] = a[0]; }
Before a line can be written to, it is first brought into cache with a 'read for ownership' operation.
Uncore event UNC_IMPH_CBO_TRK_REQUEST.EVICTIONS (uncore event code 0x81, umask 0x80) counts memory writeback operations. Once the above RFO loop fills up the LLC with dirty cache lines, subsequent RFOs cause dirty lines to be evicted from the LLC.
Hope this helps.
Pat