Hi all,
I am trying to use VTune Amplifier (Linux version) to profile memory access latency. To get familiar with it, I am profiling a toy program that just loads a big array of data. I use the command-line version like this:
amplxe-cl -collect-with runsa -knob event-config=MEM_TRANS_RETIRED.LOAD_LATENCY_GT_32,MEM_TRANS_RETIRED.LOAD_LATENCY_GT_64 ./load
The result I get is the following:
============================================================================
CPU
---
Parameter          r000runsa
-----------------  -------------------------------
Name               Intel(R) Xeon(R) E5v2 processor
Frequency          2394229995
Logical CPU Count  48
Summary
-------
Elapsed Time: 7.757
CPU Usage: 1.000
Event summary
-------------
Hardware Event Type                   Hardware Event Count:Self  Hardware Event Sample Count:Self  Events Per Sample
------------------------------------  -------------------------  --------------------------------  -----------------
CPU_CLK_UNHALTED.REF_TSC              18538027807                9269                              2000003
MEM_TRANS_RETIRED.LOAD_LATENCY_GT_32  0                          0                                 100007
MEM_TRANS_RETIRED.LOAD_LATENCY_GT_64  24036                      6                                 2003
amplxe: Executing actions 100 % done
=======================================================================
According to the description of the MEM_TRANS_RETIRED.LOAD_LATENCY_GT_* events, the count of *_GT_32 must be greater than or equal to the count of *_GT_64. In this case it is not, and this behavior is reproducible.
I checked the errata in the specification update and came across entry BT241, which states that "The affected events may undercount, resulting in inaccurate memory profiles"; the list of affected events includes MEM_TRANS_RETIRED.LOAD_LATENCY.
Can somebody explain why the count of MEM_TRANS_RETIRED.LOAD_LATENCY_GT_32 is less than MEM_TRANS_RETIRED.LOAD_LATENCY_GT_64 please?
Thank you,
Best Regards, ARam
Could this be due to the larger sample-after value (SAV) of MEM_TRANS_RETIRED.LOAD_LATENCY_GT_32? There are only 6 samples for MEM_TRANS_RETIRED.LOAD_LATENCY_GT_64 and 0 samples for MEM_TRANS_RETIRED.LOAD_LATENCY_GT_32.
I recommend trying:
amplxe-cl -collect-with runsa -knob event-config=MEM_TRANS_RETIRED.LOAD_LATENCY_GT_32:sa=2000,MEM_TRANS_RETIRED.LOAD_LATENCY_GT_64:sa=2000 ./load
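As a rough illustration of why a large SAV can yield zero samples: a sample is taken only each time the counter reaches the sample-after value, so an event that fires fewer than SAV times during the run produces no samples at all. The sketch below uses a simplified model (samples = floor(events / SAV)), not the actual collector logic; the function name and the event total of 50000 are hypothetical and chosen only for illustration.

```python
def expected_samples(event_count: int, sav: int) -> int:
    """Simplified model: one sample each time the counter reaches `sav` events."""
    if sav <= 0:
        raise ValueError("sav must be positive")
    return event_count // sav


# Hypothetical total of qualifying loads during the run (not taken from the thread).
hypothetical_events = 50000

print(expected_samples(hypothetical_events, 100007))  # default SAV seen for GT_32
print(expected_samples(hypothetical_events, 2000))    # SAV set with :sa=2000
```

Under this model the default SAV of 100007 gives no samples for such a run, while :sa=2000 gives a usable number.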
Peter, thank you for your response.
I wasn't aware of this sample-after value (SAV) parameter. A high default SAV indeed explains why GT_32 is 0.
I tried your recommendation and the results are:
Event summary
-------------
Hardware Event Type                   Hardware Event Count:Self  Hardware Event Sample Count:Self  Events Per Sample
------------------------------------  -------------------------  --------------------------------  -----------------
CPU_CLK_UNHALTED.REF_TSC              19358029037                9679                              2000003
MEM_TRANS_RETIRED.LOAD_LATENCY_GT_32  88000                      22                                2000
MEM_TRANS_RETIRED.LOAD_LATENCY_GT_64  12000                      3                                 2000
amplxe: Executing actions 100 % done
Much better. However, I would expect the total number of events to equal samples * events_per_sample, yet the count I get for the two LATENCY events is twice that. Why is that?
Thanks.
I explained why the MEM_TRANS_RETIRED.LOAD_LATENCY_GT_32 count was zero: with its SAV of 100007, no sample was captured. That does not mean the event didn't occur.
You raise a good question: why do the counts equal 2 * SAV * samples? I think the reason is that the two events sometimes occur at the same time, but VTune can only record one of them at a time. If you profile the events separately, you will get "counts = SAV * samples".
Hi Peter,
I wrote
"A high default SAV number indeed explains why GT_32 is 0."
I meant to say that I understood the explanation in your first post. I apologize for the miscommunication; my English skills are not that good.
About counts = 2 * SAV * samples: yes, you are correct. If I profile only one LATENCY event, then counts = SAV * samples. To be more precise, counts = NUM_OF_LATENCY_EVENTS * SAV * samples, but the end result (total counts) stays more or less the same whether I collect 1, 2 or 3 events simultaneously, so the end result is accurate.
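A quick sketch of that arithmetic, checked against the LATENCY rows reported earlier in this thread (both runs collected two LATENCY events at once). The function name is hypothetical, and the formula is only the empirical relation observed here, not a documented VTune computation.

```python
def estimated_count(samples: int, sav: int, num_latency_events: int = 1) -> int:
    """Empirical relation from this thread: counts = NUM_OF_LATENCY_EVENTS * SAV * samples."""
    return num_latency_events * sav * samples


# (event, reported count, samples, SAV) copied from the two result tables above.
rows = [
    ("GT_64, default SAV", 24036, 6, 2003),
    ("GT_32, sa=2000", 88000, 22, 2000),
    ("GT_64, sa=2000", 12000, 3, 2000),
]

for label, reported, samples, sav in rows:
    computed = estimated_count(samples, sav, num_latency_events=2)
    print(label, computed, computed == reported)
```

Every row matches with num_latency_events=2, whereas CPU_CLK_UNHALTED.REF_TSC matches plain SAV * samples (9269 * 2000003 = 18538027807).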
Thanks again, for your answers.
Regards,
Aram
@Aram
How many cores are in your system? Each core will generate this sample, after the SAV number of events.
Hello MrAnderson,
The system is a two-socket IVB-EP machine. Each package has 12 physical cores. For this experiment I have HT enabled, so 48 logical processors are visible in total.
I can't see a correlation between the number of available cores on my system and the following formula:
COUNTS = NUM_OF_LATENCY_EVENTS * SAV * SAMPLES.
Additionally, the profiled program was single-threaded. I think the explanation is related to the mechanism by which the MEM_TRANS_RETIRED.LOAD_LATENCY_GT_* events are collected.
Currently I'm reading the documentation to figure this out.
Hi Aram:
Another thought, which I haven't taken the time to verify: you might check the documentation (Software Development Manual) for the exact processor family. Sometimes the hardware counters are known to "double count", which is outside VTune Amplifier XE's control. Also, have you checked out the Tuning Guides? There might be some guidance on these counters in there.
Hi,
Thank you for the pointers you provided; I am checking them out.
In the related article from the above post, the author claims that only one LATENCY event can be sampled in a given time period, although the explanation for this limitation is not clear to me yet. If I collect LATENCY_GT_4 and LATENCY_GT_64 at the same time and a load instruction with 100 cycles of latency is encountered, it seems perfectly reasonable to me that both GT_4 and GT_64 must be incremented. I ran a few tests using a single event and multiple events (2-3), and I couldn't find any discrepancy.
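To make that expectation concrete, here is a minimal sketch of the semantics I have in mind. It assumes, based only on the "GT" in the event names, that a load qualifies for a threshold when its latency is strictly greater than it; the helper name is hypothetical, and this models the expectation, not how the hardware actually samples.

```python
def thresholds_exceeded(latency_cycles: int, thresholds: list[int]) -> list[int]:
    """Return every LOAD_LATENCY_GT_<t> threshold that a load of this latency exceeds."""
    return [t for t in thresholds if latency_cycles > t]


# A 100-cycle load would qualify for both GT_4 and GT_64.
print(thresholds_exceeded(100, [4, 64]))
# A 40-cycle load would qualify for GT_4 and GT_32, but not GT_64.
print(thresholds_exceeded(40, [4, 32, 64]))
```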
Cheers