Hi all,
I am trying to use VTune Amplifier (Linux version) to profile memory access latency. To get familiar with it, I am profiling a toy program that just loads a big array of data. I use the command-line version like this:
amplxe-cl -collect-with runsa -knob event-config=MEM_TRANS_RETIRED.LOAD_LATENCY_GT_32,MEM_TRANS_RETIRED.LOAD_LATENCY_GT_64 ./load
The result I get is the following:
============================================================================
CPU
---
Parameter r000runsa
----------------- -------------------------------
Name Intel(R) Xeon(R) E5v2 processor
Frequency 2394229995
Logical CPU Count 48
Summary
-------
Elapsed Time: 7.757
CPU Usage: 1.000
Event summary
-------------
Hardware Event Type                   Hardware Event Count:Self  Hardware Event Sample Count:Self  Events Per Sample
------------------------------------  -------------------------  --------------------------------  -----------------
CPU_CLK_UNHALTED.REF_TSC              18538027807                9269                              2000003
MEM_TRANS_RETIRED.LOAD_LATENCY_GT_32  0                          0                                 100007
MEM_TRANS_RETIRED.LOAD_LATENCY_GT_64  24036                      6                                 2003
amplxe: Executing actions 100 % done
=======================================================================
According to the description of the MEM_TRANS_RETIRED.LOAD_LATENCY_GT_* events, the count of *_GT_32 must be greater than or equal to the count of *_GT_64. In this case it is not, and the behavior is reproducible.
I checked the errata in the specification update and found erratum BT241, which states that "The affected events may undercount, resulting in inaccurate memory profiles"; its list of affected events includes MEM_TRANS_RETIRED.LOAD_LATENCY.
Can somebody please explain why the count of MEM_TRANS_RETIRED.LOAD_LATENCY_GT_32 is less than that of MEM_TRANS_RETIRED.LOAD_LATENCY_GT_64?
Thank you,
Best Regards, ARam
Hardware Event Type                   Hardware Event Count:Self  Hardware Event Sample Count:Self  Events Per Sample
------------------------------------  -------------------------  --------------------------------  -----------------
CPU_CLK_UNHALTED.REF_TSC              18538027807                9269                              2000003
MEM_TRANS_RETIRED.LOAD_LATENCY_GT_32  0                          0                                 100007
MEM_TRANS_RETIRED.LOAD_LATENCY_GT_64  24036                      6                                 2003
Could this be due to the larger SAV (sample-after value) of MEM_TRANS_RETIRED.LOAD_LATENCY_GT_32? There are only 6 samples for MEM_TRANS_RETIRED.LOAD_LATENCY_GT_64 and 0 samples for MEM_TRANS_RETIRED.LOAD_LATENCY_GT_32.
I recommend trying:
amplxe-cl -collect-with runsa -knob event-config=MEM_TRANS_RETIRED.LOAD_LATENCY_GT_32:sa=2000,MEM_TRANS_RETIRED.LOAD_LATENCY_GT_64:sa=2000 ./load
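The effect of the sample-after value can be illustrated with a small sketch. It assumes a simplified model in which a counter produces one sample every SAV occurrences and the reported count is reconstructed as samples * SAV; the function names and the figure of 50,000 events are hypothetical and not taken from the measurements in this thread.

```python
def samples_taken(true_events: int, sav: int) -> int:
    """Samples produced under the simplified model: one per `sav` occurrences."""
    return true_events // sav


def estimated_count(samples: int, sav: int) -> int:
    """Event count that can be reconstructed from the samples alone."""
    return samples * sav


# Hypothetical number of qualifying loads (an assumption, not a measured value).
true_events = 50_000

for sav in (100_007, 2_000):
    n = samples_taken(true_events, sav)
    print(f"SAV={sav}: samples={n}, estimated count={estimated_count(n, sav)}")
```

Under this model, an event that occurs fewer times than its SAV yields zero samples and therefore a reported count of zero, even though the event did occur.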
Peter, thank you for your response.
I wasn't aware of the sample-after-value (SAV) parameter. A high default SAV indeed explains why GT_32 is 0.
I tried your recommendation, and the results are:
Event summary
-------------
Hardware Event Type                   Hardware Event Count:Self  Hardware Event Sample Count:Self  Events Per Sample
------------------------------------  -------------------------  --------------------------------  -----------------
CPU_CLK_UNHALTED.REF_TSC              19358029037                9679                              2000003
MEM_TRANS_RETIRED.LOAD_LATENCY_GT_32  88000                      22                                2000
MEM_TRANS_RETIRED.LOAD_LATENCY_GT_64  12000                      3                                 2000
amplxe: Executing actions 100 % done
Much better. However, I would expect the total number of events to equal samples * events_per_sample, yet the count reported for the two LATENCY events is twice that. Why is that?
Thanks.
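The discrepancy can be checked directly against the table above. The sketch below is plain arithmetic on the reported numbers; the helper name count_ratio is made up for illustration.

```python
# (event name, reported count, sample count, events per sample) from the run above
rows = [
    ("CPU_CLK_UNHALTED.REF_TSC", 19358029037, 9679, 2000003),
    ("MEM_TRANS_RETIRED.LOAD_LATENCY_GT_32", 88000, 22, 2000),
    ("MEM_TRANS_RETIRED.LOAD_LATENCY_GT_64", 12000, 3, 2000),
]


def count_ratio(count: int, samples: int, sav: int) -> float:
    """Ratio of the reported count to samples * events_per_sample."""
    return count / (samples * sav)


ratios = {name: count_ratio(count, samples, sav) for name, count, samples, sav in rows}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}")
```

The clock event comes out at a ratio of 1.0, while both LATENCY events come out at 2.0, so the factor of two affects only the two latency events collected together.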
I explained why the MEM_TRANS_RETIRED.LOAD_LATENCY_GT_32 count was zero: with its SAV of 100007, no sample was captured. That does not mean the event did not occur.
Good question about why the counts equal 2 * SAV * samples. I think the reason is that two events sometimes occurred at the same time, but VTune can record only one event at a time. If you profile these events separately, you will get counts = SAV * samples.
Hi Peter,
I wrote:
"A high default SAV number indeed explains why GT_32 is 0."
By that I meant that I understood the explanation in your first post. I apologize for the miscommunication; my English is not that good.
About counts = 2 * SAV * samples: yes, you are correct. If I profile only one LATENCY event, then counts = SAV * samples. To be more precise, counts = NUM_OF_LATENCY_EVENTS * SAV * samples, but the total count stays (more or less) the same whether I collect 1, 2, or 3 events simultaneously, so the end result is accurate.
Thanks again for your answers.
Regards,
Aram
@Aram
How many cores are in your system? Each core generates its own sample after SAV occurrences of the event.
Hello MrAnderson,
The system is a two-socket IVB-EP machine. Each package has 12 physical cores. For this experiment I have HT enabled, so 48 logical processors are visible in total.
I can't see a correlation between the number of cores on my system and the following formula:
COUNTS = NUM_OF_LATENCY_EVENTS * SAV * SAMPLES.
Additionally, the profiled program was single-threaded. I think the explanation is related to the mechanism by which the MEM_TRANS_RETIRED.LOAD_LATENCY_GT_* events are collected.
I am currently reading the documentation to figure this out.
Hi Aram:
Another thought, which I haven't taken the time to verify: you might check the documentation (Software Development Manual) for the exact processor family. Sometimes the hardware counters are known to "double count", which is outside VTune Amplifier XE's control. Also, have you checked out the Tuning Guides? There might be some guidance on these counters in there.
Hi,
Thank you for the pointers; I am checking them out.
In the related article from the above post, the author claims that only one LATENCY event can be sampled in a given time period, although the reason for this limitation is not yet clear to me. If I collect LATENCY_GT_4 and LATENCY_GT_64 at the same time and a load instruction with a latency of 100 cycles is encountered, it seems perfectly reasonable to me that both GT_4 and GT_64 should be incremented. I ran a few tests using a single event and multiple events (2 or 3), and I couldn't find any discrepancy.
Cheers
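The expectation described above, that one slow load should increment every threshold it exceeds, can be written as a small conceptual sketch. It is only a model of that expectation: it does not represent how the hardware or VTune actually collects these events, and the latency values and the function name threshold_counts are hypothetical.

```python
def threshold_counts(latencies, thresholds):
    """For each threshold, count the loads whose latency exceeds it."""
    return {t: sum(1 for lat in latencies if lat > t) for t in thresholds}


# Hypothetical load latencies in cycles (assumed values, not measurements).
latencies = [3, 10, 40, 70, 100, 250]

counts = threshold_counts(latencies, [4, 32, 64])
print(counts)  # {4: 5, 32: 4, 64: 3}
```

In this model the 100-cycle load is counted under GT_4, GT_32, and GT_64 alike, so the count for a lower threshold can never be smaller than the count for a higher one.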
