BUG: mem_load_l3_miss_retired.remote_dram, mem_load_l3_miss_retired.local_dram

panda__akash
Novice

The performance counters mem_load_l3_miss_retired.remote_dram and mem_load_l3_miss_retired.local_dram give wrong numbers when the processes access shared memory.

ArunJ_Intel
Moderator

Hi Akash,


We are working on analyzing various hardware events to find the root cause of the issue. We will update you once we reach a conclusion.


Thanks

Arun


Eric_M_Intel2
Employee

Hello

I have started looking at the issue with ArunJ, who has created a system that can reproduce the issue here at Intel.  We are looking at the events to determine what is going on.

My unconfirmed hypothesis is that  mem_load_l3_miss_retired.remote_dram is firing when… bin_sharedmem-set reads data which was loaded into the remote cache at some point by bin_sharedmem-get. 

The events mem_load_l3_miss_retired.remote_dram and mem_load_l3_miss_retired.local_dram depend on a communication protocol, via the L3 back to the core, that tells the core what supplied the data. That communication protocol is also used by the “Offcore” events… described in the IA32 SDM Volume 3B Chapter 18 (https://software.intel.com/content/www/us/en/develop/articles/intel-sdm.html#nine-volume). With the offcore events I can get many more specific details, but to see them I have to create an MSR mask that selects them. With this process we are trying to diagnose what is happening.

Two subsets of the MSR mask are firing unexpectedly in this scenario.

REMOTE_HOP0 is firing, which is supposed to indicate crossing a sub-NUMA boundary on the same socket. I didn't expect that to fire in this scenario, as the system does not appear to have sub-NUMA clustering enabled. I am trying to figure out why. Am I testing wrong? Or is the event misfiring? We are working on that.

SNOOP_MISS is firing, as you have a secondary binary loading the data from DRAM into the remote socket's cache. mem_load_l3_miss_retired.remote_dram might be counting REMOTE SNOOP_MISS, even if the data is ultimately supplied by local DRAM.

When we have more info I will get back to you.

panda__akash
Novice

Thanks Eric for looking into it. 

In my opinion, REMOTE_HOP0 is also misfiring here. I do not expect it to fire in this situation either.

As you say, even if SNOOP_MISS is firing, should mem_load_l3_miss_retired.remote_dram count the snoop misses? I think it should not.

Let me know if I can provide more details.

Regards, 
Akash 

Eric_M_Intel2
Employee

Hello Akash

Can you please tell me if the following workaround will work for you?

perf stat -a -C $CPU1 -x, -o $OUTFILE1 --append -e cpu/event=0xbb,umask=0x1,offcore_rsp=0x07B0000001,name=offcore_resp_DEMAND_DATA_RD.REMOTE_DRAM/,cpu/event=0xbb,umask=0x1,offcore_rsp=0x079C000001,name=offcore_resp_DEMAND_DATA_RD.LOCAL_DRAM/ -p $PID1 -I 5000
perf stat -a -C $CPU2 -x, -o $OUTFILE2 --append -e cpu/event=0xbb,umask=0x1,offcore_rsp=0x07B0000001,name=offcore_resp_DEMAND_DATA_RD.REMOTE_DRAM/,cpu/event=0xbb,umask=0x1,offcore_rsp=0x079C000001,name=offcore_resp_DEMAND_DATA_RD.LOCAL_DRAM/ -p $PID2 -I 5000

 

The difference is that these events count both retired and speculative loads, rather than retired loads only.

panda__akash
Novice

Hi Eric,

@Eric_M_Intel2 
I tried the solution you proposed. I now get good results for the set process (which was expected to make mostly local accesses).

But to my surprise, the get process now shows a lot of local accesses, which should not happen, since all the memory it accesses is on the other node. Even with speculative loads included, those loads should also have come from the remote node, not the local one.

Regards,

Akash

Eric_M_Intel2
Employee

Oops, my copy mistake: I gave you the wrong hex for LOCAL_DRAM. Try this:

 

perf stat -a -C $CPU1 -x, -o $OUTFILE1 --append -e cpu/event=0xbb,umask=0x1,offcore_rsp=0x07B0000001,name=offcore_resp_DEMAND_DATA_RD.REMOTE_DRAM/,cpu/event=0xbb,umask=0x1,offcore_rsp=0x0784000001,name=offcore_resp_DEMAND_DATA_RD.LOCAL_DRAM/ -p $PID1 -I 5000
perf stat -a -C $CPU2 -x, -o $OUTFILE2 --append -e cpu/event=0xbb,umask=0x1,offcore_rsp=0x07B0000001,name=offcore_resp_DEMAND_DATA_RD.REMOTE_DRAM/,cpu/event=0xbb,umask=0x1,offcore_rsp=0x0784000001,name=offcore_resp_DEMAND_DATA_RD.LOCAL_DRAM/ -p $PID2 -I 5000

 

 

Eric_M_Intel2
Employee

Oops, my copy mistake: I gave you the wrong hex for LOCAL_DRAM. I accidentally had bit 28 (REMOTE_HOP1) set. Try this:

perf stat -a -C $CPU1 -x, -o $OUTFILE1 --append -e cpu/event=0xbb,umask=0x1,offcore_rsp=0x07B0000001,name=offcore_resp_DEMAND_DATA_RD.REMOTE_DRAM/,cpu/event=0xbb,umask=0x1,offcore_rsp=0x078C000001,name=offcore_resp_DEMAND_DATA_RD.LOCAL_DRAM/ -p $PID1 -I 5000
perf stat -a -C $CPU2 -x, -o $OUTFILE2 --append -e cpu/event=0xbb,umask=0x1,offcore_rsp=0x07B0000001,name=offcore_resp_DEMAND_DATA_RD.REMOTE_DRAM/,cpu/event=0xbb,umask=0x1,offcore_rsp=0x078C000001,name=offcore_resp_DEMAND_DATA_RD.LOCAL_DRAM/ -p $PID2 -I 5000
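
To see how the three LOCAL_DRAM values posted in this thread differ, each offcore_rsp value can be broken into its set bit positions. The following is a minimal sketch in POSIX shell, not part of the original workaround. The helper name set_bits is hypothetical, and the only bit meaning taken from the thread is bit 28 (REMOTE_HOP1); the meaning of every other bit should be checked against the SDM.

```shell
# set_bits is a hypothetical helper: print the positions of all 1-bits
# in the value passed as $1 (decimal or 0x-prefixed hex).
set_bits() {
  v=$(( $1 ))
  i=0
  out=""
  while [ "$v" -gt 0 ]; do
    if [ $(( v & 1 )) -eq 1 ]; then
      out="${out:+$out }$i"
    fi
    v=$(( v >> 1 ))
    i=$(( i + 1 ))
  done
  printf '%s\n' "$out"
}

# The three LOCAL_DRAM offcore_rsp values posted in this thread, in order.
for mask in 0x079C000001 0x0784000001 0x078C000001; do
  printf '%s: bits %s\n' "$mask" "$(set_bits "$mask")"
done
# 0x079C000001: bits 0 26 27 28 31 32 33 34
# 0x0784000001: bits 0 26 31 32 33 34
# 0x078C000001: bits 0 26 27 31 32 33 34
```

Only the first value has bit 28 set, which matches the REMOTE_HOP1 slip described above. The two later values differ from each other only in bit 27.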

 

 

panda__akash
Novice

Hi Eric,

@Eric_M_Intel2 Thanks a lot for those suggestions. It works for me. Can you point me to a document that explains how you came up with that hex code?
@ArunJ_Intel Yes Eric's workaround did work for me. I am able to get the numbers that I wanted. 

I am really happy about the responses that I am getting on this portal. 

Thanks, 
Akash Panda

Eric_M_Intel2
Employee

The bits I used are defined in the Intel(R) 64 and IA-32 Architectures Software Developer's Manual (https://software.intel.com/content/www/us/en/develop/articles/intel-sdm.html), Volume 3B, Chapter 18, in 18.3.8.2 Offcore Response Performance Monitoring and 18.3.8.2.1 Offcore Response Monitoring for the Intel(R) Xeon(R) Processor Scalable Family.

regards,
Eric Moore
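
As a rough illustration of how such a hex code is assembled, an offcore_rsp value is the bitwise OR of one bit per selected field. The sketch below (POSIX shell, with the hypothetical helper name mask_from_bits) rebuilds the REMOTE_DRAM value used in this thread from its set bit positions. The positions are derived arithmetically from that value; which field each position selects is an assumption to verify against the SDM sections cited above.

```shell
# mask_from_bits is a hypothetical helper: OR together (1 << n) for each
# bit position given as an argument and print the result as 10 hex digits.
mask_from_bits() {
  m=0
  for b in "$@"; do
    m=$(( m | (1 << b) ))
  done
  printf '0x%010X\n' "$m"
}

# Bit positions set in the REMOTE_DRAM value from this thread.
mask_from_bits 0 28 29 31 32 33 34
# 0x07B0000001
```

Writing the mask as a list of bit positions makes a slip such as an extra bit 28 easier to spot than it is in the raw hex.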

ArunJ_Intel
Moderator

Hey Akash,


I hope you have tried out the solution provided by Eric. Does it resolve your issue?


Thanks

Arun Jose


panda__akash
Novice

Hi Arun,

Eric provided a workaround for the information that I wanted. I am running various experiments to validate it.
I will get back to you once the experiments are done and I have confirmed that the solution works. If I need more clarification, I will ask.

Regards, 
Akash 

ArunJ_Intel
Moderator

Hi Akash,


Glad to know your issue has been resolved. Eric has provided a link to the document you have requested.


Could you please confirm whether we can stop monitoring this thread, as your issue has been resolved?


Thanks 

Arun


panda__akash
Novice

Hi Arun, 

Yes. You can stop monitoring the thread. 

Thanks.
Akash

ArunJ_Intel
Moderator

Hi Akash,


Thank you for the confirmation. We will no longer be monitoring this thread. Please feel free to start a new thread if you have any further issues.



Thanks

Arun






