How to measure remote read or write requests of a core using the PMU?

Duan_Z_

I want to measure the number of remote DRAM read and write requests issued by each core.

I tried to use the Cbox for this, but I can't find a suitable base event.

However, pmu-tools (https://github.com/andikleen/pmu-tools) has the event "OFFCORE_RESPONSE.ALL_READS.LLC_MISS.REMOTE_DRAM", which can count this per core.

I looked up the definition of this event: the event code is "0xB7, 0xBB" and the umask is "0x01". But it doesn't say which box to use.

I also checked the Intel® Xeon® Processor E5 and E7 v3 Family Uncore Performance Monitoring Reference Manual but could not find a related event there.

How can I implement this event with the PMU to measure the remote read/write count of each core?

Thomas_G_4

Hi,

The OFFCORE_RESPONSE events are core-local events, so they are not related to any uncore 'box'. They are counted by the per-core counters (MSR_PMC0-7 with config registers MSR_PERFEVTSEL0-7), which you can program through the MSR interface.
Depending on the event code (0xB7 or 0xBB), you have to program the filter bitmask (many tools use human-readable names that specify a bitmask) in register 0x1A6 or 0x1A7 respectively (event 0xB7 -> register 0x1A6, event 0xBB -> register 0x1A7). Which bits can be set, and what they mean, is architecture-specific; the definitions can be found in the SDM or in the matrix_bit_definitions files at https://download.01.org/perfmon . The matrix_bit_definitions files use the same names as pmu-tools, so it should be easy to retrieve the bitmask that pmu-tools uses.
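
To make the sequence above concrete, here is a minimal Python sketch of how the register values could be put together. The helper names (`perfevtsel`, `offcore_msr_writes`, `write_msr`) and `PLACEHOLDER_MASK` are made up for illustration; the register addresses and the bit layout of the config register follow the SDM (where the registers are called IA32_PERFEVTSELx and IA32_PMCx). The filter bitmask is deliberately left as a placeholder, since the real value has to come from the SDM or the matrix_bit_definitions file for your architecture. `write_msr` assumes Linux with the msr kernel module loaded and root privileges, and is not called here.

```python
import os
import struct

# Register addresses as documented in the Intel SDM.
IA32_PERFEVTSEL0 = 0x186  # config register of general-purpose counter 0
IA32_PMC0 = 0x0C1         # general-purpose counter 0
OFFCORE_RSP_MSR = {0xB7: 0x1A6, 0xBB: 0x1A7}  # event code -> filter register


def perfevtsel(event, umask, usr=True, kernel=True):
    """Encode a config register value with the enable bit set."""
    value = (event & 0xFF) | ((umask & 0xFF) << 8)  # bits 0-7 and 8-15
    if usr:
        value |= 1 << 16  # USR: count while in user mode
    if kernel:
        value |= 1 << 17  # OS: count while in kernel mode
    return value | (1 << 22)  # EN: enable the counter


def offcore_msr_writes(event, umask, response_mask, counter=0):
    """Return the (MSR address, value) writes for one offcore event."""
    if event not in OFFCORE_RSP_MSR:
        raise ValueError(f"not an offcore response event code: {event:#x}")
    return [
        (OFFCORE_RSP_MSR[event], response_mask),  # filter bitmask first
        (IA32_PMC0 + counter, 0),                 # reset the counter
        (IA32_PERFEVTSEL0 + counter, perfevtsel(event, umask)),
    ]


def write_msr(cpu, address, value):
    """Write one MSR on one logical CPU (assumes Linux, msr module, root)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_WRONLY)
    try:
        os.pwrite(fd, struct.pack("<Q", value), address)
    finally:
        os.close(fd)


# Placeholder only, NOT a real encoding: look up the actual bitmask for
# your architecture before using this.
PLACEHOLDER_MASK = 0x0

if __name__ == "__main__":
    for address, value in offcore_msr_writes(0xB7, 0x01, PLACEHOLDER_MASK):
        print(f"MSR {address:#x} <- {value:#x}")
```

Because the event is core-local, the same writes would have to be repeated on every logical CPU you want to observe, and the counter is then read back from the counter register of that CPU. On processors that have a global control register (IA32_PERF_GLOBAL_CTRL, 0x38F), the corresponding counter bit there typically needs to be set as well before anything is counted.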

You will probably run into a problem with 'remote DRAM write requests of each core'. The measurement facility is located between a core's private L2 and the 'ring' interconnect of the Uncore (L3, DRAM, QPI, ...). From this location, it is not possible to attribute cache lines evicted from L3 to DRAM to a single CPU core. The 'COREWB' (writeback) bit in 0x1A6 or 0x1A7 counts dirty cache lines evicted from L2 to L3, not from L3 to DRAM, so you cannot use it to measure memory bandwidth in general. (This paragraph is loosely paraphrased from an email by Dr. Bandwidth to the PAPI mailing list.)

Duan_Z_

Thanks for your reply.

Let's put aside the problem of 'remote DRAM write requests of each core'. Can I use the PMU (maybe the Cbox) to count 'remote DRAM read requests of each core'?

My previous projects are all based on the PMU, and I am not familiar with core-local events or with how to adapt my existing project to the MSR interface. Do you have any documentation on this that would help me understand and use it quickly?

Duan_Z_

Thanks a lot,

I already use the memory controllers (on the remote socket) to simulate something for the situation where the local socket has only one active core (the others are shut down).

But when the simulator is extended to the multi-core case, the iMC cannot tell which core a read or write came from.

I'm sorry that I didn't explain the MSR / core-local issue clearly. I have already implemented a complete set of MSR interfaces.

But I don't have a reference sheet for core-local (or MSR) events (I must admit that I know only a little about them).

Could you point me to a reference sheet for MSR or core-local events?

Thanks for your help.
