How can I get the L2 cache miss rate on XEON E5540 with Linux kernel 2.6.35

handlight · ‎05-11-2011

I'm a new user of Intel VTune Amplifier XE 2011 and I'm not familiar with it.

These days I'm working on an application that monitors numerous network data flows. The CPU is a Xeon E5540 at 2.53 GHz with HT enabled. The application runs on a single core of the CPU, and its CPU usage is above 95% at all times. So it may trigger a large number of L2 cache misses, right?

I set the property to "profile system" and chose to use L2_LINES_IN.SELF.ANY / INST_RETIRED.ANY, following the "Intel 64 and IA-32 Architectures Optimization Reference Manual" (page B-74). Unfortunately, I couldn't find L2_LINES_IN.SELF.ANY in VTune for Linux, so I used L2_LINES_IN.ANY / INST_RETIRED.ANY to get the L2 cache miss rate. The result showed that the application caused only a few L2 cache misses (about 3% on average), which is far lower than what I had expected (about 20% to 40%).

So my question is: can I use L2_LINES_IN.ANY instead of L2_LINES_IN.SELF.ANY? What's the difference between them? Or are there any other ways to get the L2 cache miss rate? I don't have RDC installed; all of my work is under Linux.

I hope you can help.

Thanks.

Peter_W_Intel · ‎05-12-2011

CPU usage above 95% does not by itself imply a large number of L2 cache misses.

It seems that your application is single-threaded and runs on a 4-core system, so using L2_LINES_IN.ANY is OK.

I suggest using this ratio: L2_LINES_IN.ANY / CPU_CLK_UNHALTED.CORE * 100

If the percentage is greater than 20%, consider modifying the code to reduce misses.
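To make the ratio concrete, here is a minimal Python sketch. The function names and the sample counts are made up for illustration; only the formula and the 20% guideline come from the suggestion above.

```python
# Guideline from the post: above 20% is worth investigating.
THRESHOLD_PERCENT = 20.0


def l2_lines_in_percent(l2_lines_in: int, unhalted_cycles: int) -> float:
    """Return L2_LINES_IN.ANY / CPU_CLK_UNHALTED.CORE * 100."""
    if unhalted_cycles <= 0:
        raise ValueError("unhalted_cycles must be positive")
    return 100.0 * l2_lines_in / unhalted_cycles


def needs_attention(l2_lines_in: int, unhalted_cycles: int) -> bool:
    """True when the ratio exceeds the 20% guideline."""
    return l2_lines_in_percent(l2_lines_in, unhalted_cycles) > THRESHOLD_PERCENT


# Hypothetical sample counts, not measurements from this thread.
sample_lines_in = 250_000_000
sample_cycles = 1_000_000_000
print(l2_lines_in_percent(sample_lines_in, sample_cycles))
print(needs_attention(sample_lines_in, sample_cycles))
```

Both counts should come from the same collection run so that the numerator and denominator cover the same time window.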

Regards, Peter

handlight · ‎05-12-2011

Thank you for your quick reply ^_^. I will give this method a try.

I'm still confused about the difference between L2_LINES_IN.ANY and L2_LINES_IN.SELF.ANY. Under which setting can I find L2_LINES_IN.SELF.ANY and use it?

And what is the meaning of "L2_LINES_IN.ANY / CPU_CLK_UNHALTED.CORE * 100"?

Thanks:)

Peter_W_Intel · ‎05-12-2011

I think that the E5540 is an Intel Core 2 processor. I found L2_LINES_IN in the manual:

L2_LINES_IN.BOTH_CORES.ANY - collect all L2 misses from 4 cores

L2_LINES_IN.SELF.ANY - collect all L2 misses from this core only

Since you said that you ran a single-threaded application, they are equivalent.

The ratio means: if the L2 miss count is greater than 20 per 100 clocks, that is bad!

Regards, Peter

handlight · ‎05-14-2011

Thanks for your help :)

I checked the three groups of preset sampling events in the analyzer for the three architectures: Core, Nehalem and Sandy Bridge. The group I can use is Nehalem. Does that mean my processor belongs to the Nehalem family?

But I used none of them; I used the non-preset group.

Just as with L2_LINES_IN.SELF.ANY, I failed to find the CPU_CLK_UNHALTED.CORE event. So I selected all events that start with CPU_CLK_UNHALTED, and here is one of my test results:

CPU_CLK_UNHALTED.CORE.REF 339,310,000,000

CPU_CLK_UNHALTED.CORE.REF_P 17,770,400,000

CPU_CLK_UNHALTED.CORE.THREAD 356,690,000,000

CPU_CLK_UNHALTED.CORE.THREAD_P 355,938,000,000

CPU_CLK_UNHALTED.CORE.TOTAL_CYCLES 356,826,000,000

INST_RETIRED.ANY 149,382,000,000

L2_LINES_IN.ANY 3,950,000,000

So, what's the difference between the CPU_CLK_UNHALTED.* events?
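As a quick sanity check on the numbers above, the short Python sketch below computes both ratios discussed in this thread. It assumes that the count listed as CPU_CLK_UNHALTED.CORE.THREAD is the right stand-in for the cycle denominator suggested earlier; that choice is an assumption, not something confirmed in the thread.

```python
# Counts copied from the results listed in the post.
counts = {
    "CPU_CLK_UNHALTED.CORE.THREAD": 356_690_000_000,
    "INST_RETIRED.ANY": 149_382_000_000,
    "L2_LINES_IN.ANY": 3_950_000_000,
}

l2_lines = counts["L2_LINES_IN.ANY"]

# L2 lines brought in per 100 retired instructions.
lines_per_100_inst = 100.0 * l2_lines / counts["INST_RETIRED.ANY"]

# L2 lines brought in per 100 unhalted cycles (assumed denominator, see above).
lines_per_100_cycles = 100.0 * l2_lines / counts["CPU_CLK_UNHALTED.CORE.THREAD"]

print(f"per 100 instructions: {lines_per_100_inst:.2f}")
print(f"per 100 cycles:       {lines_per_100_cycles:.2f}")
```

The per-instruction figure comes out at roughly 2.6%, in line with the "about 3%" reported in the first post, and both ratios sit well below the 20% guideline.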

TimP · ‎05-14-2011

Yes, the Xeon E5540 is a Nehalem CPU. If you really do want the L2 cache miss rate, you could simply add the L2 cache line miss retired event to the default events; there is no obvious reason to explore all the CPU_CLK events other than the one normally used. But maybe you don't mean L2 cache miss rate: the L3 miss rate may be more significant, even if it is less obvious how to collect.

Peter_W_Intel · ‎05-15-2011

If you are working on a Nehalem-family processor, please focus on LLC misses, which carry a bigger penalty than L2 misses.

Event ratio = ((MEM_LOAD_RETIRED.LLC_MISS * 180) / CPU_CLK_UNHALTED.THREAD) * 100

If the ratio is greater than 20%, consider modifying the code to reduce LLC misses.
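The event ratio above can be sketched in Python as follows. The factor 180 is taken from the formula and is read here as an assumed per-miss penalty in cycles; the function name and the sample counts are hypothetical.

```python
# Taken from the formula in the post; interpreted as cycles lost per LLC miss.
ASSUMED_LLC_MISS_PENALTY_CYCLES = 180


def llc_miss_impact_percent(
    llc_misses: int,
    unhalted_thread_cycles: int,
    penalty_cycles: int = ASSUMED_LLC_MISS_PENALTY_CYCLES,
) -> float:
    """((MEM_LOAD_RETIRED.LLC_MISS * penalty) / CPU_CLK_UNHALTED.THREAD) * 100."""
    if unhalted_thread_cycles <= 0:
        raise ValueError("unhalted_thread_cycles must be positive")
    return 100.0 * llc_misses * penalty_cycles / unhalted_thread_cycles


# Hypothetical counts for illustration only.
impact = llc_miss_impact_percent(1_000_000, 1_000_000_000)
print(f"estimated share of cycles spent on LLC misses: {impact:.1f}%")
if impact > 20.0:
    print("above the 20% guideline: consider reducing LLC misses")
```

Because the penalty is a fixed estimate, the result is best read as a rough indicator of how much time LLC misses may cost, not as an exact measurement.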

Regards, Peter