Does operating frequency influence cache misses?

kopcarl
Beginner
I ran 462.libquantum on my i5-2400 at 1.6 GHz, 2.1 GHz, 2.7 GHz and 3.1 GHz, and I found that LLC misses increase at higher frequencies. The details are as follows: LLC misses: 5E+09, 6.9E+09, 9E+09, 1E+10 at (1.6 GHz, 2.1 GHz, 2.7 GHz, 3.1 GHz). Why would changing the frequency influence cache misses?
Patrick_F_Intel1
Employee

Hello kopcarl,

I don't really know why you are seeing the numbers you are seeing. How are you measuring your LLC cache misses (which utility are you using)? Is the tool reporting total cache misses over the run or misses/sec? How are you running libquantum: just 1 thread, or multiple threads? Is libquantum the only thing (more or less) running?

There is a PDF, "Analyzing Libquantum" from Rogue Wave Software, that indicates libquantum fetches data that doesn't get used. I don't know if the issues described are accurate or still true. If more than 1 thread is running, it is possible that 1 thread is kicking out the data needed by another thread. Or maybe the tool you are using to measure bandwidth runs more frequently (more samples/second) as the frequency increases. I don't know how long libquantum runs, so I can't really tell if the 'tool has an issue' possibility is realistic.

Pat

kopcarl
Beginner
Hi Pat, thank you for your quick reply :)

I wrote my own code to monitor LLC misses. I write 0x53412e (LLC Misses) into MSR 0x186 and read 0xc1 every 10 million cycles. To avoid overflow, I reset 0xc1 to 0 each time I read it, until the process ends. The total cache miss count is over the whole run: I sum the value every time I read it.

Libquantum is a single-threaded program from SPEC06, and it runs on my i5-2400 under Linux 3.6.0 alongside my monitor process. I have some level of confidence in my code because I also monitored other events when running libquantum at 1.6 GHz, 2.1 GHz, 2.7 GHz and 3.1 GHz respectively:

Frequency                   1.6 GHz    2.1 GHz    2.7 GHz    3.1 GHz
Unhalted Reference Cycles   2.74E+12   2.16E+12   1.77E+12   1.64E+12
Run time (sec)              885.33     697.17     571.37     527.49
Unhalted Core Cycles        1.42E+12   1.46E+12   1.54E+12   1.64E+12
Instructions Retired        2.86E+12   2.86E+12   2.86E+12   2.87E+12

These numbers look real. I admit that the monitor process needs another core while libquantum is running, but I believe the monitor process does not disturb the LLC.

carl
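For reference, the value written to 0x186 can be unpacked field by field. This is a minimal Python sketch, assuming the architectural IA32_PERFEVTSELx bit layout from the Intel SDM (0x186 is IA32_PERFEVTSEL0 and 0xc1 is IA32_PMC0). The helper name `decode_perfevtsel` is invented for this illustration and is not part of any tool mentioned in the thread.

```python
def decode_perfevtsel(value: int) -> dict:
    """Split an IA32_PERFEVTSELx value into its architectural fields."""
    def bit(n: int) -> bool:
        return bool((value >> n) & 1)

    return {
        "event_select": value & 0xFF,       # bits 0-7
        "umask": (value >> 8) & 0xFF,       # bits 8-15
        "usr": bit(16),                     # count in user mode
        "os": bit(17),                      # count in kernel mode
        "edge": bit(18),                    # edge detect
        "pin_control": bit(19),
        "interrupt": bit(20),               # interrupt on overflow
        "any_thread": bit(21),
        "enable": bit(22),                  # counter enabled
        "invert": bit(23),
        "cmask": (value >> 24) & 0xFF,      # bits 24-31
    }


fields = decode_perfevtsel(0x53412E)
for name, field_value in fields.items():
    print(f"{name:13s} {field_value}")
```

Decoded this way, 0x53412e selects event 0x2E with umask 0x41, counts in both user and kernel mode, and has the enable bit set.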
McCalpinJohn
Honored Contributor III

Event 2Eh, Mask 41h is the "architectural" performance counter event for LLC misses.  The Intel Architecture SW Developer's Manual, Volume 3, Chapter 18, section 18.2.3 describes these predefined architectural events.  For this event, the document says:

Last Level Cache References— Event select 2EH, Umask 4FH This event counts requests originating from the core that reference a cache line in the last level cache. The event count includes speculation and cache line fills due to the first-level cache hardware prefetcher, but may exclude cache line fills due to other hardware-prefetchers. Because cache hierarchy, cache sizes and other implementation-specific characteristics; value comparison to estimate performance differences is not recommended.

The most important item here is that the count "may" exclude cache line fills due to the L2 hardware prefetchers.

As you slow down the core frequency, you decrease the rate at which it "consumes" data.  This provides the L2 prefetchers more time to get the requested data into the L3 cache, which in turn decreases the L3 cache miss rate.

Sometimes you want to count the total amount of data moved (in which case this counter is not helpful), and sometimes you are more interested in how many memory accesses experience stalls due to missing in the caches (either because the target was not prefetched or because the target was not prefetched early enough to get the data in the cache before the demand request).  This counter event is more appropriate for the latter case.

If you want to know the total amount of data moved to/from the L3 cache for this processor, the best place to look is the memory controller counters.  These are available using VTune Amplifier XE 2013 Update 5, or Intel PCM version 2.4 or later, or you can roll your own analysis tools using the documentation that Intel released on 2013-03-15 (the article is titled "Monitoring Integrated Memory Controller Requests in the 2nd, 3rd and 4th generation Intel® Core™ processors").  If you roll your own tools, you should note that the counters are 32 bits, so they can roll over in about 13 seconds when the system is running at its maximum bandwidth.
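The roll-over remark can be checked with a little arithmetic. The sketch below rests on two assumptions that the post does not state: each counter increment corresponds to one 64-byte cache line, and the peak bandwidth is about 21.3 GB/s (a typical figure for dual-channel DDR3-1333). The function names are made up for this example.

```python
COUNTER_BITS = 32        # counter width stated in the post
CACHE_LINE_BYTES = 64    # assumption: one count per 64-byte cache line


def wrapped_delta(prev: int, curr: int, bits: int = COUNTER_BITS) -> int:
    """Difference between two readings of a counter that wraps at 2**bits.

    Correct only if the counter wrapped at most once between the readings.
    """
    return (curr - prev) % (1 << bits)


def seconds_to_rollover(bytes_per_second: float,
                        bits: int = COUNTER_BITS,
                        bytes_per_count: int = CACHE_LINE_BYTES) -> float:
    """Time for the counter to wrap at a sustained transfer rate."""
    return (1 << bits) * bytes_per_count / bytes_per_second


peak_bandwidth = 21.3e9  # assumed peak in bytes/s, not taken from the thread
print(f"roll-over after {seconds_to_rollover(peak_bandwidth):.1f} s")
print(wrapped_delta(0xFFFFFFF0, 0x00000010))  # reading taken across a wrap
```

Under these assumptions the counter wraps after roughly 13 seconds, so a home-grown tool would need to sample well inside that interval and accumulate the wrapped differences.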

SergeyKostrov
Valued Contributor II
>>...LLC miss: 5E+09, 6.9E+09, 9E+09, 1E+10 in (1.6Ghz, 2.1Ghz, 2.7Ghz, 3.1Ghz).

These numbers need to be normalized to some base frequency. If the pattern of processing is always the same, the number of cache misses must also be consistent. Don't forget that your tests cannot be considered deterministic on a non-deterministic operating system, because you cannot simply stop all the other system threads in order to get numbers that are as accurate as possible. Even a priority boost for the thread running your test doesn't resolve that problem completely.

>>...I have some level of confidence with my code because I also monitor other events when running libquantum
>>at 1.6Ghz, 2.1Ghz, 2.7Ghz and 3.1Ghz respectively...

Did you compare your numbers with VTune numbers for the same test case?
Bernard
Valued Contributor I

>>>I am wondering why changing frequency can influence cache misses?>>>

I do not have a direct answer, but I suppose that when the frequency increases more total work is done, hence the running program could generate more cache misses (just my uneducated guess).

Now I would also check whether your results are repeatable each time you measure the cache miss rate. Wildly varying values can indicate transient effects that lead to different results. As was pointed out, your testing environment is non-deterministic, and there is even a possibility of context switching: your monitoring code could be scheduled on the same core where previously only the libquantum thread was running, thus polluting the results. Not to mention system thread activity during the same time window.

kopcarl
Beginner

John D. McCalpin wrote:

The most important item here is that the count "may" exclude cache line fills due to the L2 hardware prefetchers.

As you slow down the core frequency, you decrease the rate at which it "consumes" data.  This provides the L2 prefetchers more time to get the requested data into the L3 cache, which in turn decreases the L3 cache miss rate.

Thank you, John D. McCalpin.

Are you implying that the operating frequency of the L2/LLC does not change even when the core frequency increases or decreases?

John D. McCalpin wrote:

These are available using VTune Amplifier XE 2013 Update 5, or Intel PCM version 2.4 or later, or you can roll your own analysis tools using the documentation that Intel released on 2013-03-15 (the article is titled "Monitoring Integrated Memory Controller Requests in the 2nd, 3rd and 4th generation Intel® Core™ processors").  

This is very helpful! But when I run pcm-memory.x (from PCM 2.5) on my i5-2400, it does not work well. It needs Jaketown.

kopcarl
Beginner

Sergey Kostrov wrote:

>>...LLC miss: 5E+09, 6.9E+09, 9E+09, 1E+10 in (1.6Ghz, 2.1Ghz, 2.7Ghz, 3.1Ghz).

These numbers need to be normalized to some base frequency.

Thank you for your help, Sergey. But why do these numbers need to be normalized? I don't get it.
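One possible reading of "normalized" is to express the misses per instruction or per second instead of as raw totals. The sketch below only rearranges figures already posted in this thread. Choosing misses per 1000 instructions and misses per second is an assumption about what was meant, not something Sergey specified.

```python
# Figures quoted earlier in the thread, ordered by core frequency.
freq_ghz = [1.6, 2.1, 2.7, 3.1]
llc_misses = [5e9, 6.9e9, 9e9, 1e10]
instructions = [2.86e12, 2.86e12, 2.86e12, 2.87e12]
run_seconds = [885.33, 697.17, 571.37, 527.49]

# Misses per 1000 retired instructions: independent of how long the run took.
mpki = [1000.0 * m / i for m, i in zip(llc_misses, instructions)]

# Misses per second of wall-clock time: depends on run time as well.
misses_per_sec = [m / t for m, t in zip(llc_misses, run_seconds)]

for f, k, r in zip(freq_ghz, mpki, misses_per_sec):
    print(f"{f:.1f} GHz: {k:.2f} misses per 1000 instructions, "
          f"{r / 1e6:.1f} M misses/s")
```

Because the instruction count is nearly identical in all four runs, the per-instruction figure roughly doubles from 1.6 GHz to 3.1 GHz just as the raw count does, so the increase is not merely a side effect of the shorter run time.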

Sergey Kostrov wrote:

If the pattern of processing is always the same numbers of cache misses also must be consistent. Don't forget, that all your tests can not be rated as deterministic in non-deterministic operating system because you can not simply stop all the rest system threads in order to get as accurate as possible numbers. Even a priority boost of a thread with your test processing doesn't resolve that problem completely.

Actually, I ran this test several times, and the results are very close.

Sergey Kostrov wrote:

Did you compare your numbers with VTune numbers for the same test case?

Frankly speaking, I did not compare the numbers with VTune. I will give it a try.

SergeyKostrov
Valued Contributor II
>>...This is very helpful! But when I run pcm-memory.x (from PCM 2.5) on i5-2400, it does not work well...

Could you post more technical details?
Roman_D_Intel
Employee

This is very helpful! But when I run pcm-memory.x (from PCM 2.5) on i5-2400, it does not work well. It needs Jaketown.

For Intel Core i5-2400 you should run pcm.x (Linux). It has the memory read and write traffic in GBytes in the READ and WRITE columns.

kopcarl
Beginner

Sergey Kostrov wrote:

>>...This is very helpful! But when I run pcm-memory.x (from PCM 2.5) on i5-2400, it does not work well...

Could you post more technical details?

Sure!

[bash]

root@***:~/pmc/IntelPerformanceCounterMonitorV2.5# ./pcm-memory.x

Intel(r) Performance Counter Monitor: Memory Bandwidth Monitoring Utility

Copyright (c) 2009-2012 Intel Corporation
This utility measures memory bandwidth per channel in real-time

Num logical cores: 4
Num sockets: 1
Threads per core: 1
Core PMU (perfmon) version: 3
Number of core PMU generic (programmable) counters: 8
Width of generic (programmable) counters: 48 bits
Number of core PMU fixed counters: 3
Width of fixed counters: 48 bits
Nominal core frequency: 3100000000 Hz
Package thermal spec power: 95 Watt; Package minimum power: 60 Watt; Package maximum power: 120 Watt;

Detected Intel(R) Core(TM) i5-2400 CPU @ 3.10GHz "Intel(r) microarchitecture codename Sandy Bridge"
Jaketown CPU is required for this tool! Program aborted
Cleaning up

[/bash]

And if I comment out these lines (I just wanted to try anyway :)):

if (cpu_model != m->JAKETOWN)
{
    cout << "Jaketown CPU is required for this tool! Program aborted" << endl;
    m->cleanup();
    return -1;
}

it prints the following:

[bash]

Detected Intel(R) Core(TM) i5-2400 CPU @ 3.10GHz "Intel(r) microarchitecture codename Sandy Bridge"
Update every 1 seconds
Time elapsed: 998 ms
Called sleep function for 1000 ms
---------------------------------------|
-- Socket 0 --|
---------------------------------------|
---------------------------------------|
---------------------------------------|
-- Memory Performance Monitoring --|
---------------------------------------|
-- Mem Ch 0: Reads (MB/s): 0.00 --|
-- Writes(MB/s): 0.00 --|
-- Mem Ch 1: Reads (MB/s): 0.00 --|
-- Writes(MB/s): 0.00 --|
-- Mem Ch 2: Reads (MB/s): 0.00 --|
-- Writes(MB/s): 0.00 --|
-- Mem Ch 3: Reads (MB/s): 0.00 --|
-- Writes(MB/s): 0.00 --|
-- ND0 Mem Read (MB/s): 0.00 --|
-- ND0 Mem Write (MB/s) : 0.00 --|
-- ND0 P. Write (T/s) : 0 --|
-- ND0 Memory (MB/s): 0.00 --|
---------------------------------------||---------------------------------------
-- System Read Throughput(MB/s): 0.00 --
-- System Write Throughput(MB/s): 0.00 --
-- System Memory Throughput(MB/s): 0.00 --
---------------------------------------||---------------------------------------

[/bash]

Roman_D_Intel
Employee

does pcm.x work for you?

kopcarl
Beginner

Roman Dementiev (Intel) wrote:

Quote:

This is very helpful! But when I run pcm-memory.x (from PCM 2.5) on i5-2400, it does not work well. It needs Jaketown.

For Intel Core i5-2400 you should run pcm.x (Linux). It has the memory read and write traffic in GBytes in the READ and WRITE columns.

Thank you! I am  stupid. :)-

Roman_D_Intel
Employee

Thank you! I am  stupid. :)-

Not at all. Perhaps pcm-memory should be extended with the client memory controller info from pcm.x. Currently pcm-memory supports only server processors.

kopcarl
Beginner

Roman Dementiev (Intel) wrote:

does pcm.x work for you?

nope.

Core (SKT) | EXEC | IPC  | FREQ | AFREQ | L3MISS | L2MISS | L3HIT | L2HIT | L3CLK | L2CLK | READ | WRITE | TEMP

   0    0    0.00   0.32   0.00    0.63     9002     22 K    0.60    0.20    0.15    0.05    N/A     N/A     67
   1    0    0.00   0.36   0.00    0.57      560     1142    0.51    0.00    0.44    0.09    N/A     N/A     67
   2    0    0.00   0.81   0.00    0.66     1512     7087    0.79    0.45    0.08    0.06    N/A     N/A     67
   3    0    0.00   0.85   0.00    0.66      283     1596    0.82    0.39    0.04    0.06    N/A     N/A     67
---------------------------------------------------------------------------------------------------------------
 SKT    0    0.00   0.46   0.00    0.64     11 K     32 K    0.65    0.28    0.13    0.05   0.00    0.00     67
---------------------------------------------------------------------------------------------------------------
 TOTAL  *    0.00   0.46   0.00    0.64     11 K     32 K    0.65    0.28    0.13    0.05   0.00    0.00    N/A

Roman_D_Intel
Employee

can you run "./memoptest 0" in parallel and post pcm.x output? This is a memory test from PCM: build it with "make memoptest".

kopcarl
Beginner

Roman Dementiev (Intel) wrote:

can you run "./memoptest 0" in parallel and post pcm.x output? This is a memory test from PCM: build it with "make memoptest".

sure!

EXEC : instructions per nominal CPU cycle
IPC : instructions per CPU cycle
FREQ : relation to nominal CPU frequency='unhalted clock ticks'/'invariant timer ticks' (includes Intel Turbo Boost)
AFREQ : relation to nominal CPU frequency while in active state (not in power-saving C state)='unhalted clock ticks'/'invariant timer ticks while in C0-state' (includes Intel Turbo Boost)
L3MISS: L3 cache misses
L2MISS: L2 cache misses (including other core's L2 cache *hits*)
L3HIT : L3 cache hit ratio (0.00-1.00)
L2HIT : L2 cache hit ratio (0.00-1.00)
L3CLK : ratio of CPU cycles lost due to L3 cache misses (0.00-1.00), in some cases could be >1.0 due to a higher memory latency
L2CLK : ratio of CPU cycles lost due to missing L2 cache but still hitting L3 cache (0.00-1.00)
READ : bytes read from memory controller (in GBytes)
WRITE : bytes written to memory controller (in GBytes)
TEMP : Temperature reading in 1 degree Celsius relative to the TjMax temperature (thermal headroom): 0 corresponds to the max temperature


Core (SKT) | EXEC | IPC  | FREQ | AFREQ | L3MISS | L2MISS | L3HIT | L2HIT | L3CLK | L2CLK | READ | WRITE | TEMP

   0    0    1.53   1.44   1.06    1.06   2289 K   2300 K    0.00    0.28    0.13    0.00    N/A     N/A     39
   1    0    0.00   0.16   0.01    1.06    121 K    134 K    0.10    0.02    0.99    0.03    N/A     N/A     42
   2    0    2.07   1.94   1.06    1.06     26 M     28 M    0.08    0.09    1.43    0.02    N/A     N/A     35
   3    0    0.00   0.46   0.00    1.06     20 K     22 K    0.10    0.07    0.78    0.02    N/A     N/A     45
---------------------------------------------------------------------------------------------------------------
 SKT    0    0.90   1.68   0.53    1.06     28 M     30 M    0.07    0.11    0.78    0.01   9.67    2.84     35
---------------------------------------------------------------------------------------------------------------
 TOTAL  *    0.90   1.68   0.53    1.06     28 M     30 M    0.07    0.11    0.78    0.01   9.67    2.84    N/A

Instructions retired: 11 G ; Active cycles: 6614 M ; Time (TSC): 3094 Mticks ; C0 (active,non-halted) core residency: 50.20 %

C1 core residency: 0.15 %; C3 core residency: 0.00 %; C6 core residency: 49.65 %; C7 core residency: 0.00 %
C2 package residency: 0.00 %; C3 package residency: 0.00 %; C6 package residency: 0.00 %; C7 package residency: 0.00 %

PHYSICAL CORE IPC : 1.68 => corresponds to 42.12 % utilization for cores in active state
Instructions per nominal CPU cycle: 0.90 => corresponds to 22.51 % core utilization over time interval
----------------------------------------------------------------------------------------------

----------------------------------------------------------------------------------------------
SKT 0 package consumed 39.01 Joules
----------------------------------------------------------------------------------------------
TOTAL: 39.01 Joules

Roman_D_Intel
Employee

Core (SKT) | EXEC | IPC  | FREQ | AFREQ | L3MISS | L2MISS | L3HIT | L2HIT | L3CLK | L2CLK | READ | WRITE | TEMP

 SKT    0    0.90   1.68   0.53    1.06     28 M     30 M    0.07    0.11    0.78    0.01   9.67    2.84     35
---------------------------------------------------------------------------------------------------------------
 TOTAL  *    0.90   1.68   0.53    1.06     28 M     30 M    0.07    0.11    0.78    0.01   9.67    2.84    N/A

I highlighted the read and write traffic in the pcm output above: READ is 9.67 GBytes and WRITE is 2.84 GBytes.
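To turn the READ and WRITE totals into bandwidth, they can be divided by the length of the sampling interval. The sketch below takes the interval from the "Time (TSC)" line of the pcm.x output posted above and assumes the TSC ticks at the nominal core frequency that PCM reported (3100000000 Hz). The variable names are chosen for this example only.

```python
NOMINAL_HZ = 3_100_000_000  # "Nominal core frequency" printed by PCM
tsc_ticks = 3094e6          # "Time (TSC): 3094 Mticks"
read_gbytes = 9.67          # READ column, SKT 0 row
write_gbytes = 2.84         # WRITE column, SKT 0 row

# Assumption: the TSC advances at the nominal frequency.
elapsed_s = tsc_ticks / NOMINAL_HZ

read_gb_per_s = read_gbytes / elapsed_s
write_gb_per_s = write_gbytes / elapsed_s
total_gb_per_s = read_gb_per_s + write_gb_per_s

print(f"interval: {elapsed_s:.3f} s")
print(f"read:  {read_gb_per_s:.2f} GB/s")
print(f"write: {write_gb_per_s:.2f} GB/s")
print(f"total: {total_gb_per_s:.2f} GB/s")
```

Since the interval is very close to one second, the GBytes figures in the table are nearly the same as GB/s for this sample.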

kopcarl
Beginner

Roman Dementiev (Intel) wrote:

Quote:

Core (SKT) | EXEC | IPC  | FREQ | AFREQ | L3MISS | L2MISS | L3HIT | L2HIT | L3CLK | L2CLK | READ | WRITE | TEMP

 SKT    0    0.90   1.68   0.53    1.06     28 M     30 M    0.07    0.11    0.78    0.01   9.67    2.84     35
---------------------------------------------------------------------------------------------------------------
 TOTAL  *    0.90   1.68   0.53    1.06     28 M     30 M    0.07    0.11    0.78    0.01   9.67    2.84    N/A

I highlighted the read and write traffic in the pcm output above.

Thanks a lot. I see.

Roman_D_Intel
Employee

You are welcome. It seems your previous pcm measurement was on an idle system.

kopcarl
Beginner

Roman Dementiev (Intel) wrote:

You are welcome. It seems your previous pcm measurement was on an idle system.

Yes, you are absolutely right! Could you also help me with my confusion about why the LLC misses vary with frequency?
