Why cache miss ratio is so high on Intel(R) Xeon(R) Gold 6138 CPU.

shang__xiaowei · 03-27-2019

Dear experts,

I measured the cache miss ratio in the following experiment on an Intel(R) Xeon(R) Gold 6138 CPU.

In the experiment, I used Intel CAT to allocate 5MB (i.e., 2 ways) of L3 cache to a core on one socket, and ran a thread on that core that sequentially accesses an array of size 1MB, 2MB, 3MB, 4MB, and 5MB in turn. The cache miss ratios I observed are as follows.

Array size    Cache miss ratio
1MB           0.01%
2MB           5.55%
3MB           15.43%
4MB           24.35%
5MB           34.49%

I cannot understand 1) why the cache miss ratio is so high, and 2) why it keeps increasing as the array size grows, even though the array stays within the cache partition size (i.e., 5MB).

After the above experiment, I conducted two more experiments and saw similar trends in the cache miss ratio. In the first, I did not use Intel CAT to allocate cache to the core (i.e., the core could use the full L3 cache), and the cache miss ratio followed a similar trend as the array size increased from 1MB to 27.5MB.

In the second, I repeated the above two experiments with huge pages (1GB pages) and again saw similar cache miss ratio trends.

Could you please help me answer these questions? Thanks very much in advance.

Our CPU hardware parameters:

Core(s) per socket: 20

Socket(s): 4

Model name: Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz

L1d cache: 32K

L1i cache: 32K

L2 cache: 1024K

L3 cache: 28160K

Cache information

L3 Cache

Num ways: 11

Way size: 2621440 bytes

Num sets: 40960

Line size: 64 bytes

Total size: 28835840 bytes

L2 Cache

Num ways: 16

Way size: 65536 bytes

Num sets: 1024

Line size: 64 bytes

Total size: 1048576 bytes
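
The reported geometry can be sanity-checked with a little arithmetic. The sketch below is only an illustration. It assumes that "MB" in the post means MiB and that cache lines spread evenly over the reported sets, which is an idealization and may not match how the real L3 maps addresses. All function and constant names are made up for this example.

```python
# Cache geometry copied from the listing in the post.
LINE = 64
L3_WAYS, L3_WAY_SIZE, L3_SETS = 11, 2621440, 40960
L2_WAYS, L2_WAY_SIZE, L2_SETS = 16, 65536, 1024
MIB = 1024 * 1024  # assumption: "MB" in the post means MiB


def total_size(ways, way_size):
    """Capacity in bytes of `ways` ways of `way_size` bytes each."""
    return ways * way_size


def avg_lines_per_set(array_bytes, sets=L3_SETS, line=LINE):
    """Average number of array lines per set, assuming an even spread (idealized)."""
    return (array_bytes // line) / sets


if __name__ == "__main__":
    print("L3 total:", total_size(L3_WAYS, L3_WAY_SIZE))
    print("L2 total:", total_size(L2_WAYS, L2_WAY_SIZE))
    print("2-way CAT partition:", total_size(2, L3_WAY_SIZE))
    for mib in range(1, 6):
        print(f"{mib}MB array -> {avg_lines_per_set(mib * MIB):.1f} lines per set")
```

Under this idealized view the 2-way partition is exactly 5MB, and only the 5MB array needs both allocated ways in every set. Whether the hardware achieves such an even spread is a separate question.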

HadiBrais · 04-03-2019

The number of cache lines filled into the L3 is the sum of:

IDI_MISC.WB_UPGRADE: The number of lines filled into the L3 from L2 writebacks.
OFFCORE_RESPONSE with MSR value of 0x063F800080: The number of lines filled into the L3 from local or remote memory due to data read hardware prefetching. My understanding from John's comments is that this is the only situation in which lines are filled into the L3 from memory due to data reads.

Regarding the number of cache lines filled into the L2:

The number of lines filled into the L2 from local or remote memory and not from the L3 can be measured using OFFCORE_RESPONSE with MSR value of 0x063F800011. This includes demand data read requests (from instructions that retire or otherwise), hardware prefetch data read requests, and (I think) software prefetch data read requests. It excludes all code and RFO requests.
The number of lines filled into the L2 from the L3 can be measured using OFFCORE_RESPONSE with MSR value of 0x3F803C0011. This includes demand data read requests (from instructions that retire or otherwise), hardware prefetch data read requests, and (I think) software prefetch data read requests. It excludes all code and RFO requests.
The number of lines filled into the L2 from the L3 or from memory for any type of request can be measured using L2_LINES_IN.ALL.
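
As a rough sketch of how the three OFFCORE_RESPONSE values above might be passed to a counter tool, the snippet below builds event strings for Linux perf. This is an assumption on my part, since the thread does not say which tool was used. It also assumes that OFFCORE_RESPONSE_0 is encoded as event 0xB7 with umask 0x01 and that the core PMU accepts the MSR value through the offcore_rsp field. The event labels are hypothetical.

```python
# Assumption: Linux perf on an Intel core PMU, with OFFCORE_RESPONSE_0
# encoded as event 0xB7 / umask 0x01 and the MSR value passed as offcore_rsp.
# The dictionary keys are hypothetical labels; the values come from the post.
OFFCORE_MASKS = {
    "l3_fill_from_mem_pf": 0x063F800080,
    "l2_fill_from_mem_data_rd": 0x063F800011,
    "l2_fill_from_l3_data_rd": 0x3F803C0011,
}


def perf_event(name, mask):
    """Format one raw offcore-response event descriptor for `perf stat -e`."""
    return f"cpu/event=0xb7,umask=0x01,offcore_rsp={mask:#x},name={name}/"


def perf_stat_args(masks=OFFCORE_MASKS):
    """Return an argument list that could be handed to subprocess.run()."""
    events = ",".join(perf_event(n, m) for n, m in masks.items())
    return ["perf", "stat", "-e", events]


if __name__ == "__main__":
    print(" ".join(perf_stat_args()))
```

The function only builds the command. It does not run it, so the encoding should be checked against the event list for the target CPU before relying on the counts.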

It would be interesting to also measure L2_LINES_OUT.SILENT and L2_LINES_OUT.NON_SILENT and see how they correlate with other events.

"the equation should be roughly correct: (# of cache lines loaded from memory to L2) + (# of cache lines loaded from memory to L3) = the # of IDI_MISC.WB_DOWNGRADE."

No, not necessarily. Consider the following cases:

If a line is filled into the L2 from the L3 or memory, but never evicted, then this equation will not hold. There will be one event on the left-hand side of the equation, but zero events on the right-hand side.
If a line is filled into the L2 from the L3 or memory, then evicted and not written into the L3, and then filled into the L2 again, there will be two events on the left-hand side of the equation, but one event on the right-hand side.
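
The two counterexamples can be replayed with a toy event counter. This is a minimal sketch, not a model of the hardware. It assumes, following the cases above, that every fill into the L2 adds one to the left-hand side and that every L2 eviction that is not written into the L3 adds one to IDI_MISC.WB_DOWNGRADE. The event names "fill" and "evict_dropped" are invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Counters:
    fills: int = 0          # left-hand side: lines filled into the L2
    wb_downgrade: int = 0   # right-hand side: L2 evictions not written to L3


def replay(events):
    """Count both sides of the proposed equation for one cache line's history."""
    c = Counters()
    for ev in events:
        if ev == "fill":
            c.fills += 1
        elif ev == "evict_dropped":
            c.wb_downgrade += 1
        else:
            raise ValueError(f"unknown event: {ev}")
    return c


# Case 1: filled once, never evicted.
case1 = replay(["fill"])
# Case 2: filled, evicted without an L3 write, filled again.
case2 = replay(["fill", "evict_dropped", "fill"])

if __name__ == "__main__":
    print(case1)
    print(case2)
```

In both histories the fill count exceeds the WB_DOWNGRADE count by one, so the two sides of the equation differ.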

McCalpinJohn · 04-04-2019

I made a fairly serious effort to find a set of events that counted all the traffic for an (L2+L3)-contained repeated read test, but did not find any fully satisfactory combinations....

L2_LINES_IN.ALL gave accurate counts (within 0.5% of expected values).

L2_TRANS.L2_WB does not count writebacks due to Snoop Filter Evictions, but (core) L2_TRANS.L2_WB + (CHA) SF_EVICTIONS.M+E+S matches expected counts (within 0.5%). Since addresses are hashed over all the CHAs, this formula only makes sense for the sum of all core L2_TRANS.L2_WB plus the sum of all CHA SF_EVICTIONS.M+E+S.
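
The socket-wide sum described above could be computed along these lines. This is a sketch under the assumption that the per-core and per-CHA counts have already been collected into plain lists. The function names and the way the 0.5% tolerance is applied are illustrative only.

```python
def total_l2_writebacks(l2_wb_per_core, sf_evictions_per_cha):
    """Sum core L2_TRANS.L2_WB counts and CHA SF_EVICTIONS (M+E+S) counts.

    Addresses are hashed over all CHAs, so only the totals over all cores
    and all CHAs are meaningful, not any per-core pairing.
    """
    return sum(l2_wb_per_core) + sum(sf_evictions_per_cha)


def within_tolerance(measured, expected, tol=0.005):
    """True if `measured` is within `tol` (0.5% by default) of `expected`."""
    return abs(measured - expected) <= tol * expected
```

A caller would pass one L2_TRANS.L2_WB count per core and one summed M+E+S count per CHA, then compare the total against the writeback count expected from the test's array size and iteration count.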

The sum of IDI_MISC.WB_UPGRADE + IDI_MISC.WB_DOWNGRADE is always smaller than L2_TRANS.L2_WB. For tests using an array sized to about 64% of the combined L2+L3 (200,000 doubles per core), the difference is negligible, but as I increased the array size the difference increased. For tests using an array sized to just under 90% of the combined L2+L3 (280,000 doubles per core), the sum of IDI_MISC.WB_UPGRADE and IDI_MISC.WB_DOWNGRADE was 6% lower than L2_TRANS.L2_WB (which is already lower than the actual number of L2 writebacks because it does not count writebacks due to Snoop Filter Evictions).
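
The 64% and 90% figures can be reproduced with a short calculation. The sketch assumes the test system has the same per-core cache capacity as the Gold 6138 described in the original post (1 MiB of L2 plus 28835840 / 20 bytes of L3 per core). The post does not state this, so treat it as a guess.

```python
# Assumption: per-core capacities taken from the Gold 6138 figures in the
# original post; the system used for these tests is not described.
L2_BYTES = 1048576
L3_BYTES = 28835840
CORES = 20
DOUBLE = 8  # bytes per double


def fraction_of_l2_plus_l3(doubles_per_core):
    """Array size per core as a fraction of the per-core L2 + L3 capacity."""
    per_core_capacity = L2_BYTES + L3_BYTES / CORES
    return doubles_per_core * DOUBLE / per_core_capacity


if __name__ == "__main__":
    for n in (200_000, 280_000):
        print(f"{n} doubles per core -> {fraction_of_l2_plus_l3(n):.1%}")
```

With these assumed capacities, 200,000 doubles come to roughly 64% and 280,000 doubles to just under 90% of the combined L2+L3, which agrees with the figures quoted above.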

The description of the event L2_LINES_OUT explicitly mentions that it counts when the victimization is due to an L2 cache fill -- so Snoop Filter Evictions are not counted by that event either.

The CHA event LLC_LOOKUP.WRITE undercounts by 3-4 orders of magnitude for dirty L2 victims and by 7 orders of magnitude for clean L2 victims, and I have not found any other direct way to capture writes to the L3. In some cases the write to L3 can be inferred using a combination of L2 traffic counts and mesh data traffic counts (HORZ_RING_BL_IN_USE and VERT_RING_BL_IN_USE), but I have not tried to determine whether this can be a general solution....