Analyzers
Talk to fellow users of Intel Analyzer tools (Intel VTune™ Profiler, Intel Advisor)

Interpreting '2nd level cache read misses retired' event

dsha2
Beginner
Hi all,
I'm trying to estimate the cache behavior of a fairly trivial C# program that does a lot of memory reads in a loop. There are two variants of the program, which I sketch below:
Common prelude:
int result = 0;
int N = 100000;
int[] array = new int[N];
for (int k = 0; k < N - 1; k++) array[k] = k + 1;
Variant 1:
for (int i = 0; i < N; i++)
    for (int j = 0; j < N - 1; j++)
        result += array[j];
Variant 2:
for (int i = 0; i < N; i++)
    for (int j = 0; j < N - 1; /* empty */)
    {
        result += j;
        j = array[j];
    }
I'm not sure this is the exact code I was running, but the idea should be clear. The first variant increments the loop counter itself, while the second variant fetches the next value of the counter from the array.
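For concreteness, a complete, self-contained version would look roughly like this (a reconstruction rather than the exact code I profiled; the class name and the Environment.TickCount timing are only there for illustration):

using System;

class CacheTest
{
    static void Main()
    {
        const int N = 100000;
        int result = 0;

        // array[k] == k + 1, so following j = array[j] visits the elements
        // in the same sequential order as j++ does.
        int[] array = new int[N];
        for (int k = 0; k < N - 1; k++) array[k] = k + 1;

        int start = Environment.TickCount;

        // Variant 1: the inner loop increments the index itself.
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N - 1; j++)
                result += array[j];

        int middle = Environment.TickCount;

        // Variant 2: the inner loop fetches the next index from the array.
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N - 1; /* empty */)
            {
                result += j;
                j = array[j];
            }

        int end = Environment.TickCount;

        Console.WriteLine("result = {0}", result);
        Console.WriteLine("Variant 1: {0} ms", middle - start);
        Console.WriteLine("Variant 2: {0} ms", end - middle);
    }
}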
My original intention was to study the effect of prefetching. I profiled the program with VTune 7.2, collecting samples for the '2nd level cache read misses retired' event (hopefully I remember its name correctly). The memory access pattern of both variants is the same (please note that they access the 'array' locations in sequential order, as 'array' is initialized appropriately). The running time of the second version is slightly longer, but the difference is small.
Using the VTune API and settings, I enabled profiling only for the code in question; all irrelevant activity was excluded from the profile.
Question: what strikes me is that VTune reports that the _first_ variant suffers ~3-4 times more cache misses than the second one, and this result is steady and repeatable. How can this happen?
Any ideas will be greatly appreciated.
Regards,
Dmitry
David_A_Intel1
Employee
Your post says "C#". Did you really mean C# or C++? It is relevant.
dsha2
Beginner
Hi Dave,
thanks for your response.
I indeed meant C# - the application was written in C# and runs under .NET Framework 1.1. Why is it relevant? I turned optimization on and inspected the assembly code produced by the JIT compiler. I don't think there is a big difference between that code and the code a C++ compiler would produce for my program.
David_A_Intel1
Employee
Are you sure? Typically, .NET code is going to be jumping in and out of the CLR and this is going to significantly impact caching.
dsha2
Beginner
Hi Dave,
ok, I see your point.
However, I'm absolutely sure that no hidden activity is involved in my scenario. Please note that the code in question simply accesses an array of integers - all such references are translated into native x86 instructions. There were no function calls in the assembly code. If you are interested, I can compare the assembly code generated for both variants of the C# code by the JIT compiler.
David_A_Intel1
Employee
Sure! It would be interesting for everyone, I think, to look at the code and compare it.
dsha2
Beginner
Hi Dave,
I'm back with the assembly code we were talking about yesterday.
I've obtained this code using the debugger and attempted to preserve the correspondence to source lines.
______________________________________________________
Variant 1:

for (int i = 0; i < array.Length; i++)
0000000f xor edi,edi
00000011 cmp dword ptr [esi+4],0
00000015 jle 0000003D
00000017 cmp dword ptr [esi+4],eax

for (int j = 0; j < array.Length - 1; j++)
0000001a xor edx,edx
0000001c mov ecx,dword ptr [esi+4]
0000001f dec ecx
00000020 cmp ecx,0
00000023 jle 00000037
00000025 cmp dword ptr [esi+4],eax

ourResult += array[j];
00000028 mov eax,dword ptr [esi+edx*4+8]
0000002c add dword ptr ds:[00935100h],eax

for (int j = 0; j < array.Length - 1; j++)
00000032 inc edx
00000033 cmp edx,ecx
00000035 jl 00000028

for (int i = 0; i < array.Length; i++)
00000037 inc edi
00000038 cmp edi,dword ptr [esi+4]
0000003b jl 0000001A

_______________________________________________________________________
Variant 2:

for (int i = 0; i < array.Length; i++)
0000000f xor edi,edi
00000011 cmp dword ptr [esi+4],0
00000015 jle 0000003E
00000017 cmp dword ptr [esi+4],eax

for (int j = 0; j < array.Length - 1; /* empty */)
0000001a xor edx,edx
0000001c mov ecx,dword ptr [esi+4]
0000001f dec ecx
00000020 cmp ecx,0
00000023 jle 00000038

j = array[j];
00000025 cmp edx,dword ptr [esi+4]
00000028 jae 00000041
0000002a mov edx,dword ptr [esi+edx*4+8]

ourResult += j;
0000002e add dword ptr ds:[00935100h],edx

for (int j = 0; j < array.Length - 1; /* empty */)
00000034 cmp edx,ecx
00000036 jl 00000025

for (int i = 0; i < array.Length; i++)
00000038 inc edi
00000039 cmp edi,dword ptr [esi+4]
0000003c jl 0000001A

0000003e pop esi

}
0000003f pop edi
00000040 ret

00000041 xor ecx,ecx
00000043 call 7252F933
00000048 int 3

The code is not the same as in the original post, but the differences are unimportant. Please note that I was not quite right that the code does not contain any calls: Variant 2 contains bounds-checking code that calls a function handling the case of an invalid array index. In Variant 1, the JIT is able to prove that no out-of-bounds access is possible, so it simply removes the bounds-checking code. This bounds-checking code is also the reason why Variant 1 is slightly faster. However, as no out-of-bounds references actually occur in Variant 2, it should not affect cache behavior.
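In C# terms, my understanding of what the JIT does is roughly this (a sketch of the two patterns, assuming its range analysis works the way I described above):

// Pattern the JIT can prove safe: the index is bounded by array.Length,
// so the range check on array[j] is removed.
for (int j = 0; j < array.Length - 1; j++)
    ourResult += array[j];

// Pattern the JIT cannot prove safe: the index is loaded from the array
// at run time, so every access keeps its range check
// (the cmp/jae pair at 00000025/00000028 in the Variant 2 listing).
for (int j = 0; j < array.Length - 1; /* empty */)
{
    ourResult += j;
    j = array[j];   // value unknown at JIT time, so the check stays
}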
So back to the original problem, I'm still mystified why the L2 cache behavior shown by VTune is so different. Any ideas?
dsha2
Beginner
I'm still here and waiting for ideas :)