Hi,
I am generating a synthetic C benchmark aimed at causing a large number of instruction fetch misses via the following Python script:
```python
#!/usr/bin/env python
import tempfile
import random
import sys

if __name__ == '__main__':
    functions = list()
    # Generate 10000 dummy functions with random names.
    for i in range(10000):
        func_name = "f_{}".format(next(tempfile._get_candidate_names()))
        sys.stdout.write("void {}() {{\n".format(func_name))
        sys.stdout.write("    double pi = 3.14, r = 50, h = 100, e = 2.7, res;\n")
        sys.stdout.write("    res = pi*r*r*h;\n")
        sys.stdout.write("    res = res/(e*e);\n")
        sys.stdout.write("}\n")
        functions.append(func_name)
    # main() loops 100000 times over 10000 randomly chosen calls.
    sys.stdout.write("int main() {\n")
    sys.stdout.write("    unsigned int i;\n")
    sys.stdout.write("    for(i = 0; i < 100000; i++) {\n")
    for i in range(10000):
        r = random.randint(0, len(functions) - 1)
        sys.stdout.write("        {}();\n".format(functions[r]))
    sys.stdout.write("    }\n")
    sys.stdout.write("}\n")
```
The script simply generates a large number of randomly named dummy functions, which are then called in random order from main(). I am compiling the resulting code with gcc 4.8.5 under CentOS 7 with -O0.
. The code is running on a dual socket machine fitted with 2x Intel Xeon E5-2630v3 (Haswell architecture).
What I am interested in is understanding the instruction-related counters reported by perf. In particular, I am observing the following counters with perf stat:
- instructions
- L1-icache-load-misses (instruction fetches that miss L1, aka r0280 on Haswell)
- r2424, L2_RQSTS.CODE_RD_MISS (instruction fetches that miss L2)
- rf824, L2_RQSTS.ALL_PF (all L2 hardware prefetcher requests, both code and data)
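As a sanity check on the raw event codes above: on Intel, perf's rXXXX raw format encodes (umask << 8) | event_select in hex. A minimal sketch (the helper name perf_raw is mine, not a perf API):

```python
# Build a perf raw event string from an Intel event select and umask.
# perf's raw format on Intel is r<umask><event_select>, both as two hex digits.
def perf_raw(event_select, umask):
    return "r{:02x}{:02x}".format(umask, event_select)

# L2_RQSTS.CODE_RD_MISS: event 0x24, umask 0x24
print(perf_raw(0x24, 0x24))  # r2424
# L2_RQSTS.ALL_PF: event 0x24, umask 0xf8
print(perf_raw(0x24, 0xf8))  # rf824
```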
I first profiled the code with all hardware prefetchers disabled in the BIOS, i.e.
- MLC Streamer Disabled
- MLC Spatial Prefetcher Disabled
- DCU Data Prefetcher Disabled
- DCU Instruction Prefetcher Disabled
and the results are the following (the process is pinned to the first core of the second CPU and to the corresponding NUMA domain, but I guess this doesn't make much difference):
```
perf stat -e instructions,L1-icache-load-misses,r2424,rf824 \
    numactl --physcpubind=8 --membind=1 /tmp/code

 Performance counter stats for 'numactl --physcpubind=8 --membind=1 /tmp/code':

    25,108,610,204      instructions
     2,613,075,664      L1-icache-load-misses
     5,065,167,059      r2424
                17      rf824
```
Considering the figures above, I cannot explain such a high number of instruction fetch misses in L2. I have disabled all prefetchers, and L2_RQSTS.ALL_PF confirms it. But why do I see roughly twice as many instruction fetch misses in L2 as in L1i? In my (simple) mental model of the processor, if an instruction is looked up in L2, it must necessarily have missed in L1i first. Clearly I am wrong; what am I missing?
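For concreteness, here is the arithmetic behind my confusion, using the values from the run above (this is just a sanity check of the ratio, not part of the benchmark):

```python
# Counter values copied from the perf output above (prefetchers disabled).
instructions = 25108610204
l1i_misses   = 2613075664   # L1-icache-load-misses
l2_code_miss = 5065167059   # r2424, L2_RQSTS.CODE_RD_MISS

# L2 code-read misses per L1i miss: naively this should be <= 1,
# since every L2 code lookup should imply a prior L1i miss.
ratio = l2_code_miss / float(l1i_misses)
print("{:.2f}".format(ratio))  # ~1.94

# Misses per thousand instructions (MPKI) at each level.
print("{:.0f}".format(1000.0 * l1i_misses / instructions))
print("{:.0f}".format(1000.0 * l2_code_miss / instructions))
```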
I then tried to run the same code with all the hardware prefetchers enabled, i.e.
- MLC Streamer Enabled
- MLC Spatial Prefetcher Enabled
- DCU Data Prefetcher Enabled
- DCU Instruction Prefetcher Enabled
and the results are the following:
```
perf stat -e instructions,L1-icache-load-misses,r2424,rf824 \
    numactl --physcpubind=8 --membind=1 /tmp/code

 Performance counter stats for 'numactl --physcpubind=8 --membind=1 /tmp/code':

    25,109,877,626      instructions
     2,599,883,072      L1-icache-load-misses
     5,054,883,231      r2424
           908,494      rf824
```
Now L2_RQSTS.ALL_PF indicates that some prefetching is happening. Although I expected the prefetchers to be more aggressive, I imagine the instruction prefetcher is severely stressed by this jump-intensive workload, and the data prefetchers have little to do here. But L2_RQSTS.CODE_RD_MISS remains far too high even with the prefetchers enabled.
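Comparing the two runs numerically (values copied from the two perf outputs above; again just a sanity check of my reading of the counters):

```python
# Counter values copied from the two perf runs above.
pf_off = {"l1i": 2613075664, "l2_code": 5065167059, "l2_pf": 17}
pf_on  = {"l1i": 2599883072, "l2_code": 5054883231, "l2_pf": 908494}

# L2 prefetch requests are a tiny fraction of L2 code-read misses,
# so enabling prefetching barely changes the picture.
frac = pf_on["l2_pf"] / float(pf_on["l2_code"])
print("{:.4%}".format(frac))  # well under 0.1%

# The L2-to-L1i miss ratio is essentially unchanged by prefetching.
print("{:.2f}".format(pf_off["l2_code"] / float(pf_off["l1i"])))
print("{:.2f}".format(pf_on["l2_code"] / float(pf_on["l1i"])))
```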
So, to sum up, my question is:
With hardware prefetchers disabled, L2_RQSTS.CODE_RD_MISS is roughly twice as high as L1-icache-load-misses, and enabling the prefetchers barely changes this. What is the reason behind such a high count of L2_RQSTS.CODE_RD_MISS compared to L1-icache-load-misses?
Thank you very much for your help,
Marco