Right now I am building a model of software execution on Intel's CPUs. Part of this model should be dedicated to the CODE side. Most of what I need is available out of the box in VTune, but I would like to have a bit more. I hope somebody here can give advice.
- What is the CODE traffic in the app? In other words, how much code in MB/s is transferred over the buses, memory channels, QPI, LLC<->MEM, L2<->LLC, and L1<->L2? Is this possible in general, and which events should be used? Right now I am using OFFCORE_RESPONSE.DEMAND_CODE_RD.LLC_MISS.ANY_RESPONSE_0 and OFFCORE_RESPONSE.DEMAND_CODE_RD.LLC_HIT.ANY_RESPONSE_0 to get memory and LLC CODE traffic. Are there any options for the rest of the hierarchy? How reliable are L2_TRANS.CODE_RD, L2_RQSTS.CODE_RD_MISS, and L2_RQSTS.CODE_RD_HIT -- can they be used for this purpose? Other options?
- How are the caches (LLC and L2) partitioned between CODE and DATA? Any ideas?
I am not aware of any directed testing of the performance counters for instruction cache miss traffic on recent Intel processors, so you may have to figure out how to build some of your own test cases. It looks like you have identified the relevant core counters -- if you can figure out how to build code with predictable memory access patterns you should be able to generate predictable cache miss rates at the L1 and L2 levels to test these counters. The situation may be more complex at the LLC level, as I discuss below.
The L2 and LLC are unified caches with no explicit partitioning between code and data.
Intel provides few details on cache replacement policies, but it seems likely that L1 instruction cache misses will place cache lines containing instructions into the L2 cache using the same LRU mechanism as data cache lines. There is no direct penalty if instruction cache lines get evicted from the L2 -- they are allowed to stay in the L1 instruction cache. If the line is evicted from the L1 instruction cache due to an L1 capacity or conflict overflow, it will likely remain in the LLC, so the added penalty for not forcing the line to stay in the L2 cache is not terribly large.
The situation is different with the LLC since it is inclusive. If a cache line is evicted from the LLC, it must also be evicted from all of the L2 caches and all of the L1 instruction and L1 data caches that share that LLC. If the LLC evicts actively used instruction cache lines, then the processor will likely stall for a long time while the instruction cache lines are brought back into the L1 instruction cache from memory. To prevent the LLC from evicting lines (either instruction or data) in active use by the L1 and/or L2 caches, it is possible to send "hints" to the LLC that certain lines are in active use and should not be allowed to "age" to the LRU position. A fair number of papers in this area are from Intel, so it is not inconceivable that Intel may have implemented mechanisms that reduce the turnover of active instruction cache lines in the LLC. This would not amount to anything resembling static partitioning, but could result in unanticipated cache hit/miss behavior.
Volume 3 of the Intel SW Developers Manual discusses a feature of future Intel Xeon processors called "Cache Allocation Technology" (section 17.15 of document 325384-052). I don't see any specific references to instruction vs data caching in this discussion, but it could certainly be used to provide priority to instruction-cache-miss-heavy applications over data-cache-miss-heavy applications, which would result in an apparent bias toward instruction caching in the LLC. (Of course it could be configured in the opposite way as well.)