Mark_Rubelmann
Beginner
60 Views

Short loop + high data use = instruction fetch stalls?

I'm puzzled by what VTune is telling me about the following code. It's a very tight loop that walks through an image to generate a histogram. Pretty simple! Here's a snippet:
[cpp]const uint8_t* in = a_input;
for(uint32_t y = 0; y < a_height; y++)
{
    for(uint32_t x = 0; x < a_width; x++)
    {
        histo[(uint32_t)*in] = histo[(uint32_t)*in] + 1;
        in++;
    }
}
[/cpp]
Basically all of the time is spent in the line that increments the histogram bin. As I'd expect, it looks like the hardware prefetcher is doing a good job and L2 cache misses are very low. (I'm on a Core 2 Duo, btw.) The L1 data cache miss rate seems a bit high, which sort of surprises me since the histogram is only 1K, which is a lot smaller than the L1 cache, right? Then the really surprising thing is that the tuning assistant is telling me the biggest bottleneck is instruction fetch stalls. Since this code is so short, locality can't possibly be an issue and I would think it would stay put in the L1 instruction cache and be easy to fetch. Does high L1 data cache usage impact the L1 instruction cache that much? I thought they were separate entities. Does anyone have any insight into what's going on here? I'm pretty new to optimization in general and very new to VTune...
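
In case it helps anyone reproduce this, here's the loop wrapped up so it compiles standalone. (histo is just a 256-entry uint32_t array, hence the 1K I mentioned; the function name is just something I made up for the snippet.)
[cpp]#include <cstdint>

// Same inner loop as above, wrapped so it compiles on its own.
// "histo" is 256 uint32_t bins = 1K total.
void build_histogram(const uint8_t* a_input, uint32_t a_width,
                     uint32_t a_height, uint32_t histo[256])
{
    const uint8_t* in = a_input;
    for(uint32_t y = 0; y < a_height; y++)
    {
        for(uint32_t x = 0; x < a_width; x++)
        {
            histo[(uint32_t)*in] = histo[(uint32_t)*in] + 1;
            in++;
        }
    }
}
[/cpp]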

Thanks in advance,
Mark
3 Replies
Peter_W_Intel
Employee

Hi,

Two things -

You saw that the L1 data cache miss rate seems a bit high. Since you convert the data type from uint8_t to uint32_t on every access, the next instruction has no opportunity to prefetch data, which is usually 32-bit aligned. Please consider declaring the variable "in" with the data type uint32_t*.

You saw that the biggest bottleneck is instruction fetch stalls. The L1 instruction cache line size is 64 bytes on each core of a Core 2 Duo. Check your assembly code: is the total instruction size of the loop greater than 64 bytes?

Please check the CPI (Cycles per Instruction) value in the critical code from the VTune Analyzer result. In my experience, 0.3~0.8 is good if there are x87/SSE instructions. The Tuning Assistant always shows you the top issue, but in general we check the CPI value first, then find the root cause.

Another tip is to use the Intel® C++ compiler with optimization switches to improve performance, for example when data is not aligned.

Regards, Peter
Mark_Rubelmann
Beginner

Hi Peter,

Thanks for the suggestions. I played around with it some more based on what you said and I managed to get some improvement in terms of what VTune is reporting. The actual execution time hasn't changed a bit however! Here are two other variants I tried.

The first one did a bit worse than the original, although like I said, the execution time remained the same. The instruction fetch stalls went from 1.0% to 1.1%, and the data cache misses went up a tiny bit. Other things (including CPI) remained about the same.
[cpp]uint32_t n = a_width * a_height;
const uint32_t* in = (const uint32_t*)a_input;
for(uint32_t i = 0; i < n; i++)
{
    uint8_t x = (uint8_t)(*in & 0xFF);
    histo[x] = histo[x] + 1;
    in = (const uint32_t*)((const uint8_t*)in + 1);
}
[/cpp]

The second one did much better. CPI dropped to 0.6, instruction fetch stalls went all the way down to 0.5%, and the data cache misses were back to what they were in the original.
[cpp]uint32_t n = a_width * a_height;
const uint32_t* in = (const uint32_t*)a_input;
for(uint32_t i = 0; i < n; i += 4)
{
    uint32_t x = *in;
    histo[(uint8_t)(x & 0xFF)]++;
    x = x >> 8;
    histo[(uint8_t)(x & 0xFF)]++;
    x = x >> 8;
    histo[(uint8_t)(x & 0xFF)]++;
    x = x >> 8;
    histo[x]++;
    in++;
}[/cpp]

In all of these cases, the machine code for the loop is under 64 bytes. I tried making sure the first instruction of the loop was 32-byte aligned just in case that has any implications for fetching, but it didn't change anything.

Here's one interesting thing however... After reading the tuning assistant's message again, I noticed that the ratio it's complaining about (InstructionFetchStall) is CYCLES_L1I_MEM_STALLED/CPU_CLK_UNHALTED.CORE*100, but the description makes it sound like the problem is with ITLB misses, not necessarily L1 instruction cache misses. This is getting well beyond my knowledge of the microarchitecture, but after doing a bit of reading and head-scratching, I've come to the conclusion that the only chance you've got of improving ITLB misses is good code locality, which this code certainly has. (Please correct me if I'm wrong!)

Anyway, I think I'm about at the point where I'll assume this is as fast as it's gonna get. I'm hopeful that my last version of the function will outperform the original on other architectures even though they're all equal on the Core 2 I'm testing on. It's been a while since I've tried it, but I think I got a pretty nice speedup when doing that sort of thing on an older AMD box.
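
For what it's worth, the trick I half-remember from that AMD box was keeping several partial histograms so that runs of identical pixels don't serialize on a single counter, then summing them at the end. Here's a sketch of what I mean (untested on this Core 2, so treat it as an idea rather than a measured win, and the function name is just made up):
[cpp]#include <cstdint>
#include <cstring>

// Sketch: four partial histograms so consecutive identical pixels
// update different counters instead of the same bin back-to-back.
void histogram4(const uint8_t* in, uint32_t n, uint32_t histo[256])
{
    uint32_t part[4][256];
    memset(part, 0, sizeof(part));

    uint32_t i = 0;
    for(; i + 4 <= n; i += 4)
    {
        part[0][in[i + 0]]++;
        part[1][in[i + 1]]++;
        part[2][in[i + 2]]++;
        part[3][in[i + 3]]++;
    }
    for(; i < n; i++)   // leftover tail pixels
        part[0][in[i]]++;

    for(uint32_t b = 0; b < 256; b++)
        histo[b] = part[0][b] + part[1][b] + part[2][b] + part[3][b];
}
[/cpp]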

Thanks,
Mark
Peter_W_Intel
Employee

Hi Mark,

You did a great job using shifts instead of data type conversions in the second example. A CPI of 0.6 is a good value for performance.

CYCLES_L1I_MEM_STALLED count indicates one of below reasons in code:

  • Instruction Fetch Unit cache misses
  • Instruction TLB misses
  • Instruction TLB faults

If you want to detect L1 instruction cache misses, please use the counter L1I_MISSES. (I guess that you have no such issue, because the code for the loop is under 64 bytes.)

If you want to detect ITLB misses, please use the counter ITLB.MISSES (for the L1 ITLB and L2 ITLB); it indicates that the executed code is spread over too many pages, causing many instruction TLB misses.

In general, the developer should keep the control flow of the code with as few branches as possible (for example, no jumps and a limited amount of code in the loop).

Regards, Peter
