Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.
Announcements
FPGA community forums and blogs have moved to the Altera Community. Existing Intel Community members can sign in with their current credentials.

examine L1 instruction cache

eantyshev
Beginner
6,388 Views
Greetings!

I has been encouraged by some people to measure the effect of L1 instruction cache.
(is known to be of 32kb on core2duo machine).
Methodology which I chose:

1) generate 100kb of uniform code, like

start:
jmp ebx (ebx = $ + 100kb - codesize)
...
dec eax
dec eax
dec eax (this instruction is known to have 1 byte machine code)
...
cmp eax, 0
jge start

2) Place 2*10^9 to eax
3) Calculate jump address to place in ebx, so the amount of code executed is f.e. 10kb
4) Execute:)

In my unqualified opinion, this should lead to that
function would execute (roughly) the same amount of instructions (2*10^9),
which are localized in variable-sized code block.


Therefore, I expect an increase in execution time when codesize exceeds 32kb -- L1 instruction cache.

Sad, but this methodology doesn't work:(
Times are the same for codesize range from 1 to 64kb

Perhaps, I am totally wrong, and didn't catch what L1I cache is about...
Or this may be because of some processor-internal optimizations, which shrink the code size.

Best regards,
Evgeny.
0 Kudos
15 Replies
Patrick_F_Intel1
Employee
6,387 Views
Hello Evgeny,
There is a code prefetcher which fetches instructions from the L2 into the L1i.
This may be what you are seeing (where the is no change in performance for different size of codes).

The core2duo cpu can retire up to 4 instructions per cycle (IPC).

Can you do a rdtsc at the start of the code and a rdtsc at the end of the code and then report the difference in clockticks?
It would be helpful if you did this for 4K, 8K, 16k, 24k, 32k, 40k, 64k.
Then we could computeIPC for each size where IPC='instructions'/'difference in rdtsc cycles'.

Also, are you doing multiple loops over the code?
The first time through the code then the code won't be in the L1i so you'd need to do something like 100 loops (just to reduce the impact of the first time through the loop).

Using VTune Amplifier we could measure with events the L1i misses, L1i prefetches, L2 hits, etc.

So I guess I'm saying I need more detail to be able to say exactlywhat is going on.
Pat
0 Kudos
eantyshev
Beginner
6,387 Views

Thanks for your reply, Patrick!


CPI rate (got from vtune) seems to be independent from codesize and is about 1.00

Intersting results I get for L1I* performance (same number of "dec eax" instructions):

Codesize CPU_CLK_UNHALTED.CORE CYCLES_L1I_MEM_STALLED L1I_MISSES L2_IFETCH.SELF.MESI

10kb 200_300_000_000 426_300_000 6_300_000 11_100_000

20kb 200_400_000_000 775_600_000 18_900_000 42_140_000

40kb 201_226_000_000 35_160_000_000 1_570_000_000 3_131_000_000

100kb 201_500_000_000 35_410_000_000 1_548_000_000 3_123_000_000

These stats signify, that multiple L1I misses really take place when codesize > 32kb, but why so little increase in execution time?

Instruction starvation (CYCLES_L1I_MEM_STALLED/CPU_CLK_UNHALTED) grows from 0.002 to 0.17, and no decrease in application performance!

Are L1I_MISSES so cheap: less than 1 cycle per event?
And why L2_IFETCH events don't complement to cycles count?

Any considerations are highly valuable to me!

Thanks,
Evgeny
0 Kudos
Patrick_F_Intel1
Employee
6,388 Views
Great data collection Evgeny!
Just to recap... the CPI is about 1.0. So you are getting probably about peak performance of 1 instruction retired per cycle.

First a little background...
The 'dec eax' produces a dependency on the contents of eax. So only 1 'dec eax' can be executed per cycle.
The peak instructions retired per cycle is 4 on core 2 duo.
Each instruction gets decoded into 1 or more uops (micro-operations) and dispatched to a port that can handle that type of uop.
The 'dec' instruction gets executed in the ALU execution unit. There are 3 ALU units on core 2 duo.
You probably could get 3 instructions per cycle if you changed the code from:
dec eax
To:
dec eax
dec ecx
dec edx

You can see a block diagram of thesandy bridge processor in the optimization guide (June 2011, figure 2-1, page 40, www.intel.com/Assets/PDF/manual/248966.pdf) or the core cpu block diagram in figure 2-2.
These block diagrams show how the ports and execution units are logically layed out in the processor.

All of the above doesn't really answer your questions but hopefully it is useful to you or others.

You ask: "Are L1i misses so cheap? Less than 1 cycle per event?"
L1i misses are not cheap but how expensive a miss is 'depends'.
In the case of your example, you've coded probably the best case scenario for L1i misses.
That is, the instructions are just 1 byte each (so 64 instructions per cache line) and there are no branches so the prefetcher works perfectly.
An L1 miss+L2 hit takes 10 cycles but you can have multiple misses outstanding per cycle. This 'multiple outstanding misses per cycle' reduces the effective latency of a miss.
Again, we're getting about 1 instruction per cycle so to do a whole cache line takes 64 cycles (for your example).

The L1i_missesare low probably because the prefetcher has already brought the instructions in the L1i before a 'miss' is generated.

For the L2_ifetch for the 10k & 20k case is about what we'd expect. The L2_ifetches are about 74x-284x lower for 'in icache case' than the 'out of icache case'.
The L2_ifetch is aboutwhat we'd expect for the 40k & 100k cases.
You ran for 200billion cycles and the same number of instructions.
The number of instructions in the 3.1 billion L2 misses is 3.1*64= 198 billion instructions.
So all of the instructions for the 40k & 100k are fetched from the L2.

Does this make sense?
Pat
0 Kudos
eantyshev
Beginner
6,387 Views
Surely it makes perfect sense, Patrick!

Your response turned me towards following decisions:
1) I need longer instructions: just to make instruction fetches more frequent. Now I tend to start using FPU.
The guide mentions SSE2 as an example when performance suffer from instruction starvation, but I'm not quite familiar with it.

2) The only way I see to reduce instruction prefetch effeciency are calculated jumps (my primary goal was to write application to examine L1I cache, and prefetcher ruined the business:)).

Best regards,
Evgeny.
0 Kudos
Patrick_F_Intel1
Employee
6,388 Views

If you want to maximize the number of L1i misses, you might try page aligning a jmp to next page, like:

align 4096 bytes
start: jmp next1
nop
nop

align 4096 bytes
next1: jmp next2
nop

align 4096 bytes
jmp start

The idea is that there are only so many ways (L1i size/4096)in theL1i.
So, for a 32 KB L1i, you can hold 8 unique lines where the addresses are multiples of 4KB.
The pseudo LRU algorithm controls exactly which cache line gets replaced.
This '4kb jmp' tactic willprobably show most expensive L1i cache miss since the miss may include a page miss and TLB miss.
You need probably at least 16 page-aligned jmps.
Don't exceed the size of the L2 if you want the L1i cache misses to come from the L1i misses come the L2.
Pat

0 Kudos
eantyshev
Beginner
6,388 Views
Hello, Patrick!

I hope you remember the main idea of our discussion.
The cornerstone of the behavior of my program was L1I prefetching,
which gave me no performance degradation depending on code size.

I've recently succeeded in randomized jumping through the code.

It looked like 64kb of the following code units:
$exec_loop:
....

jmp edx /* jump to random unit */
ALIGN 16 /* 1 unit -- 0x40 bytes */
...

I thought it would 'neutralize' the prefetcher: next jump address is known a few instructions before.
And I should say it does: timing shows 10% degradation in performance between code sizes from 10kb to 40kb.

But L1I_MISSES counter increases in million times, from 4_000_000 to 1_500_000_000_000.
And I beleive something else could be done in order to make performance degrading more dramatically.

Every consideration is welcome!

Best regards,
Evgeny.
0 Kudos
Patrick_F_Intel1
Employee
6,388 Views

Hello Evgeny,
I'm not sure why the L1i_misses increases so much.
Does instructions retired and CPU_CLK_UNHALTED.CORE stay about the same?
Pat

0 Kudos
eantyshev
Beginner
6,388 Views
Hello!

First, I should apologize for previous post numbers: more accurate measurement has shown that L1I_MISSES grows 100 times, not million.

The counters are below:

Codesize CPU_CLK_UNHALTED.CORE INST_RETIRED.ANY CYCLES_L1I_MEM_STALLED L1I_MISSES UOPS_RETIRED.ANY
10kb 2,15722E+11 38106000000 445900000 8400000 58198000000
20kb 2,1611E+11 38138000000 1,065E+09 49000000 58156000000
40kb 2,2624E+11 37946000000 2,619E+10 1901900000 60130000000
50kb 2,34768E+11 38084000000 5,032E+10 3582600000 60886000000


From this counters I make conclusions:
1) Instructions retired count is almost the same -- effect of performance degradation is really L1I-related.
2) Execution time is proportional to CPU_CLK_UNHALTED.CORE, which increases by 10%. This cycles are probably wasted on L1I misses handling.
3) L1I_MISSES increases in hundred times, but that affects performance moderately. Seems like 1 L1 instruction miss is cheap (like 1-2 extra instructions). BTW, you mentioned it weights 10 cycles, does it fit?

Do you think the reason for low performance degradation is L1 miss cheapness?
Or could it be that prefetcher still is able to calculate jump address?

Looking for your response.

Best regards,
Evgeny.
0 Kudos
Patrick_F_Intel1
Employee
6,388 Views

Hello Evgeny,
The CPI for the 10kb case is about 5.66 clockticks/instruction_retired.
The CPI for the 50kb case is about6.16 clockticks/instruction_retired.
So the CPI increases about 1.089x (about 9%) due to more L1i_misses.

The L2 latency is 10 clockticks. That is, it takes 10 clockticks from the time a miss is issued until the cache line is fetched from the L2 and ready to use.

For the 50kb case, you have 10.6 instructions_retired/L1i_miss.
So about every 10th instruction you miss in the L1i.
If I take make a spreadsheet where the 10 instructiontakes 5.66 clocks except the 10 instruction takes 5.66+10 clocks, then I get an overall CPI for the 10 instructions of 6.66 clocks/inst.

The above assumes that only the jump instruction gets a miss and that all the instructions that you are using to calculate a random jump location fit on one cacheline. If it takes more than 1 cacheline for the 'calc next jump location' code then you will get more misses.
There is yet another mechanism which can act like a prefetcher: the out of order exection.
For out of order execution, the CPU speculatively executes instructions ahead of the current location.
There are usually multiple instructions in flight.
Some of these instructions may cause L1i misses so thecode will be getting fetched from the L2 while other instructions already in flight are exectued.
This reduces the 'effective' latency from the worst case 10 cycle L2 latency.

If i go back to my simple spreadsheet and put in a 5 clocktick L1i_miss penalty then I compute an overall CPI (for the 10 instructions at 5.66 clocks + 5 clocks L1i_miss penalty) then I get a CPI of 6.16.

So it looks like we are getting cache misses every 10 instructions and the 'effective' latency of the miss is about 5 clockticks.
Between the code prefetcher (which you can't disable) and the out-of-order execution (which you can't disable), trying to create an L1i_miss experiment can be quite difficult.

Does this make sense?
Pat

0 Kudos
Patrick_F_Intel1
Employee
6,388 Views

Hello Evgeny,
The CPI for the 10kb case is about 5.66 clockticks/instruction_retired.
The CPI for the 50kb case is about6.16 clockticks/instruction_retired.
So the CPI increases about 1.089x (about 9%) due to more L1i_misses.

The L2 latency is 10 clockticks. That is, it takes 10 clockticks from the time a miss is issued until the cache line is fetched from the L2 and ready to use.

For the 50kb case, you have 10.6 instructions_retired/L1i_miss.
So about every 10th instruction you miss in the L1i.
If I take make a spreadsheet where the 10 instructiontakes 5.66 clocks except the 10 instruction takes 5.66+10 clocks, then I get an overall CPI for the 10 instructions of 6.66 clocks/inst.

The above assumes that only the jump instruction gets a miss and that all the instructions that you are using to calculate a random jump location fit on one cacheline. If it takes more than 1 cacheline for the 'calc next jump location' code then you will get more misses.
There is yet another mechanism which can act like a prefetcher: the out of order exection.
For out of order execution, the CPU speculatively executes instructions ahead of the current location.
There are usually multiple instructions in flight.
Some of these instructions may cause L1i misses so thecode will be getting fetched from the L2 while other instructions already in flight are exectued.
This reduces the 'effective' latency from the worst case 10 cycle L2 latency.

If i go back to my simple spreadsheet and put in a 5 clocktick L1i_miss penalty then I compute an overall CPI (for the 10 instructions at 5.66 clocks + 5 clocks L1i_miss penalty) then I get a CPI of 6.16.

So it looks like we are getting cache misses every 10 instructions and the 'effective' latency of the miss is about 5 clockticks.
Between the code prefetcher (which you can't disable) and the out-of-order execution (which you can't disable), trying to create an L1i_miss experiment can be quite difficult.

Does this make sense?
Pat

0 Kudos
eantyshev
Beginner
6,388 Views
Thanks again for your kind attention, Patrick!

I shall try changing several parameters in my program to become more sure about your vision
of what's happening.


Would you please answer one more question concerning L1I experimenting.
(This was my original attempt to catch an effect.)

There are 2 processes, localized at the same core, but in different hyperthreads.
1st process loops over 20kb of code, and I measure it's performance.

2nd process loops over variable amount of code, ranged from 5 to 20kb.

As far as these processes share the same L1I cache simultaneously, the performance
of former one should decrease.

But actually, it doesn't. Effect appeared to be less than measurement error, under 5%

What could be the cause of such behavior?

Very much grateful again,
Evgeny.
0 Kudos
Patrick_F_Intel1
Employee
6,388 Views

Hello Evgeny,
This result is much harder to say exactly what is going on.
I'll assume that another way of saying what you see is:
1) when I ran on 1 thread I get performance of X (say 100 operations/sec)
2) when I ran on 2 threads (1 per logical cpu) I get performance near 2*X (so, with the above example about 200 operations/sec)
3) why do we get about 2*X performance when the 2 threads compete for the same L1i cache?

It seems that the code prefetchers and out-of-order execution can fetch the missing lines from L2 quickly enough to avoid delays.
Recall that the L2 miss latency is only 10 clocks and you had a CPI of about 5 so you would only have to have a couple of instructions in flight (speculatively being executed) in order to completely hide the L2 miss latency.

This scenario also demonstratesone ofthe wayshyper-threading can really boost performance. When 1 of the threads on acore stalls due to an L1i miss the other thread can continue executing.
One of the biggest stalls is reading memory.
Hyper-threading can help software with lots of random memory accesses, frequently seen in server software.
Random memory accesses expose the latency of the memory since the prefetchers are defeated by the randomness.
When 1 thread stalls due to a last level cache miss (waiting on memory to respond), the other hyper-thread frequently has work to do and is able to run.
Make sense?
Pat

0 Kudos
eantyshev
Beginner
6,388 Views

Hello, Patrick!

Concerning your answer to my last question:

L1I_MISSES count increases propotionally to P(L1I miss probability) = MAX((codesize - L1Icachesize)/codesize, 0)
But performance isn't affected: this means that prefetcher does its work in parallel with main execution.
it seems the prefetcher is loaded more when 2 threads are running at single core (comparing to 1), butthe prefetcheris powerful enough to handle both, is it correct?



I'm still investigating situation from my post N6 (random jumping in variable amount of code).
There are a few questions arised, and I have no person around to clarify this.

1) CPU_CLK_UNHALTED counter fits the formula:
CPU_CLK_UNHALTED = CPU_CLK_UNHALTED(0) + Pmiss * BR_INST_RETIRED.MISPRED * 20cycles,
where CPU_CLK_UNHALTED(0) -- cycles it took whe codesize is under 32kb (no L1I misses),
Pmiss = MAX((codesize - L1Icachesize)/codesize, 0)

Formula expresses the fact that unpredicted misses induce 20cycle delay. Why could it be 20 cycles, not 10?

2) CYCLES_L1I_MEM_STALLED = 15 * L1I_MISSES (approx.)
I guess these counters should be propotional, but why 15?

3) When codesize in my prog is below 32kb and there is almost no L1 misses, I see that
RS_UOPS_DISPATCHED.CYCLES_NONE constitutes 1/6 of CPU_CLK_UNHALTED.CORE
Why it's not 0, I wonder?

(I've checked it when no jumps occures, and code is executed sequentially (like in my first prog from post N1), RS_UOPS_DISPATCHED.CYCLES_NONE was small then.)

Now I havethe only idea:frequent branch mispredictions, even if they don't induce L1 miss, it takes some cycles to clear up the state of the pipeline.
But RESOURCE_STALLS.BR_MISS_CLEANUP is about 1/1000 of CPU_CLK_UNHALTED,
and it's likely for some other reason. Why so, how do you think?

If you are interested, I will provide any necessary information: it's important to me to resolve these issues.

Thank you in advance,
Evgeny.

0 Kudos
Patrick_F_Intel1
Employee
6,388 Views

Hello Evgeny,
For your question: "it seems the prefetcher is loaded more when 2 threads are running at single core (comparing to 1), but the prefetcher is powerful enough to handle both, is it correct?"
Yes, the prefetcher can keep up with 2 threads.

For 1), 2) and 3), I'll have to get back to you. I'm checking with someone else but I think the answer is going to be "it depends on the code". I hope to be able to give you a more detailed answer than 'it depends'.

0 Kudos
Patrick_F_Intel1
Employee
6,388 Views

Here is some followup.

For 1) A branch mispredict may or may not cause an L1i miss but the mispredict will cause a pipeline flush (as the cpu backs out the changes done by the errant code path). The penalty depends to some degree on the code and a 15-20 clocktick penalty is typical.

Are branch mispredicts > than L1i_misses? You may be 'charging' mispredicts for some mispredicts that got L1i_misses and other mispredicts which hit in the L1i.

You might want to refer to Intel 64 and IA-32 Architectures Optimization Reference Manual, http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html, section B.2.1.1 which discusses branch mispredicts (but on the nehalem chips). You can search the pdf for other references to 'mispredict'.

For 2) CYCLES_L1I_MEM_STALLED counts the clocks for the L1i miss and the TLB misses (if any). Part of the cost of the L1i_miss will be the branch mispredict time too. The L2 misslatency is 10 cycles.

If you are trying to do a stall cycle accounting you should read section B.2 of the above manual.

For 3) RS_UOPS_DISPATCHED.CYCLES_NONE count the clockticks were no UOPs are dispatched. The stall cycles can be for many reasons. Branch mispredicts will stall the cpu. Memory fetches can stall the cpu (even though that is probably not the case here). There are many, many reasons and the cpu tries its hardest to avoid the stalls. You might have long instructions waiting for the only port that can handle that type of instruction (but this is just speculation).

If you read section B.3 of the manual you will see (for Sandy Bridge) a pretty in-depth discussion of stalls and how to do stall cycle accounting.

Hope this helps.
Pat
0 Kudos
Reply