Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

[Sandy-bridge loop buffer]

zakaria-bendifallah
Hello all,


I have a question about branch prediction on the Sandy Bridge platform.

It is known that recent Intel architectures have a loop buffer after the decoder, which lets the pipeline's front end be shut down while a loop runs, provided the loop does not exceed a certain size.

My question is: what happens if the loop fits into the loop buffer but has a branch inside? Does the buffer stay active only until the branch is encountered, or is there some interaction with the branch predictor?


Zakaria
Thomas_W_Intel
Employee
Zakaria,

There can be branches inside a loop that is executed by the loop stream detector (LSD).

The Intel 64 and IA-32 Architectures Optimization Reference Manual lists the necessary conditions in section 2.1.2:

The loops with the following attributes qualify for LSD/micro-op queue replay:

Up to eight chunk fetches of 32-instruction-bytes
Up to 28 micro-ops (~28 instructions)

All micro-ops are also resident in the Decoded ICache

Can contain no more than eight taken branches and none of them can be a CALL or RET

Cannot have mismatched stack operations. For example, more PUSH than POP instructions.
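
As a rough illustration, the conditions quoted above can be written as a single predicate. This is only a sketch of the checklist, not a model of the hardware: the struct `loop_profile`, its field names, and the function `lsd_may_replay` are made up for this example, and the "mismatched stack operations" rule is simplified to "PUSH count equals POP count".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of one loop body; all names are illustrative. */
struct loop_profile {
    unsigned fetch_chunks;          /* 32-byte instruction fetch chunks spanned */
    unsigned micro_ops;             /* micro-ops in the loop body */
    bool     all_in_decoded_icache; /* every micro-op resident in the Decoded ICache */
    unsigned taken_branches;        /* taken branches per iteration */
    bool     has_call_or_ret;       /* any CALL or RET inside the loop */
    unsigned pushes;                /* PUSH instructions in the loop */
    unsigned pops;                  /* POP instructions in the loop */
};

/* Returns true when the loop meets every condition quoted from the manual.
 * Stack balance is approximated as pushes == pops (an assumption). */
bool lsd_may_replay(const struct loop_profile *loop)
{
    assert(loop != NULL);
    return loop->fetch_chunks <= 8
        && loop->micro_ops <= 28
        && loop->all_in_decoded_icache
        && loop->taken_branches <= 8
        && !loop->has_call_or_ret
        && loop->pushes == loop->pops;
}
```

For example, a 12-micro-op loop with one taken branch and no stack traffic passes this check, while the same loop with a CALL inside does not.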


Kind regards
Thomas
zakaria-bendifallah
Hi Thomas,

Sorry, I forgot to check the manual.
Up to eight taken branches, that is just wonderful :)

Thanks a lot.

Best regards,
Zakaria
Thomas_W_Intel
Employee
Zakaria,

You are welcome. The Optimization Reference Manual is full of gems, but they are very easy to miss.

Kind regards
Thomas
TimP
Honored Contributor III
One point I missed is whether Loop Stream detection differs in any way between Nehalem and Sandy Bridge. The micro-op cache on Sandy Bridge is intended to supplement Loop Stream detection, as I understand it.
For GNU compilers, I have found the options -funroll-loops --param max-unroll-times=4 effective with the Loop Stream implementation from Nehalem on. Small loops may be unrolled by 4 to good effect and still hit the loop stream detector or the micro-op cache (even when there is an if block), eliminating the need for more aggressive unrolling.
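
To make that concrete, here is the kind of small loop with an if block I have in mind. The function name `sum_positive` and the file name `sum.c` are just placeholders for this sketch; whether the unrolled loop actually fits the loop stream detector or micro-op cache depends on the code your gcc version generates, so check the disassembly or PMU counters rather than assuming it.

```c
#include <assert.h>
#include <stddef.h>

/* Sum only the positive elements: a short loop body with one conditional.
 * Intended to be built with something like:
 *   gcc -O2 -funroll-loops --param max-unroll-times=4 -c sum.c
 */
long sum_positive(const int *data, size_t n)
{
    assert(data != NULL || n == 0);
    long total = 0;
    for (size_t i = 0; i < n; ++i) {
        if (data[i] > 0)        /* the "if block" inside the loop */
            total += data[i];
    }
    return total;
}
```

The max-unroll-times parameter caps the unroll factor at 4, which is intended to keep the unrolled body small.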