Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.
Announcements
FPGA community forums and blogs on community.intel.com are migrating to the new Altera Community and are read-only. For urgent support needs during this transition, please visit the FPGA Design Resources page or contact an Altera Authorized Distributor.

[Sandy-bridge loop buffer]

zakaria-bendifallah
1,550 Views
Hello all,


I have a question on the branch prediction in the sandy-bridge plateform.

Well, it is known that in the latest intel architectures there is a loop buffer after the decoder which turns down the pipeline's front-end if the loop size doesn't exceed a certain bound.

My question is what if the loop fits into the loop buffer and has a branch inside. Will the buffer be activated until the branch is encountered or is there any piping with the branch predictor ?


Zakaria
0 Kudos
4 Replies
Thomas_W_Intel
Employee
1,550 Views
Zakaria,

there can be branches inside a loop that is executed by the the loop stream detector.

The Intel 64 and IA-32 Architectures Optimization Reference Manual lists in section 2.1.2 the necessary conditions:

The loops with the following attributes qualify for LSD/micro-op queue replay:

Up to eight chunk fetches of 32-instruction-bytes
Up to 28 micro-ops (~28 instructions)

All micro-ops are also resident in the Decoded ICache

Can contain no more than eight taken branches and none of them can be a CALL or RET

Cannot have mismatched stack operations. For example, more PUSH than POP instructions.


Kind regards
Thomas
0 Kudos
zakaria-bendifallah
1,550 Views
Hi Thomas,

Sorry i forgot to check the manual.
Well, up to 8 branches, this is just wonderful :)

Thank you a lot.

Best regards,
Zakaria
0 Kudos
Thomas_W_Intel
Employee
1,550 Views
Zakaria,

you are welcome. The Optimization Reference Guide is full of gems, but they are very easy to miss.

Kind regards
Thomas
0 Kudos
TimP
Honored Contributor III
1,550 Views
One I missed is the question of whether there is any distinction in Loop Stream detection from Nehalem to Sandy Bridge. The micro-op cache on Sandy Bridge is intended to supplement Loop Stream detection, as I understand it.
For gnu compilers, I have found the option -funroll-loops --param max-unroll-times=4 effective with the Loop Stream implementation from Nehalem on. Small loops may be unrolled by 4 to good effect and still hit loop stream detector or micro-op cache (even when there is an if block), eliminating a need for more aggressive unrolling.
0 Kudos
Reply