I have a question about branch prediction on the Sandy Bridge platform.
It is known that recent Intel architectures have a loop buffer after the decoders, which lets the pipeline's front end be shut down when the loop size doesn't exceed a certain bound.
My question is: what if the loop fits into the loop buffer and has a branch inside? Will the buffer stay active only until the branch is encountered, or is there any interaction with the branch predictor?
There can be branches inside a loop that is executed by the loop stream detector (LSD).
The Intel 64 and IA-32 Architectures Optimization Reference Manual lists in section 2.1.2 the necessary conditions:
The loops with the following attributes qualify for LSD/micro-op queue replay:
Up to eight chunk fetches of 32-instruction-bytes
Up to 28 micro-ops (~28 instructions)
All micro-ops are also resident in the Decoded ICache
Can contain no more than eight taken branches and none of them can be a CALL or RET
Cannot have mismatched stack operations. For example, more PUSH than POP instructions.
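As an illustration of the kind of loop these conditions are about, here is a minimal sketch in C of a small loop with a branch in its body. The function name count_negative and the data are hypothetical, not taken from the manual. Whether the loop is actually replayed from the LSD depends on the machine code the compiler emits (it may even turn the if into branch-free code) and on the specific microarchitecture, so that has to be checked with hardware performance counters rather than read off the source.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: count the negative elements of an array.
 * The loop body is only a handful of instructions, has no CALL/RET and
 * no PUSH/POP, and contains at most two taken branches per iteration:
 * the conditional one for the if and the backward loop branch. */
size_t count_negative(const int32_t *data, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (data[i] < 0) {   /* branch inside the loop body */
            count++;
        }
    }
    return count;
}
```

A loop like this stays well under the 28 micro-op and eight-taken-branch limits quoted above, so the branch by itself should not disqualify it.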
For GNU compilers, I have found the options -funroll-loops --param max-unroll-times=4 effective with the Loop Stream Detector implementation from Nehalem onward. Small loops can be unrolled by 4 to good effect and still hit the loop stream detector or the micro-op cache (even when there is an if block), eliminating the need for more aggressive unrolling.
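To make "unrolled by 4" concrete, here is a hand-written sketch in C of roughly what such unrolling amounts to for a small loop with an if block. The function sum_positive is a hypothetical example; in practice the compiler performs this transformation on its own intermediate code when given the options above, and the result it produces may differ in detail.

```c
#include <stddef.h>

/* Hypothetical kernel: sum only the positive elements of x.
 * Hand-unrolled by 4 to mimic what
 *   gcc -O2 -funroll-loops --param max-unroll-times=4
 * might do; the compiler's actual output is not guaranteed to match. */
double sum_positive(const double *x, size_t n)
{
    double sum = 0.0;
    size_t i = 0;

    /* Main loop: four elements per iteration, one backward branch. */
    for (; i + 4 <= n; i += 4) {
        if (x[i]     > 0.0) sum += x[i];
        if (x[i + 1] > 0.0) sum += x[i + 1];
        if (x[i + 2] > 0.0) sum += x[i + 2];
        if (x[i + 3] > 0.0) sum += x[i + 3];
    }

    /* Remainder loop: handles the last n % 4 elements. */
    for (; i < n; i++) {
        if (x[i] > 0.0) sum += x[i];
    }
    return sum;
}
```

Even after unrolling, the main loop body has at most five taken branches per iteration (four conditionals plus the loop branch), which is still within the limit of eight quoted from the manual.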