topic [Sandy-bridge loop buffer] in Software Tuning, Performance Optimization & Platform Monitoring

[Sandy-bridge loop buffer]

zakaria-bendifallah — Fri, 16 Dec 2011 10:03:45 GMT

Hello all,

I have a question about branch prediction on the Sandy Bridge platform.

Well, it is known that the latest Intel architectures have a loop buffer after the decoder, which lets the pipeline's front end be powered down as long as the loop size doesn't exceed a certain bound.

My question is: what if the loop fits into the loop buffer and has a branch inside? Will the buffer stay active only until the branch is encountered, or is there any coupling with the branch predictor?

Zakaria

Thomas_W_Intel — Fri, 16 Dec 2011 13:40:22 GMT

Zakaria,

there can be branches inside a loop that is executed by the loop stream detector (LSD).

The Intel 64 and IA-32 Architectures Optimization Reference Manual lists in section 2.1.2 the necessary conditions:

The loops with the following attributes qualify for LSD/micro-op queue replay:
Up to eight chunk fetches of 32-instruction-bytes
Up to 28 micro-ops (~28 instructions)
All micro-ops are also resident in the Decoded ICache
Can contain no more than eight taken branches and none of them can be a CALL or RET
Cannot have mismatched stack operations. For example, more PUSH than POP instructions.
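
As a rough illustration, the conditions listed above can be written as a single predicate. This is only a sketch: the `loop_profile` struct, its field names, and `lsd_may_qualify` are hypothetical names invented here, and the counts would have to come from inspecting the generated code. The real hardware may apply further restrictions that are not modelled.

```c
#include <stdbool.h>

/* Hypothetical summary of one loop body. The field names are
 * illustrative only; they do not come from the Intel manual. */
struct loop_profile {
    unsigned fetch_chunks_32b;      /* 32-byte instruction fetch chunks used */
    unsigned micro_ops;             /* micro-ops in the loop body */
    bool     all_in_decoded_icache; /* every micro-op resident in Decoded ICache */
    unsigned taken_branches;        /* taken branches per iteration */
    bool     has_call_or_ret;       /* any CALL or RET inside the loop */
    unsigned pushes;                /* PUSH instructions in the loop */
    unsigned pops;                  /* POP instructions in the loop */
};

/* Returns true when the loop meets every listed condition.
 * A true result only means "not ruled out by the list". */
bool lsd_may_qualify(const struct loop_profile *p)
{
    return p->fetch_chunks_32b <= 8
        && p->micro_ops <= 28
        && p->all_in_decoded_icache
        && p->taken_branches <= 8
        && !p->has_call_or_ret
        && p->pushes == p->pops;   /* no mismatched stack operations */
}
```

For example, a loop of 20 micro-ops with three taken branches passes the check, while the same loop containing a CALL does not.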

Kind regards
Thomas

zakaria-bendifallah — Fri, 16 Dec 2011 14:51:20 GMT

Hi Thomas,

Sorry, I forgot to check the manual.
Well, up to 8 taken branches, this is just wonderful :)

Thanks a lot.

Best regards,
Zakaria

Thomas_W_Intel — Fri, 16 Dec 2011 15:01:45 GMT

Zakaria,

you are welcome. The Optimization Reference Manual is full of gems, but they are very easy to miss.

Kind regards
Thomas

TimP — Mon, 19 Dec 2011 12:21:37 GMT

One I missed is whether Loop Stream Detection differs in any way between Nehalem and Sandy Bridge. The micro-op cache on Sandy Bridge is intended to supplement Loop Stream Detection, as I understand it.
For GNU compilers, I have found the options -funroll-loops --param max-unroll-times=4 effective with the Loop Stream implementation from Nehalem on. Small loops may be unrolled by 4 to good effect and still hit the loop stream detector or the micro-op cache (even when there is an if block), eliminating the need for more aggressive unrolling.
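
To make this concrete, here is a minimal sketch of the kind of small loop with an if block that is meant. The function name `sum_above` and the file name in the comment are made up for illustration. Whether the unrolled loop really stays within the loop stream detector or micro-op cache limits depends on the compiler version and the code it generates, so that is an assumption to check on the target machine.

```c
#include <stddef.h>

/* Possible build line (sum.c is a hypothetical file name):
 *   gcc -O2 -funroll-loops --param max-unroll-times=4 -c sum.c
 */

/* Sums the elements of x[0..n-1] that are greater than threshold.
 * The loop body is small and contains a single if block. */
float sum_above(const float *x, size_t n, float threshold)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        if (x[i] > threshold)
            sum += x[i];
    }
    return sum;
}
```

Comparing the generated assembly with and without the --param setting shows how many copies of the loop body the compiler actually emitted.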