Hello all,
I have a question about branch prediction on the Sandy Bridge platform.
It is known that the latest Intel architectures have a loop buffer after the decoder, which allows the pipeline's front end to be shut off when the loop size doesn't exceed a certain bound.
My question is: what happens if the loop fits into the loop buffer and has a branch inside? Will the buffer stay active only until the branch is encountered, or is there any interaction with the branch predictor?
Zakaria
4 Replies
Zakaria,
there can be branches inside a loop that is executed by the loop stream detector (LSD).
The Intel 64 and IA-32 Architectures Optimization Reference Manual lists in section 2.1.2 the necessary conditions:
The loops with the following attributes qualify for LSD/micro-op queue replay:
- Up to eight chunk fetches of 32 instruction bytes
- Up to 28 micro-ops (~28 instructions)
- All micro-ops are also resident in the Decoded ICache
- Can contain no more than eight taken branches, and none of them can be a CALL or RET
- Cannot have mismatched stack operations, for example more PUSH than POP instructions
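To make the conditions above concrete, here is a minimal sketch in C that encodes them as a single predicate. The struct, its field names, and the function name are hypothetical and only illustrate the list; the hardware exposes no such interface, and real qualification may depend on details not captured here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of one loop body. The field names are
   illustrative and do not come from the manual. */
struct loop_profile {
    int  fetch_chunks;               /* 32-byte fetch chunks the loop spans */
    int  micro_ops;                  /* micro-ops in one iteration */
    int  taken_branches;             /* taken branches in one iteration */
    bool has_call_or_ret;            /* any CALL or RET inside the loop */
    bool all_uops_in_decoded_icache; /* every micro-op is also in the Decoded ICache */
    int  pushes;                     /* PUSH instructions per iteration */
    int  pops;                       /* POP instructions per iteration */
};

/* Returns true when the loop meets every condition quoted above.
   Assumption: "mismatched stack operations" is modelled as pushes != pops. */
bool lsd_may_qualify(const struct loop_profile *loop)
{
    assert(loop != NULL);
    return loop->fetch_chunks <= 8
        && loop->micro_ops <= 28
        && loop->all_uops_in_decoded_icache
        && loop->taken_branches <= 8
        && !loop->has_call_or_ret
        && loop->pushes == loop->pops;
}
```

A loop with, say, 20 micro-ops and two taken branches passes this check, while the same loop with a CALL inside does not.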
Kind regards
Thomas
Hi Thomas,
Sorry, I forgot to check the manual.
Well, up to 8 branches, this is just wonderful :)
Thanks a lot.
Best regards,
Zakaria
Zakaria,
you are welcome. The Optimization Reference Guide is full of gems, but they are very easy to miss.
Kind regards
Thomas
One I missed is the question of whether there is any distinction in Loop Stream detection from Nehalem to Sandy Bridge. The micro-op cache on Sandy Bridge is intended to supplement Loop Stream detection, as I understand it.
For GNU compilers, I have found the options -funroll-loops --param max-unroll-times=4 effective with the Loop Stream implementation from Nehalem on. Small loops may be unrolled by 4 to good effect and still hit the loop stream detector or the micro-op cache (even when there is an if block), eliminating the need for more aggressive unrolling.
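As an illustration of the kind of loop this applies to, here is a small C sketch with an if block in the body. The function name and the file name in the comment are made up for the example. Whether the unrolled loop actually stays within the loop stream detector or micro-op cache limits depends on the compiler version and target, so the generated assembly should be checked rather than assumed.

```c
#include <stddef.h>

/* Possible build line (clip.c is a hypothetical file name):
 *   gcc -O2 -funroll-loops --param max-unroll-times=4 -c clip.c
 */

/* Sums only the positive elements of a[0..n-1].
   The if block puts a conditional branch inside the small loop,
   unless the compiler turns it into branch-free code. */
double sum_positive(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        if (a[i] > 0.0) {
            sum += a[i];
        }
    }
    return sum;
}
```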