From my experiments, I found that only 6 stack slots are available to user code. Does the return stack buffer have two parts, one for the kernel and one for user code?
As far as I know, the currently measured throughput for jumps and predicted taken branches ranges from one branch per clock cycle to one branch per two clock cycles.
Predicted not taken branches have an even higher throughput, up to two branches per clock cycle. The high throughput of one taken branch per clock cycle was observed for up to 128 branches, with no more than one branch per 16 bytes of code.
Please let us know the details of your experiments so we can provide you with a more specific response.
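To make the kind of experiment in question concrete, here is a minimal Python sketch of how a return stack buffer depth is usually inferred from misprediction counts. It is a simplified model, not a description of the actual hardware: the function name `count_return_mispredictions`, the parameter `rsb_entries`, the overwrite-oldest policy, and the "no prediction when empty" behavior are all assumptions made for illustration.

```python
def count_return_mispredictions(call_depth, rsb_entries):
    """Model call_depth nested calls followed by call_depth returns.

    Assumptions (hypothetical, for illustration only):
    - each call pushes its return address onto the buffer;
    - when the buffer is full, the oldest entry is overwritten;
    - a return on an empty buffer gets no valid prediction.
    """
    rsb = []  # newest entry is last
    for return_addr in range(call_depth):
        if len(rsb) == rsb_entries:
            rsb.pop(0)  # assumed policy: drop the oldest entry
        rsb.append(return_addr)

    mispredicted = 0
    for return_addr in reversed(range(call_depth)):
        predicted = rsb.pop() if rsb else None
        if predicted != return_addr:
            mispredicted += 1
    return mispredicted


if __name__ == "__main__":
    # With an assumed 6-entry buffer, mispredictions start at depth 7.
    for depth in range(1, 11):
        print(depth, count_return_mispredictions(depth, rsb_entries=6))
```

In this model the first call depth that produces a misprediction is one more than the buffer size, which is how a measured limit such as 6 slots would show up. Real processors may behave differently on overflow and underflow, so the measurement method matters when comparing results.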
You can refer to this third-party website for more information: http://www.agner.org/optimize/microarchitecture.pdf (page 137, section 10.7, "Stack engine").*
*Other names and brands may be claimed as the property of others.