(1) Version 11.0 of ICC generates a frame for main that is aligned to 128 bytes. This is intentional and is done for performance reasons, though not for this specific example. This is where all the uses of %rbp come from: it is used to save the value of %rsp before alignment. If the extra alignment is confusing, please put your code into a function other than main. Compare the prologue for main with 10.1:
subq $4198408, %rsp
vs. the 11.0 prologue:
movq %rsp, %rbp
andq $-128, %rsp
subq $4198400, %rsp
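The arithmetic behind the two prologues can be checked with a small C sketch. It models only the stack pointer arithmetic, not what ICC actually emits; the helper names are hypothetical, and any entry value of %rsp you feed it is a made-up example.

```c
#include <assert.h>
#include <stdint.h>

/* Models the 10.1 prologue: subq $4198408, %rsp */
uint64_t prologue_10_1(uint64_t rsp) {
    return rsp - 4198408u;
}

/* Models the 11.0 prologue (hypothetical helper).
 * *saved_rbp receives the value of %rsp before alignment. */
uint64_t prologue_11_0(uint64_t rsp, uint64_t *saved_rbp) {
    assert(saved_rbp != 0);
    *saved_rbp = rsp;          /* movq %rsp, %rbp     */
    rsp &= ~(uint64_t)127;     /* andq $-128, %rsp    */
    return rsp - 4198400u;     /* subq $4198400, %rsp */
}
```

Since 4198400 is a multiple of 128, the final subtraction in the 11.0 sequence keeps the 128-byte alignment established by the andq. The 10.1 sequence relies on %rsp being 8 bytes off a 16-byte boundary at entry (the return address has just been pushed), so subtracting 4198408 leaves it 16-byte aligned but not necessarily 128-byte aligned.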
(2) The -fno-builtin option causes the compiler not to expand intrinsic code inline. The code in the example doesn't use any intrinsics, which might suggest that the option should have no effect here. Interestingly, it does affect the way the pattern of setting memory to zero is recognized. This may be considered a "bug" in the sense that the behavior is not what one would expect, but I don't feel too strongly about it. What matters is that both variants are correct with respect to the semantics of the option, and both look reasonable in terms of performance. It is generally true that -fno-builtin results in smaller though slower code, but it is incorrect to assume that the code will always be smaller. If you are really interested in code size vs. performance, please use the -Os option.
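For illustration, the kind of zeroing pattern at issue might look like the C sketch below. The original source is not quoted in this reply, so the function names and the buffer type are assumptions. The first function is a hand-written loop that a compiler may recognize as a memory-zeroing idiom; the second is an explicit library call. Both have the same effect, which is why either code generation choice is semantically correct.

```c
#include <stddef.h>
#include <string.h>

/* Hand-written zeroing loop: a candidate for idiom recognition
 * (hypothetical example, not the code from the original report). */
void zero_with_loop(char *buf, size_t n) {
    for (size_t i = 0; i < n; ++i)
        buf[i] = 0;
}

/* Explicit call to the library routine. */
void zero_with_memset(char *buf, size_t n) {
    memset(buf, 0, n);
}
```

Compiling both with and without -fno-builtin and comparing the assembly (for example with -S) is a simple way to see how the option changes the generated code for each form.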
(3) The pushq %rsi that you refer to in (d) has nothing to do with parameter passing for the routine __sti__$E. It is just an easy way to adjust the stack pointer by 8 bytes so that it is aligned to a 16-byte boundary at the subsequent call, as the x86-64 ABI requires. The popq %rcx is its counterpart in the function epilogue.
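The alignment effect of that push/pop pair can be modeled with a few lines of C. This is only a sketch of the pointer arithmetic, with hypothetical helper names; the register being pushed is irrelevant, only the 8-byte adjustment matters.

```c
#include <stdint.h>

/* Returns nonzero if rsp sits on a 16-byte boundary. */
int is_aligned_16(uint64_t rsp) {
    return (rsp & 15) == 0;
}

/* Models pushq: the stack pointer drops by 8 bytes. */
uint64_t after_pushq(uint64_t rsp) {
    return rsp - 8;
}

/* Models popq: the stack pointer rises by 8 bytes. */
uint64_t after_popq(uint64_t rsp) {
    return rsp + 8;
}
```

At function entry the caller's call instruction has just pushed an 8-byte return address, so %rsp is 8 bytes off a 16-byte boundary. A single pushq brings it back onto the boundary before the next call, and the matching popq restores the entry value before ret.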
(4) I cannot reproduce the behavior you describe with regard to MOVNTDQ, and your asm snippets don't contain any MOVNTDQ. Maybe you have more details that you haven't shared?