I have a complex code that calls many Fortran and MKL functions, and I need to optimize it. As a first step, I started with the compiler optimization level -O3. I compiled the code once with MKL and the Intel compilers version 14, and once with version 16. Using the verbose flag to get more information, I found that each version vectorizes a different set of loops. Although v14 vectorizes more loops, the code compiled with v16 performs better because more of the critical loops are vectorized. I expected v16 to vectorize more loops, or at least every loop that v14 vectorized, but this is not the case. I would like to know why this happens, and whether anybody has experience with over-vectorization slowing code down.
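To make the comparison between the two compilers concrete, here is a minimal sketch of how the two sets of vectorized loops could be compared. It assumes the loop locations have already been pulled out of each compiler's report by hand as "file:line" strings. The file names, line numbers, and the helper `compare_vectorized` are all hypothetical and only illustrate the idea; they are not taken from my actual code or from any report format.

```python
# Hypothetical loop locations ("file:line") copied by hand from each compiler's
# vectorization report. These values are made up purely for illustration.
vectorized_v14 = {"solver.f90:120", "solver.f90:245", "mesh.f90:88", "io.f90:40"}
vectorized_v16 = {"solver.f90:120", "kernel.f90:310", "mesh.f90:88"}


def compare_vectorized(old, new):
    """Split loops into those vectorized by both compilers, only the old one,
    and only the new one, using plain set operations."""
    return {
        "both": sorted(old & new),
        "only_old": sorted(old - new),
        "only_new": sorted(new - old),
    }


result = compare_vectorized(vectorized_v14, vectorized_v16)
for key, loops in result.items():
    print(f"{key}: {loops}")
```

The loops listed under `only_old` and `only_new` are the ones worth timing individually, since they are where the two builds actually differ.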
I should add that I tested the code on Intel(R) Xeon(R) CPU E3-1240 v3 @ 3.40GHz processors. Overall, v16 performed better than v14. In some cases (depending on the input options, which trigger different parts of the code), performance improved by up to 33% compared to v14.
Any other hints in this respect are greatly appreciated.