I'd like to share some results of a performance evaluation of the Intel(R) C++ Compiler XE Version 12 and the Microsoft (R) C/C++ Optimizing Compiler Version 14. The tests use a C++ template function for matrix multiplication. Please see the next post for results.
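For readers who want a concrete picture of the kind of function under test, here is a minimal sketch of a template matrix multiplication. The actual test code isn't shown in this thread, so the name MatMul, the row-major layout, and the square-matrix restriction are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the benchmarked function (not the original
// test code): computes C = A * B for square n x n matrices stored
// row-major in contiguous arrays.
template <typename T>
void MatMul(const T* a, const T* b, T* c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            T sum = T(0);
            // Dot product of row i of A with column j of B.
            for (std::size_t k = 0; k < n; ++k) {
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}
```

A call would look like MatMul(a.data(), b.data(), c.data(), n) with three std::vector<T> buffers of n * n elements each.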
With options /O2 and /O3, the Intel C++ compiler outperformed the Microsoft C++ compiler by 43% and 44%, respectively.
In my experience, the Microsoft compiler doesn't perform any vectorization beyond what ICL performs at /fp:source or /fp:precise; this appears consistent with your result. As far as I've been able to find, the Microsoft compiler doesn't make use of restrict or pragma assertions in its optimizations, but it does fairly well at /arch:AVX under those restrictions.
If you dictated all important optimizations by intrinsics, I'd expect it to be possible to bring the Microsoft compiler up to parity. Even without intrinsics, the Intel compiler is now clever about unroll_and_jam optimizations for matrix multiplication, even getting up into the size range where OpenMP is useful. For larger matrices, of course, you would likely use MKL or (with MSVC) some other performance library.
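To illustrate what unroll-and-jam means for matrix multiplication, here is a hand-written sketch of the transformation. It is not the compiler's actual output: the name MatMulUnrollJam, the unroll factor of 2, and the row-major square layout are assumptions chosen to keep the example short. The outer i loop is unrolled by two and the two copies of the inner loops are fused ("jammed"), so each element of B that is loaded gets used for two rows of A.

```cpp
#include <cstddef>
#include <vector>

// Illustrative only: manual unroll-and-jam (factor 2) of the outer loop
// of C = A * B for square n x n row-major matrices.
template <typename T>
void MatMulUnrollJam(const T* a, const T* b, T* c, std::size_t n) {
    std::size_t i = 0;
    // Process two rows of A per iteration.
    for (; i + 1 < n; i += 2) {
        for (std::size_t j = 0; j < n; ++j) {
            T sum0 = T(0);
            T sum1 = T(0);
            for (std::size_t k = 0; k < n; ++k) {
                const T bkj = b[k * n + j];  // loaded once, used twice
                sum0 += a[i * n + k] * bkj;
                sum1 += a[(i + 1) * n + k] * bkj;
            }
            c[i * n + j] = sum0;
            c[(i + 1) * n + j] = sum1;
        }
    }
    // Remainder row when n is odd.
    for (; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            T sum = T(0);
            for (std::size_t k = 0; k < n; ++k) {
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}
```

Each accumulator sums its products in the same order as the plain triple loop, so the results should match it exactly; the gain, if any, comes from fewer loads of B per multiply-add. Whether this helps on a given compiler and matrix size is something to measure rather than assume.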