Disabling Vectorization in Windows

encoder · ‎06-08-2012

I am using Visual Studio 2010 as my IDE with Intel Composer integrated into it. This allows me to compile using the Intel C compiler.
My objective is to compare performance when running using /O2 optimization with vectorization enabled and /O2 optimization with vectorization disabled.
To achieve the latter configuration I have set my optimization level to /O2 and added /Qvec- to the command line of all the projects in my application. The build log confirms this.

However, the assembly files of the compiled code still contain references to xmm registers in most places (I'm assuming this is because SIMD vectorization is still happening). I've cleaned and rebuilt a few times, and I'm pretty sure this is fresh assembly code.
There is hardly any change in performance when compared to the configuration where I had not included /Qvec- in the command line, so I'm pretty sure it's vectorizing pretty much everywhere, just as before.

Is there anything else I need to check, enable, or disable to make sure vectorization does not occur? Please note that I would still like O2-level optimization, just without vectorizing my code.

Thanks.

Sukruth_H_Intel · ‎06-08-2012

Hi,
Could you please send me the ".asm" files generated with and without vectorization enabled? I ask because I tried this sample code:

#include <stdio.h>

int main()
{
    int a[100], b[100], c[100], i;
    for (i = 0; i < 100; i++)
        b[i] = c[i] = i;
    for (i = 0; i < 100; i++)
        a[i] = b[i] + c[i];
    printf("%d\n", a[20]);
    return 0;
}

I used the following command line:
"icl /O2 /Qvec- ss.c /FA" and I could not see any XMM registers being used. In my opinion, to verify whether the code was vectorized we can check how the data is loaded into the registers.
Something like this, which is a vectorized case:

.B1.2:: ; Preds .B1.2 .B1.9
movdqa XMMWORD PTR [32+rsp+rdx*4], xmm0 ;7.6
movdqa XMMWORD PTR [432+rsp+rdx*4], xmm0 ;7.1
add rdx, 4 ;6.1
paddd xmm0, xmm1 ;7.1
cmp rdx, 100 ;6.1
jb .B1.2

In the asm code above I can see how the computation is done (128-bit stores and a packed add, four elements per iteration), so I regard it as vectorized.
I hope the explanation is clear. Please feel free to let me know if you have any further queries.

Thanks & Regards,
Sukruth H.V

Sukruth_H_Intel · ‎06-08-2012

Hi,
It would be great if you could also attach the test case, so that I can compile it and check the asm on my machine. You can mark the post "Private" if you don't want your code to be seen by others.

Thanks & regards,
Sukruth H.V

TimP · ‎06-08-2012

If you want a high level report on the difference which /Qvec- makes, please examine the result with /Qvec-report. It's entirely possible that you don't have significant vectorization where it could make a difference in your application.
When looking at asm, vectorized code is distinguished by parallel instructions which access memory.
Given that you are using the normal SSE2 code generation, the same registers are used with or without vectorization, and there are even parallel register moves to avoid a phantom dependency on the contents of the unused slots. You ought to see a similar thing in your MSVC code if you set /arch:SSE2 (needed only for the 32-bit compiler).
If you're trying to retrace history, the requirement for parallel register moves in non-vector code was first documented publicly by AMD, but it applies to all SSE code. For a long time, the incorrect code with unexpected stalls continued to be generated by compilers targeting P-III.
The Intel compilers have dropped P-III support anyway. If you want code to run on a P-III compatible CPU, you must use /arch:IA32 with current Intel compilers, in which case you should not see any use of xmm registers. Such code should be significantly slower than SSE2 non-vector code, at least on AMD CPUs. Pentium M was the last CPU which had a preference for avoiding SSE scalar code.
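
To make the packed-versus-scalar distinction concrete, here is a minimal sketch, not taken from the original application. The function names and the file name sum.c are hypothetical. The `#pragma novector` line is an Intel compiler pragma that asks for a single loop to be left unvectorized; compilers that do not know it normally ignore it. The instruction names in the comments are what one would typically expect from SSE2 code generation, and the actual output depends on compiler version and options.

```c
#include <stddef.h>

/* Hypothetical example loop: a natural candidate for the vectorizer.
 * When vectorized, expect packed instructions that touch memory,
 * such as movups/movaps loads and stores and addps (four floats at once). */
void add_arrays(float *a, const float *b, const float *c, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}

/* Same loop with vectorization suppressed for this loop only.
 * The xmm registers are typically still used, but through scalar
 * instructions such as movss and addss (one float at a time). */
void add_arrays_novec(float *a, const float *b, const float *c, size_t n)
{
    size_t i;
#pragma novector /* Intel-specific pragma; an assumption if another compiler is used */
    for (i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}
```

Compiling this once with "icl /O2 /c /FA sum.c" and once with "icl /O2 /Qvec- /c /FA sum.c" and comparing the two listings should show xmm registers in both. The thing to compare is whether the loop body uses packed (ps/pd, movdqa/paddd) or scalar (ss/sd) forms.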

SergeyKostrov · ‎06-09-2012

Quoting encoder

...Is there anything else I need to check/ enable/ disable to make sure vectorization does not occur...

When working on integration of the Intel C++ compiler I noticed differences in performance between Debug and
Release configurations when some optimizations are disabled. The same applies to the Microsoft C++ compiler.
So, did you try to verify results for both configurations? Also, are you building a 32-bit or 64-bit application?

Best regards,
Sergey

Brandon_H_Intel · ‎06-11-2012

Hi encoder,

I think there's a little confusion here. In addition to the vectorizer, which generates packed Intel Streaming SIMD Extensions (SSE) code, the compiler also generates scalar SSE, so even non-loop code can result in SSE instructions. If you're using a 32-bit compiler on Windows* or Linux*, you can use the /arch:IA32 or -mia32 options to disable this. Otherwise there is no option to disable it, as the minimum instruction sets for the other platforms all support some version of SSE.
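
As a small illustration of the scalar SSE point, consider a function with no loop at all. This is only a sketch with a hypothetical name, and the instruction names in the comments are what is typically seen, not a guarantee for any particular compiler version.

```c
/* Hypothetical non-loop function: there is nothing here for the
 * vectorizer to work on, so /Qvec- should not change its code.
 *
 * On 64-bit Windows the two double arguments arrive in xmm0 and xmm1
 * (that is part of the x64 calling convention), and the arithmetic is
 * typically done with scalar SSE2 instructions such as subsd, mulsd
 * and addsd.
 *
 * With a 32-bit build and /arch:IA32 the same source would instead be
 * expected to use x87 instructions and no xmm registers. */
double midpoint(double lo, double hi)
{
    return lo + (hi - lo) * 0.5;
}
```

So seeing xmm registers in a listing for code like this says nothing about whether vectorization is on or off.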