I'm using the Intel compiler beta version 13 update 2 and am currently trying to get the most critical bottleneck loops auto-vectorized. I get the following output in release mode, and I find the messages a bit confusing:
/Users/bravegag/code/fastcode_project/code/src/sfo_matrix.h(1215): (col. 6) warning #13379: loop was not vectorized with "simd"
/Users/bravegag/code/fastcode_project/code/src/sfo_matrix.h(1215): (col. 6) remark: SIMD LOOP WAS VECTORIZED.
How can I tell whether it was vectorized or not? Can anyone please advise how I can get the compiler to show me the assembly, and how I can check whether a specific loop in my code was auto-vectorized? Is there a tutorial somewhere for this?
The -S option generates assembly output. It looks like you may have vector and non-vector versions of the loop for various cases. If you profile with VTune or an oprofile-based profiler, you can see where the time is spent, or you might simply time the loop to see whether it is faster than the non-vectorized case.
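As a rough illustration of the "simply time the loop" suggestion, here is a minimal C sketch of a timing harness. The kernel, the array size N and the function names are made up for the example and are not taken from sfo_matrix.h. The idea is to build the same file twice, once with vectorization enabled and once with it disabled (I believe the Intel compiler's -no-vec switch does this, but please check the documentation for your version), and compare the reported times.

```c
#include <stddef.h>
#include <time.h>

#define N 4096  /* hypothetical problem size, chosen only for illustration */

static double va[N], vb[N];

/* Hypothetical kernel standing in for the loop under test.
   It returns a checksum so the compiler cannot discard the work. */
double run_kernel(double k, int reps) {
    for (size_t i = 0; i < N; ++i) {
        va[i] = (double)i;
        vb[i] = 1.0;
    }
    for (int r = 0; r < reps; ++r) {
        for (size_t i = 0; i < N; ++i) {
            vb[i] += k * va[i];   /* candidate for auto-vectorization */
        }
    }
    double sum = 0.0;
    for (size_t i = 0; i < N; ++i) {
        sum += vb[i];
    }
    return sum;
}

/* Runs the kernel, stores its checksum in *result and
   returns the CPU time used, in seconds. */
double time_kernel(double k, int reps, double *result) {
    clock_t t0 = clock();
    *result = run_kernel(k, reps);
    clock_t t1 = clock();
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

Use a large enough reps value that the run takes well over a few milliseconds, since clock() has coarse resolution. A timing difference only suggests that a vectorized path was taken; the assembly from -S remains the more direct check.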
Thank you for your answer. What do you mean? That the compiler generates both vectorized and non-vectorized code for the same loop? Which one will it execute? I find it a bit tricky to tell whether it vectorizes by looking at the elapsed time ... I think using the -S option and checking the assembly would be a better choice ...
Indeed, the compiler can generate different versions of the same code, to some extent. Imagine a function that takes two pointers as parameters, where the data they point to may depend on each other. If you don't use the keyword "restrict", the compiler has to assume that the pointers can overlap, and overlapping restricts vectorization. What the compiler does in that case is add some code that tests for overlapping pointers at run-time. Accordingly, it generates code that deals with either result (so-called multi-version code). Compiler heuristics, however, limit this to a practical extent. BTW, you can override this by using "-opt-multi-version-aggressive" (Linux* & Mac OS* X) or "/Qopt-multi-version-aggressive" (Windows*).
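To make the run-time overlap test more concrete, here is a hand-written C sketch of what multi-versioning amounts to. This is not the code the Intel compiler actually emits; the function names (ranges_overlap, scale_add and its two variants) are invented for illustration, and comparing pointers through uintptr_t assumes a flat address space, which holds on common platforms.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the ranges [a, a+n) and [b, b+n) of doubles overlap.
   Assumes a flat address space so integer comparison is meaningful. */
int ranges_overlap(const double *a, const double *b, size_t n) {
    uintptr_t pa = (uintptr_t)a;
    uintptr_t pb = (uintptr_t)b;
    uintptr_t bytes = n * sizeof(double);
    return pa < pb + bytes && pb < pa + bytes;
}

/* Fast version: "restrict" promises the compiler that dst and src
   never alias, so the loop is free to be vectorized. */
void scale_add_noalias(double *restrict dst, const double *restrict src,
                       double k, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        dst[i] += k * src[i];
    }
}

/* Safe version: dst and src may alias, so the loop must behave
   as if it ran strictly in order. */
void scale_add_alias(double *dst, const double *src, double k, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        dst[i] += k * src[i];
    }
}

/* Dispatcher: the run-time test that picks one of the two versions. */
void scale_add(double *dst, const double *src, double k, size_t n) {
    if (ranges_overlap(dst, src, n)) {
        scale_add_alias(dst, src, k, n);
    } else {
        scale_add_noalias(dst, src, k, n);
    }
}
```

If you know the pointers never overlap, declaring them "restrict" yourself removes the need for the run-time test and the second version altogether.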
So, what you're seeing is that the compiler generated two versions of your code: one is vectorized, the other is not. Admittedly, the messages are not easy to understand; we're currently working on improving the vectorization report messages for future releases.
Another hint regarding Intel® VTune™ Amplifier XE: Depending on your processor and the selected SIMD feature, you can count the (issued!) packed instructions for SSE (e.g. FP_COMP_OPS_EXE.SSE_PACKED_[SINGLE|DOUBLE]) and AVX (e.g. SIMD_FP_256.PACKED_[SINGLE|DOUBLE]). The higher the share of such instructions in the overall instruction count, the better the vectorization. Please keep in mind that those events only count "issued" instructions, AFAIK. Anyway, for a basic analysis of whether vectorization was used or not, it is accurate enough.