Running into some issues with Composer XE 13 Beta code generation for AVX.
Specifically, in release builds on x86 it's not 32-byte-aligning function-scope AVX data on the stack. This causes an illegal instruction when the program executes.
This issue does not occur:
* When using the Microsoft C++ compiler for x86 or x64 (though they have their own issues...)
* When using the Intel C++ compiler for x64, provided AVX code generation is switched on in the options.
Compiler version: 2013_beta_0.060
OS: Windows 8 release preview
CPU: Sandy Bridge i5-2500
Architecture: x86
Is it really an "illegal instruction" error? If so, the code you built (with AVX enabled via option "/QxAVX") is being executed on a system that lacks the AVX instruction set extension. I doubt that is the case, though, because "/QxAVX" usually adds a test to the main routine that checks whether the underlying processor can execute AVX instructions; if it cannot, the application refuses to run. If it's a library there won't be such a test, though.
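To illustrate the kind of startup test mentioned above, here is a minimal sketch of a runtime AVX check. It is not the code the Intel compiler actually inserts; `cpu_supports_avx` and `run_if_avx` are hypothetical names, and the sketch assumes either an MSVC-compatible compiler (`__cpuid`, `_xgetbv`) or GCC/Clang on x86 (`__builtin_cpu_supports`).

```cpp
#include <cassert>
#include <cstdio>

#if defined(_MSC_VER)
#include <intrin.h>     // __cpuid
#include <immintrin.h>  // _xgetbv
#endif

// Hypothetical helper: true if both the CPU and the OS support AVX.
inline bool cpu_supports_avx() {
#if defined(_MSC_VER)
    int info[4];
    __cpuid(info, 1);                                   // feature flags in ECX
    const bool osxsave = (info[2] & (1 << 27)) != 0;    // OS uses XSAVE/XRSTOR
    const bool avx     = (info[2] & (1 << 28)) != 0;    // CPU implements AVX
    if (!(osxsave && avx)) return false;
    return (_xgetbv(0) & 0x6) == 0x6;                   // OS saves XMM and YMM state
#elif defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    return __builtin_cpu_supports("avx") != 0;
#else
    return false;                                       // unknown platform: assume no AVX
#endif
}

// Hypothetical guard: run the AVX entry point only if AVX is usable,
// otherwise report the problem and return a nonzero exit code.
inline int run_if_avx(int (*avx_entry)()) {
    assert(avx_entry != nullptr);
    if (!cpu_supports_avx()) {
        std::fputs("This program requires a CPU and OS with AVX support.\n", stderr);
        return 1;
    }
    return avx_entry();
}
```

A library has no main routine to host such a guard, so the caller of the library would have to perform a check like this itself.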
Another note: It's OK that unaligned loads/stores are used, even if the data is aligned. The reason is that 2nd and 3rd generation Intel Core processors execute aligned and unaligned loads/stores in the same number of cycles when the data accessed is actually aligned. That's an optimization of the underlying HW. The compiler favors the unaligned accesses because they're more flexible/portable (e.g. when using 3rd party libraries).
Edit: To be sure, would it be possible for you to create a small test case/reproducer? I'm kind of afraid that you're seeing this problem only with complex code... like in the other thread above. One indicator of whether we're facing the same problem would be to compile without optimization. If I'm right you shouldn't see the problem anymore. Can you verify this?
Hoping to clarify what is being reported here: I think OP is saying that __m256 data types don't get default 32-byte alignment unless AVX option is on, which can generate alignment fault when an AVX compiled function is called by a non-AVX one. If the Microsoft compiler requires __declspec(align(32)) in order for a function to call an ICL compiled AVX function, one hopes the same thing will work for ICL. As OP implies, it could be useful if declspec were not required. It would be useful to post a minimal example demonstrating the problem on premier.intel.com in order to get an assessment from the compiler team whether this situation could be improved upon. Georg seems to be diverting the subject into the question of how the compiler uses unaligned instructions in many cases even though it expects alignment. Even if the case in question were made to run correctly by such means, there could be a hidden performance penalty if the alignment isn't corrected.
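As a starting point for such a minimal example, the sketch below requests 32-byte alignment explicitly for a function-scope buffer and checks whether the request was honoured. `ALIGN32`, `is_aligned` and `local_buffer_is_32_byte_aligned` are hypothetical names; the macro assumes `__declspec(align(32))` on MSVC-compatible compilers and C++11 `alignas(32)` elsewhere. It uses a plain float array rather than `__m256`, so it only demonstrates the alignment request, not the compiler behaviour reported in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical portability macro for an explicit 32-byte alignment request.
#if defined(_MSC_VER)
#define ALIGN32 __declspec(align(32))
#else
#define ALIGN32 alignas(32)
#endif

// True if p is a multiple of `alignment`, which must be a power of two.
inline bool is_aligned(const void* p, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

// Declares a function-scope buffer the size of one 256-bit register
// and reports whether the compiler placed it on a 32-byte boundary.
inline bool local_buffer_is_32_byte_aligned() {
    ALIGN32 float buffer[8] = {0.0f};
    return is_aligned(buffer, 32);
}
```

Calling `local_buffer_is_32_byte_aligned()` from both an AVX-compiled and a non-AVX-compiled translation unit would be one way to see whether the two builds treat stack alignment differently.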
Yes, you're right: it's a SIGSEGV, an access violation reading 0xFFFFFFFF.
The generated asm code is quite hard to disentangle, but it looks like it has something to do with parameter passing for an inline function. It looks like the temporary stack variables being generated to allow the data to be passed aren't aligned properly, so when the inlined code attempts to access them as if they were aligned, *boom*.
Sandy Bridge has a large penalty for 256-bit unaligned access which crosses a cache line boundary. The compiler normally splits such accesses down to 128-bit instructions when it expects frequent misalignment. Ivy Bridge is supposed to make a big improvement on misaligned moves, such that the single 256-bit instruction could be preferred over the split moves. As Georg pointed out, there should be no penalty for using movups in the case where the data are aligned.
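To make the cache-line point concrete, here is a small sketch that tests whether an access straddles a cache-line boundary. `crosses_cache_line` and `kCacheLine` are hypothetical names, and the 64-byte line size is an assumption that holds for Sandy Bridge and Ivy Bridge.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed cache-line size in bytes (64 on Sandy Bridge and Ivy Bridge).
constexpr std::size_t kCacheLine = 64;

// True if an access of `size` bytes starting at address `addr`
// touches two different cache lines.
inline bool crosses_cache_line(std::uintptr_t addr, std::size_t size) {
    assert(size > 0 && size <= kCacheLine);
    const std::uintptr_t first_line = addr / kCacheLine;
    const std::uintptr_t last_line  = (addr + size - 1) / kCacheLine;
    return first_line != last_line;
}
```

For example, a 32-byte access at offset 48 spans bytes 48 to 79 and crosses a line, whereas the two 16-byte halves at offsets 48 and 64 each stay within one line. That is the case the split into 128-bit moves is meant to help with, and a 32-byte-aligned 256-bit access never crosses a line at all.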
Thanks! It seems the best solution with the Intel compiler is to get rid of the inline asm and use the 128-bit VEX version of "vpslld" to perform the shift operations (I didn't realise at first that the 128-bit "vpslld" intrinsic is allowed under AVX whereas the 256-bit version is not; I'm new to all this stuff). Code now running nice & smooth.
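For anyone landing here later, a rough sketch of that approach: the 256-bit integer shift only arrived with AVX2, so under plain AVX the register is split into two 128-bit halves, each half is shifted with the SSE2 intrinsic `_mm_sll_epi32` (which is emitted as VEX-encoded "vpslld" in AVX code), and the halves are recombined. This is not the poster's actual code. The function names are hypothetical, and the `target("avx")` attribute and `__builtin_cpu_supports` are GCC/Clang features assumed here so the sketch builds without extra flags; with the Intel compiler one would instead build with "/QxAVX". Other compilers fall back to the scalar loop.

```cpp
#include <cassert>
#include <cstdint>

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#include <immintrin.h>
#define HAVE_AVX_SKETCH 1

// Compiled for AVX only; callers must check CPU support before calling.
__attribute__((target("avx")))
static void shift_left_8x32_avx(const std::uint32_t* in, std::uint32_t* out, int count) {
    __m256i v  = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(in));
    __m128i lo = _mm256_castsi256_si128(v);        // lower four 32-bit lanes
    __m128i hi = _mm256_extractf128_si256(v, 1);   // upper four 32-bit lanes
    __m128i n  = _mm_cvtsi32_si128(count);         // shift count in an XMM register
    lo = _mm_sll_epi32(lo, n);                     // 128-bit shift of the low half
    hi = _mm_sll_epi32(hi, n);                     // 128-bit shift of the high half
    __m256i r = _mm256_insertf128_si256(_mm256_castsi128_si256(lo), hi, 1);
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(out), r);
}
#endif

// Portable reference implementation, also used as the fallback path.
static void shift_left_8x32_scalar(const std::uint32_t* in, std::uint32_t* out, int count) {
    for (int i = 0; i < 8; ++i) out[i] = in[i] << count;
}

// Shifts eight 32-bit integers left by `count` bits (0 <= count < 32).
void shift_left_8x32(const std::uint32_t* in, std::uint32_t* out, int count) {
    assert(count >= 0 && count < 32);
#ifdef HAVE_AVX_SKETCH
    if (__builtin_cpu_supports("avx")) {
        shift_left_8x32_avx(in, out, count);
        return;
    }
#endif
    shift_left_8x32_scalar(in, out, count);
}
```

The unaligned load and store are used deliberately so the sketch does not depend on how the caller's buffers are aligned.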