I suppose the compiler developers forgot to update the documentation to read "DO loop or array assignment" back when they agreed to make the directive apply to the latter. If you are a stickler for terminology, the optimization reports shouldn't be referring to loops either. When I set /arch:SSE4.1 and remove the VECTOR ALWAYS directives, the compiler I currently have active reports vectorization of 8 of the 9 cross product assignments which involve arithmetic. I don't see an obvious reason why just 1 of those 9 should be reported as inefficient. For CORE-AVX2 (which we could run only on the SDE), 9 assignments are reported vectorized; only 3 of those are assignments which SSE4.1 also vectorizes. For SSE4.1, 21 assignments are reported as "completely unrolled", which is probably the way to go if the alignments aren't known and the compiler is trying to avoid inefficiency across a variety of CPU models. The AVX2 architecture evidently is designed to handle several such operations without concern about alignment.
The difference between SSE4.1 and 4.2? In my observation, the compiler is more likely to consider vectorization with split memory accesses, to remove alignment issues, when using the 4.2 option. Several Intel CPU models for which the architecture manuals recommend against split memory accesses do in fact benefit from them when the alignments vary.
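
For anyone who wants to reproduce the reports, here is a minimal sketch of the kind of cross product array assignments under discussion. The array shapes, the size n, and the program name are my own assumptions, not the original poster's code. The directive is Intel-specific; other compilers read it as a comment.

```fortran
program cross_product_demo
  implicit none
  integer, parameter :: n = 1000       ! hypothetical problem size
  real :: a(n,3), b(n,3), c(n,3)

  a = 0.0; a(:,1) = 1.0                ! every a vector is the x unit vector
  b = 0.0; b(:,2) = 1.0                ! every b vector is the y unit vector

  ! One array assignment per component; the directive applies to the
  ! array assignment that follows it, not only to a DO loop.
  !DIR$ VECTOR ALWAYS
  c(:,1) = a(:,2)*b(:,3) - a(:,3)*b(:,2)
  !DIR$ VECTOR ALWAYS
  c(:,2) = a(:,3)*b(:,1) - a(:,1)*b(:,3)
  !DIR$ VECTOR ALWAYS
  c(:,3) = a(:,1)*b(:,2) - a(:,2)*b(:,1)

  print *, c(1,:)                      ! x cross y should give the z unit vector
end program cross_product_demo
```

Building this with and without the directives, and with different /arch settings, should show in the optimization report which assignments are vectorized and which are unrolled. Expect the counts to vary by compiler version.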
When the compiler can verify that all args are aligned to 32 (64, 128, ...), it would generate a call to the user-specified aligned subroutine; when any arg has unknown alignment, the subroutine without alignment assumptions would be called instead. The user could extend this to all permutations of alignment.
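
Until a compiler does that selection for you, the same idea can be approximated by hand with a run-time check. The sketch below is that stand-in, not the compile-time mechanism described above. All names (align_dispatch, scale_vec and so on) and the 32-byte figure are hypothetical. ASSUME_ALIGNED is an Intel-specific directive, and its placement here is my best understanding of the documentation rather than something I have verified on every compiler version.

```fortran
module align_dispatch
  use, intrinsic :: iso_c_binding, only: c_loc, c_intptr_t
  implicit none
  integer, parameter :: align_bytes = 32   ! assumed alignment target
contains

  ! True when the first element of x sits on an nbytes boundary.
  logical function is_aligned(x, nbytes)
    real, target, intent(in) :: x(*)
    integer, intent(in) :: nbytes
    integer(c_intptr_t) :: addr
    addr = transfer(c_loc(x(1)), addr)
    is_aligned = mod(addr, int(nbytes, c_intptr_t)) == 0
  end function is_aligned

  ! Version that promises alignment to the compiler.
  subroutine scale_vec_aligned(x, n, s)
    integer, intent(in) :: n
    real, intent(inout) :: x(n)
    real, intent(in) :: s
    ! Intel-specific; other compilers read this as a comment.
    !DIR$ ASSUME_ALIGNED x: 32
    x = s*x
  end subroutine scale_vec_aligned

  ! Version that makes no alignment assumption.
  subroutine scale_vec_unaligned(x, n, s)
    integer, intent(in) :: n
    real, intent(inout) :: x(n)
    real, intent(in) :: s
    x = s*x
  end subroutine scale_vec_unaligned

  ! Dispatcher: picks the aligned version only when the check passes.
  subroutine scale_vec(x, n, s, used_aligned)
    integer, intent(in) :: n
    real, intent(inout) :: x(n)
    real, intent(in) :: s
    logical, intent(out) :: used_aligned
    used_aligned = is_aligned(x, align_bytes)
    if (used_aligned) then
      call scale_vec_aligned(x, n, s)
    else
      call scale_vec_unaligned(x, n, s)
    end if
  end subroutine scale_vec

end module align_dispatch

program dispatch_demo
  use align_dispatch
  implicit none
  real :: buf(17)
  logical :: fast

  buf = 1.0
  call scale_vec(buf, 16, 2.0, fast)       ! alignment of buf is up to the compiler
  print *, 'aligned path taken: ', fast
  call scale_vec(buf(2:), 16, 2.0, fast)   ! shifted by one element
  print *, 'aligned path taken: ', fast
end program dispatch_demo
```

Because the second call starts one element later than the first, at most one of the two calls can take the aligned path. Extending this to several arguments means checking each one, which is where the permutations of alignment come in.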