ifort 13.0 generates unneeded code during vectorization

styc · ‎10-14-2012

Compile the attached source file with '-O3' and either '-xSSE4.2' or '-xAVX'. ifort 13.0 vectorizes the k-loop but also generates an unneeded scalar version. Since the loop count is always 4, in no case can that scalar version be used.

By the way, the compiler seems to be too aggressive in vectorization. It generates simulated gathers for accesses to the o array. In order to use VPSLLD, it uses three instructions to pack four integers into a vector, then uses another 7 instructions to unpack them into four GPRs. It would have been better to just use GPRs from the start and use SHL/LEA instead of VPSLLD.
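The attached source file is not reproduced in this thread. For reference, here is a minimal sketch of what the reduced case presumably looked like. It is inferred from the modified version posted later in the thread by undoing the pick lists, so the direct use of `o(i, 1 + k)` as a subscript of `c` is an assumption, not the verified attachment.

```fortran
! Hypothetical reconstruction of the reduced case, NOT the actual attachment.
! Assumption: identical to the pick-list version posted later in the thread,
! except that the o array is indexed directly inside the k-loop.
subroutine foo(c, ld, n, m, o, x)
  double precision, intent(in) :: c(0 : 3, 0 : 6)
  integer, intent(in) :: ld, n, m
  integer, intent(in) :: o(ld, m)
  double precision, intent(inout) :: x(ld, m)
  double precision t(0 : 3, 0 : 3), u, v, w
  integer i, k
  t = 0.d0
  do i = 4, n - 5, 3
    ! The trip count of this loop is always 4.
    do k = 0, 3
      u = t(k, 1)
      v = t(k, 2)
      w = t(k, 3)
      t(k, 1) = x(i, 1 + k)
      t(k, 2) = x(i + 1, 1 + k)
      t(k, 3) = x(i + 2, 1 + k)
      ! o(...) picks a column of c for each k: an indirect, gather-like access.
      x(i, 1 + k) = &
        c(0, o(i, 1 + k)) * t(k, 1)&
        + c(1, o(i, 1 + k)) * (w + t(k, 2))&
        + c(2, o(i, 1 + k)) * (v + t(k, 3))&
        + c(3, o(i, 1 + k)) * (u + x(i + 3, 1 + k))
      x(i + 1, 1 + k) = &
        c(0, o(i + 1, 1 + k)) * t(k, 2)&
        + c(1, o(i + 1, 1 + k)) * (t(k, 1) + t(k, 3))&
        + c(2, o(i + 1, 1 + k)) * (w + x(i + 3, 1 + k))&
        + c(3, o(i + 1, 1 + k)) * (v + x(i + 4, 1 + k))
      x(i + 2, 1 + k) = &
        c(0, o(i + 2, 1 + k)) * t(k, 3)&
        + c(1, o(i + 2, 1 + k)) * (t(k, 2) + x(i + 3, 1 + k))&
        + c(2, o(i + 2, 1 + k)) * (t(k, 1) + x(i + 4, 1 + k))&
        + c(3, o(i + 2, 1 + k)) * (w + x(i + 5, 1 + k))
    end do
  end do
end subroutine
```

Compiling a file like this with '-O3 -xAVX' and inspecting the generated assembly (for example with '-S') is one way to look for the scalar version and the VPSLLD sequence described above.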

styc · ‎10-18-2012

Just to be sure, is it that Intel is not interested in reports like this that are not bugs per se? If so, I will refrain from reporting something like this in the future.

Kevin_D_Intel · ‎10-18-2012

We apologize for the delayed reply. I overlooked it. We do appreciate reports such as this here. I will investigate your findings and post an update soon.

jimdempseyatthecove · ‎10-18-2012

Until the fix can be implemented, consider generating pick lists:

[fortran]
subroutine foo(c, ld, n, m, o, x)
  double precision, intent(in) :: c(0 : 3, 0 : 6)
  integer, intent(in) :: ld, n, m
  integer, intent(in) :: o(ld, m)
  double precision, intent(inout) :: x(ld, m)
  double precision t(0 : 3, 0 : 3), u, v, w
  integer i, k
  integer o0k(0:3), o1k(0:3), o2k(0:3)
  t = 0.d0
  do i = 4, n - 5, 3
    do k = 0, 3
      o0k(k) = o(i, 1 + k)
      o1k(k) = o(i+1, 1 + k)
      o2k(k) = o(i+2, 1 + k)
    end do
    do k = 0, 3
      u = t(k, 1)
      v = t(k, 2)
      w = t(k, 3)
      t(k, 1) = x(i, 1 + k)
      t(k, 2) = x(i + 1, 1 + k)
      t(k, 3) = x(i + 2, 1 + k)
      x(i, 1 + k) = &
        c(0, o0k(k)) * t(k, 1)&
        + c(1, o0k(k)) * (w + t(k, 2))&
        + c(2, o0k(k)) * (v + t(k, 3))&
        + c(3, o0k(k)) * (u + x(i + 3, 1 + k))
      x(i + 1, 1 + k) = &
        c(0, o1k(k)) * t(k, 2)&
        + c(1, o1k(k)) * (t(k, 1) + t(k, 3))&
        + c(2, o1k(k)) * (w + x(i + 3, 1 + k))&
        + c(3, o1k(k)) * (v + x(i + 4, 1 + k))
      x(i + 2, 1 + k) = &
        c(0, o2k(k)) * t(k, 3)&
        + c(1, o2k(k)) * (t(k, 2) + x(i + 3, 1 + k))&
        + c(2, o2k(k)) * (t(k, 1) + x(i + 4, 1 + k))&
        + c(3, o2k(k)) * (w + x(i + 5, 1 + k))
    end do
  end do
end subroutine
[/fortran]

Jim Dempsey

styc · ‎10-18-2012

@Jim With your modifications, the compiler no longer vectorizes the loop. That eliminates the root cause of all raised issues. If the second k-loop is force-vectorized with '!dec$ simd', the scalar version is still generated. Performance-wise, not vectorizing is perhaps the better decision. In the real code (not this reduced case), disabling vectorization with '!dec$ novector' for this loop improves performance. I cannot tell whether that is because this loop is simply not worth vectorizing, or because the extra dependence caused by VPSLLD is incurring excessive delays. I can find no way to tell the compiler not to generate VPSLLD.
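As a rough illustration of where such a directive goes, here is a minimal sketch with a trip-count-4 loop. The subroutine `scale_cols` and its arguments are hypothetical and exist only to show placement. The directive applies to the DO loop that immediately follows it, and compilers that do not recognize it treat the line as an ordinary comment.

```fortran
! Hypothetical helper, for illustrating directive placement only.
subroutine scale_cols(ld, m, i, s, x)
  integer, intent(in) :: ld, m, i
  double precision, intent(in) :: s
  double precision, intent(inout) :: x(ld, m)
  integer k
  ! Ask ifort not to vectorize the next loop. Replacing this line with
  ! '!dec$ simd' would request vectorization instead.
!dec$ novector
  do k = 0, 3
    x(i, 1 + k) = s * x(i, 1 + k)
  end do
end subroutine
```

Starting the directive in column 1 keeps it valid regardless of source form.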

jimdempseyatthecove · ‎10-18-2012

>>With your modifications, the compiler no longer vectorizes the loop. That eliminates the root cause of all raised issues. If the second k-loop is force-vectorized with '!dec$ simd', the scalar version is still generated.

From my programming perspective:

My primary concern is not whether the compiler reports vectorization, or whether vectorization is used at all. Rather, it is that the compiler uses vectorization when it is appropriate (read: faster code).

With the pick list modifications, did the code run faster than without the pick list (with and without explicit simd vectorization)?

What I am trying to teach the readers of this thread is: do not assume vectorization is always best (or force it when it is not appropriate), and at times help out the compiler (e.g. by incorporating the pick list).

BTW - it was a good catch to look down to the disassembly level to notice the root cause of additional overhead. Not all posters do this. This is not as hard as it seems.

Jim Dempsey

styc · ‎10-18-2012

jimdempseyatthecove wrote:
>>With your modifications, the compiler no longer vectorizes the loop. That eliminates the root cause of all raised issues. If the second k-loop is force-vectorized with '!dec$ simd', the scalar version is still generated.

From my programming perspective:

My primary concern is not if the compiler reports vectorization or not, or if vectorization is used or not.
Rather, that the compiler uses vectorization when it is appropriate (read faster code).

With the pick list modifications, did the code run faster than without pick list (with and without explicit simd vectorization)?

What I am trying to teach the readers of this thread is: Do not assume vectorization is always best (force it when not appropriate), and at times help out the compiler (e.g. incorporating the pick list).

BTW - it was a good catch to look down to the disassembly level to notice the root cause of additional overhead. Not all posters do this. This is not as hard as it seems.

Jim Dempsey

Maybe you misunderstood. I meant that the compiler generates a useless scalar remainder loop when it decides to vectorize. This is a separate issue from whether it makes good decisions on whether to vectorize or not.

I did not test your code, but given that it prevents vectorization, I presume that '!dec$ novector' will have the same effect (or a better one, because the compiler does not need to worry about the o?k arrays). And yes, my measurements did show that disabling vectorization results in better performance. The uncertain point is, as I mentioned, the reason for the performance degradation. Whether this code is worth vectorizing cannot be tested right now because the compiler generates less-than-ideal vector code.

Kevin_D_Intel · ‎11-06-2012

Development continues to investigate and commented on the scalar version, writing: "Can't get rid of scalar code ---- if vectorized. The array dim size LD may be zero and the scalar code is used for that fall back path." I will update as I hear more and pass back any comments you may have.

styc · ‎11-06-2012

Kevin Davis (Intel) wrote:
Development continues to investigate and commented regarding the scalar version writing "Can't get rid of scalar code ---- if vectorized. The array dim size LD may be zero and the scalar code is used for that fall back path."

I will update as I hear more and pass any comments back you may have.

The value of ld cannot affect the vectorizability of the k-loop. k only ever appears in the second subscripts of references to the o and x arrays, which have nothing to do with ld. Furthermore, if ld is indeed zero, then the i-loop must not run, i.e., n must be less than 9; otherwise all accesses to o and x would be out of bounds. In that case, no code is ever needed, including the scalar loop.

Kevin_D_Intel · ‎11-07-2012

Thank you for the feedback. I failed to indicate earlier that this issue was reported to Developers under the internal tracking id, DPD200237580. I added your latest feedback and will update when I learn more. (Internal tracking id: DPD200237580)

styc · ‎11-07-2012

Kevin Davis (Intel) wrote:
Thank you for the feedback. I failed to indicate earlier that this issue was reported to Developers under the internal tracking id, DPD200237580. I added your latest feedback and will update when I learn more.

(Internal tracking id: DPD200237580)

Does the issue regarding the use of vector shift vs. scalar shift have a tracking id as well?

Kevin_D_Intel · ‎11-07-2012

Both issues were reported under the same internal tracking id I noted earlier. If it becomes necessary to split them, I will, but Development is considering both issues at this time.