One of my users is running into a problem with the Intel 18.0.2 compiler and multiple-target dispatch.
The code was compiled with "-O2 -xCORE-AVX2 -axCORE-AVX512,MIC-AVX512". When run on either SKX or KNL, the version with multiple-target dispatch incorrectly processes a ternary assignment of the form "a += (a > b ? -1 : +1)*b" (where "a" is an "int" variable and "b" is a constant defined by a macro). When the initial value of "a" is "b+1", the code incorrectly updates "a" to "2*b+1" instead of to "1".
Compiling with either "-xCORE-AVX512" or "-xMIC-AVX512" results in the correct behavior.
I have not been able to reproduce this in isolation, but the incorrect result is clear in the output of the user code.
Does this look like a known issue?
We have tried to reproduce this using the latest version (19.1) of ICC and got the correct result. We also tried with 18.0.2 and got the correct result.
Could you ask him to upgrade to the latest version and see if the error persists?
Do let us know.
Sorry I was not clear -- the same error occurs when running on either SKX or KNL using the multi-target binary, and does not occur on either system with the appropriate directly-targeted binary.
The error occurs in the midst of a larger code, where it is used to implement periodic boundary conditions in three dimensions. When the error occurs, the computation is incorrect in two of the three indices.
"MIC-AVX512" is KNL -- KNC is no longer supported as a target.
I have asked the user to try different compiler versions -- still waiting on a response.
The workaround is trivial (keep two executables and use the correct one!), but this is the first bug I have seen in the multi-dispatch code and was mostly curious about whether it looked familiar.
The specific value for which this case fails is "b=524288" and "a=b+1". The correct interpretation of the ternary is a+=(-b), so the initial value of "a=b+1" turns into "a=1". With the directly targeted binaries, the code delivers this result for all three indices. With the multi-target binary, the first two indices are incorrectly updated as a+=b (the other branch of the ternary), so the initial value of "a=b+1" turns into "a=2*b+1". This specific behavior makes it seem less likely that the trouble is an out-of-bounds memory reference or the like, and the occurrence with the value 2^19 seems like it might be a clue....
I will run "objdump -d" on the three executables and see if I can find the exact spot where this ternary is evaluated. Probably not, but sometimes I get lucky....
We've addressed a number of issues wrt AVX-512 in later compiler versions, but I couldn't find your specific case. It's good that the User is willing to try the newer version. Please let us know how that goes.
>>and does not occur on either system with the appropriate directly-targeted binary.
Sorry, I wasn't clear either. Seeing that each of the non-selected instruction sets should function correctly (at least for the conditional move), my suggestion was to eliminate the multi-target dispatch and see whether you observe the same or similar incorrect results. This would indicate whether the error is in (caused by, or a side effect of) the dispatcher .OR. the "correct" target code for the designated architecture.
While this doesn't help you directly, it should aid the Intel support programmers in where to look.
>>does not occur on either system with the appropriate directly-targeted binary
This leaves you the opportunity to use your own dispatcher to call one of 2, 3, 4 routines, each compiled with a single targeted architecture. While the code will be larger, the single executable should run just as fast.
If the application is C++, you could use a separate namespace for each architecture.
For Fortran, you may have to name mangle (place suffix on procedure name).
Unfortunately the user is not able to switch to the Intel 19 compilers at this time.
I have made considerable progress in isolating the incorrect code, but don't yet have a standalone reproducer.
I was surprised at how different the generated code was for the various compiler options. Considering four cases that I have looked at in detail (all compiled with "-O2"):
For this code (embedded in a larger routine, which is embedded in a larger application):
// LEN is a constant value of 524288 (0x80000)
// q is an int32 vector with 3 elements
for (u_char dim = 0; dim < 3; ++dim)
    q[dim] += (q[dim] > LEN ? -1 : +1)*LEN;
Examination of the output of "objdump -d" shows that in the incorrect version of the vector code, the compiler apparently "forgot" to load the data into an xmm register before starting the computations!
Extracting just the vector code from (3) and (4):
|Correct version|Incorrect version|Comments|
|---|---|---|
|vmovq (%rsp),%xmm3| |64-bit load = 1st 2 32-bit elements|
|vpcmpgtd 0x11cf73(%rip),%xmm3,%xmm1| |seems important?|
|vpandn 0x11cf7b(%rip),%xmm1,%xmm0|vpandn 0x12ffe8(%rip),%xmm1,%xmm0|undefined xmm1 in wrong version|
|vpor %xmm1,%xmm0,%xmm2|vpor %xmm1,%xmm0,%xmm2| |
|vpslld $0x13,%xmm2,%xmm4|vpslld $0x13,%xmm2,%xmm4| |
| |vmovq (%rsp),%xmm3|aha, now we load the data|
|vpaddd %xmm4,%xmm3,%xmm5|vpaddd %xmm4,%xmm3,%xmm5| |
|vmovq %xmm5,(%rsp)|vmovq %xmm5,(%rsp)| |
Could the value in %xmm1 have been hoisted upstream? Certainly possible, but there are no other SIMD register references in this function, or in the two functions that call this function. There are uses of xmm1 starting at two levels up the calling tree, but in all the cases I reviewed those were floating-point computations, while this is an integer calculation.
Perhaps these more detailed characteristics will make it clear whether this problem is related to any known bugs....
While this bug should be fixed, does re-expressing the code produce the correct result:
// LEN is a constant value of 524288 (0x80000)
// q is an int32 vector with 3 elements
for (u_char dim = 0; dim < 3; ++dim)
    q[dim] += q[dim] > LEN ? -LEN : LEN;
Also, if q is a triplet within an array of triplets, why is the code not taking advantage of AVX512 by treating the array of triplets as an array of scalars? IOW, use all 16 lanes of AVX512 instead of 2 lanes + 1 lane.
This bug was found with 18u2 (our default on the TACC Stampede2 system).
The error occurs when compiling the 3-target binary with either -O2 or -O3, but not with -O0 or -O1. There is no indication that the compiler generated multiple versions of the routine, so it is not surprising that the error occurs when running on either KNL or SKX.
With the 2-target binary (-xCORE-AVX2 -axCOMMON-AVX512) the code is correct at all four optimization levels. (Again, there is no indication that the compiler created multiple versions of the code.)
The KNL and SKX binaries generated purely scalar code here, while the 3-target and 2-target versions generated a combination of vector and scalar code. All of the SIMD instructions generated in both cases were AVX instructions (which are obviously supported by the "base" architecture of all four compilation targets). Weird.....
The application includes a bunch of infrastructure (including PetSC, Hypre, and p4est), so going to Intel 19 is a risk. (18u2 is also the latest compiler that is fully integrated into our module system and infrastructure on Stampede2.) The user reports having trouble with memory leaks with PetSC and Intel 19 in another application, making them a bit cautious.
The reason I asked you to try 19.1 is that I don't think we will release any more 18.0 updates.
Is it possible to do a quick check by just compiling that one file with 19.1?
It took a while to work around all the dependencies, but from examination of the assembly listing it is clear that the bug exists in Intel 18.0.2 and Intel 18.0.3, but is fixed in 18.0.4. The bug remains fixed in Intel 19.0.0, 19.0.1, 19.0.2, 19.0.3, and 19.0.5.
We have lots of workaround options, so I think we are done with this issue.
I remain curious about whether it "looks like" any known fixes in 18.0.4
More bounding details....
For this code, the error requires all of the following:
All of these cases show exactly the behavior described above -- the compiler somehow "forgets" that it needs to load the data and perform the compare operation before starting to manipulate the mask resulting from the compare.
Thanks for your investigation and verification. It looks like a known issue that has been fixed in later versions. I looked into our bug database again, but haven't been able to pinpoint your exact case.