Possible bug in CPU dispatch code

McCalpinJohn · ‎07-09-2020

One of my users is running into a problem with the Intel 18.0.2 compiler and multiple-target dispatch.

The code was compiled with "-O2 -xCORE-AVX2 -axCORE-AVX512,MIC-AVX512". When run on either SKX or KNL, the version with multiple-target dispatch incorrectly processes a ternary assignment of the form "a += (a > b ? -1 : +1)*b" (where "a" is an "int" variable and "b" is a constant defined by a macro). When the initial value of "a" is "b+1", the code incorrectly updates "a" to "2*b+1" instead of to "1".

Compiling with either "-xCORE-AVX512" or "-xMIC-AVX512" results in the correct behavior.

I have not been able to reproduce this in isolation, but the incorrect result is clear in the output of the user code.

Does this look like a known issue?

PrasanthD_intel · ‎07-10-2020

Hi John,

We have tried to reproduce using the latest version (19.1) of ICC.

We got the correct result.

We have tried with 18.0.2 also and got the correct result.

Could you ask him to upgrade to the latest version and see if the error persists?

Do let us know.

Regards

Prasanth

McCalpinJohn · ‎07-10-2020

I have asked the user to try different compiler versions -- still waiting on a response.

The workaround is trivial (keep two executables and use the correct one!), but this is the first bug I have seen in the multi-dispatch code and was mostly curious about whether it looked familiar.

The specific value for which this case fails is "b=524288" and "a=b+1". The correct interpretation of the ternary is a+=(-b), so the initial value of "a=b+1" turns into "a=1". With the directly targeted binaries, the code delivers this result for all three indices. With the multi-target binary, the first two indices are incorrectly updated as a+=b (the other branch of the ternary), so the initial value of "a=b+1" turns into "a=2*b+1". This specific behavior makes it seem less likely that the trouble is with an out-of-bounds memory reference or the like, and the occurrence with the initial value of 2^19 seems like it might be a clue....

I will run "objdump -d" on the three executables and see if I can find the exact spot where this ternary is evaluated. Probably not, but sometimes I get lucky....

jimdempseyatthecove · ‎07-10-2020

John,

Do you get the error, on the specific machine, when you singly target the architecture?

Also, I thought MIC-AVX512 is KNC not KNL

Jim Dempsey

McCalpinJohn · ‎07-10-2020

Sorry I was not clear -- the same error occurs when running on either SKX or KNL using the multi-target binary, and does not occur on either system with the appropriate directly-targeted binary.

The error occurs in the midst of a larger code, where it is used to implement periodic boundary conditions in three dimensions. When the error occurs, the computation is incorrect in two of the three indices.

"MIC-AVX512" is KNL -- KNC is no longer supported as a target.

jimdempseyatthecove · ‎07-11-2020

>>and does not occur on either system with the appropriate directly-targeted binary.

Sorry I wasn't clear too. Seeing that each of the non-desired instruction sets should function (at least for the conditional move), my suggestion was to see if by eliminating the multi-target (tests) you observe the same/similar incorrect results. This would be an indicator as to if the error is in (caused by/side effect of) the dispatcher .OR. the "correct" target code for the designated architecture.

While this doesn't help you directly, it should aid the Intel support programmers in where to look.

>>does not occur on either system with the appropriate directly-targeted binary

This leaves you the opportunity to use your own dispatcher to call one of 2, 3, 4 routines, each compiled with a single targeted architecture. While the code will be larger, the single executable should run just as fast.

If the application is C++, you could use a separate namespace for each architecture.

For Fortran, you may have to name mangle (place suffix on procedure name).

Jim Dempsey

Viet_H_Intel · ‎07-10-2020

We've addressed a number of issues wrt AVX-512 in later compiler versions, but I couldn't find your specific case. It's good that the User is willing to try the newer version. Please let us know how that goes.

Thanks,

Viet

McCalpinJohn · ‎07-16-2020

Unfortunately the user is not able to switch to the Intel 19 compilers at this time.

I have made considerable progress in isolating the incorrect code, but don't yet have a standalone reproducer.

I was surprised at how different the generated code was for the various compiler options. Considering four cases that I have looked at in detail (all compiled with "-O2"):

(1) Compiled for KNL only
(2) Compiled for SKX only
(3) Original multi-target: -xCORE-AVX2 -axCORE-AVX512,MIC-AVX512
(4) New multi-target: -xCORE-AVX2 -axCOMMON-AVX512

For this code (embedded in a larger routine, which is embedded in a larger application):

// LEN is a constant value of 524288 (0x80000)
// q is an int32 vector with 3 elements
for (u_char dim = 0; dim < 3; ++dim)
   q[dim] += (q[dim] > LEN ? -1 : +1)*LEN;

Examination of the output of "objdump -d" shows that

Versions (1) and (2) are implemented with purely scalar code.
- For each of the three dimensions, the compiler places the values "LEN" and "-LEN" in two GPRs, compares q[dim] with LEN, uses a conditional move to define an update value, adds the selected update to the original, and stores the result.
Versions (3) and (4) are implemented with a combination of vector and scalar code.
- The first two dimensions are combined into a single piece of vector code.
- The third dimension is implemented with (effectively) the same scalar code that was used in variants (1) and (2).
- The vector code used by (3) and (4) is similar (but only (4) is correct).
Versions (1) (2) and (4) give the correct results for all three elements.
Version (3) gives the correct result only for the third dimension (computed using scalar arithmetic), and gives the same incorrect result for the first two dimensions.
- The incorrect result is the same as one would expect to see if the wrong branch of the ternary was chosen.

In the incorrect version of the vector code, it looks like the compiler "forgot" to load the data into an xmm register before starting the computations!

Extracting just the vector code from (3) and (4):

| Correct version | Incorrect version | Comments |
|---|---|---|
| vmovq (%rsp),%xmm3 | | 64-bit load = 1st 2 32-bit elements |
| vpcmpgtd 0x11cf73(%rip),%xmm3,%xmm1 | | seems important? |
| vpandn 0x11cf7b(%rip),%xmm1,%xmm0 | vpandn 0x12ffe8(%rip),%xmm1,%xmm0 | undefined xmm1 in wrong version |
| vpor %xmm1,%xmm0,%xmm2 | vpor %xmm1,%xmm0,%xmm2 | |
| vpslld $0x13,%xmm2,%xmm4 | vpslld $0x13,%xmm2,%xmm4 | |
| | vmovq (%rsp),%xmm3 | aha, now we load the data |
| vpaddd %xmm4,%xmm3,%xmm5 | vpaddd %xmm4,%xmm3,%xmm5 | |
| vmovq %xmm5,(%rsp) | vmovq %xmm5,(%rsp) | |

Could the value in %xmm1 have been hoisted upstream? Certainly possible, but there are no other SIMD register references in this function, or in the two functions that call this function. There are uses of xmm1 starting at two levels up the calling tree, but in all the cases I reviewed those were floating-point computations, while this is an integer calculation.

Perhaps these more detailed characteristics will make it clear whether this problem is related to any known bugs....

jimdempseyatthecove · ‎07-16-2020

While this bug should be fixed, does re-expressing the code produce the correct result:

// LEN is a constant value of 524288 (0x80000)
// q is an int32 vector with 3 elements
for (u_char dim = 0; dim < 3; ++dim)
   q[dim] += q[dim] > LEN ? -LEN : LEN;

Also, if q is a triplet, within an array of triplets, why then is the code not taking advantage of the AVX512 by treating the array of triplets as an array of scalars? IOW use all 16 lanes of AVX512 instead of 2 lanes + 1 lane.

Jim Dempsey

Viet_H_Intel · ‎07-16-2020

Hi "Dr. Bandwidth",

I guess you were able to reproduce this with 18.0 compiler and wonder if you have a chance to try the 19.1 version?

Thanks,

Viet

McCalpinJohn · ‎07-16-2020

This bug was found with 18u2 (our default on the TACC Stampede2 system).

The error occurs when compiling the 3-target binary with either -O2 or -O3, but not with -O0 or -O1. There is no indication that the compiler generated multiple versions of the routine, so it is not surprising that the error occurs when running on either KNL or SKX.

With the 2-target binary (-xCORE-AVX2 -axCOMMON-AVX512) the code is correct at all four optimization levels. (Again, there is no indication that the compiler created multiple versions of the code.)

The KNL and SKX binaries generated purely scalar code here, while the 3-target and 2-target versions generated a combination of vector and scalar code. All of the SIMD instructions generated in both cases were AVX instructions (which are obviously supported by the "base" architecture of all four compilation targets). Weird.....

The application includes a bunch of infrastructure (including PETSc, Hypre, and p4est), so going to Intel 19 is a risk. (18u2 is also the latest compiler that is fully integrated into our module system and infrastructure on Stampede2.) The user reports having trouble with memory leaks with PETSc and Intel 19 in another application, making them a bit cautious.

Viet_H_Intel · ‎07-16-2020

The reason I asked to try with 19.1 is that I don't think we will release any more 18.0 updates.

Is it possible to do a quick check by just compiling that one file with 19.1?

Thanks,

Viet

McCalpinJohn · ‎07-17-2020

It took a while to work around all the dependencies, but from examination of the assembly listing it is clear that the bug exists in Intel 18.0.2 and Intel 18.0.3, but is fixed in 18.0.4. The bug remains fixed in Intel 19.0.0, 19.0.1, 19.0.2, 19.0.3, and 19.0.5.

We have lots of workaround options, so I think we are done with this issue.

I remain curious about whether it "looks like" any known fixes in 18.0.4.

McCalpinJohn · ‎07-17-2020

More bounding details....

For this code, the error requires all of the following:

Intel compiler revision (18.0.2 OR 18.0.3), AND
Optimization level ("-O2" OR "-O3"), AND
Base target architecture of ("-xAVX" OR "-xCORE-AVX2" OR "-xCOMMON-AVX512"), AND
One or more optional target architectures that include "-axCORE-AVX512"

All of these cases show exactly the behavior described above -- the compiler somehow "forgets" that it needs to load the data and perform the compare operation before starting to manipulate the mask resulting from the compare.

Viet_H_Intel · ‎07-17-2020

Thanks for your investigation and verification. It looks like a known issue and has been fixed in the later versions. I just looked into our bug database again, but haven't been able to pinpoint your case.

PrasanthD_intel · ‎07-21-2020

Hi John,

Since your issue has been resolved we are closing this thread.

If you require additional assistance please start a new thread.

Regards

Prasanth