Solved: Non-value-safe optimization despite -fp-model precise

djungl · ‎03-02-2023

Hello,

I am observing different floating point results from a loop, depending on whether it is compiled with optimization or not. I use -fp-model precise so this seems unexpected.

This is my function/loop

typedef struct {
  int n;
  int qn;
} Param;

void function(Param *param, int *qpnt, double *qdiag, double *qnzs,
              int *qidx, double *x, double *ax)
{
   int i, j;

   for (i = 1; i <= param->n; ++i) ax[i] = 0e0;
   if (param->qn > 0) {
      if (qpnt) {
         for (i = 1; i <= param->n; ++i) {
            for (j = qpnt[i]; j < qpnt[i + 1]; ++j) {
               ax[qidx[j]] += x[i] * qnzs[j];
               ax[i] += x[qidx[j]] * qnzs[j];
            }
         }
      }
      for (i = 1; i <= param->n; ++i) ax[i] += x[i] * qdiag[i];
   }
}

I am using icc (ICC) 19.0.4.243 20190416 and compile as follows:

icc -O2 -fno-alias -align -fp-model precise -axAVX -fPIC -c -o loop.o loop.c

Looking at function.R in the generated assembly code, it looks like the nested loops in the above code are optimized as follows (note how results are accumulated in register xmm2 before they are committed to ax[i]):

for (i = 1; i <= param->n; ++i) {
  $xmm2 = 0
  for (j = qpnt[i]; j < qpnt[i + 1]; ++j) {
    ax[qidx[j]] += x[i] * qnzs[j];
    $xmm2 += x[qidx[j]] * qnzs[j];
  }
  ax[i] += $xmm2
}

This does not look like a value-safe optimization to me: it changes the order in which terms are accumulated in ax[i], and depending on the actual values that may lead to different round-off errors. In other words, the optimization reorders the addition of floating-point numbers.
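
To make the round-off argument concrete, here is a minimal C sketch. It is not taken from the application; the function names and the values are chosen purely for illustration. It evaluates the same three-term sum with two different groupings. The intermediates are stored in `volatile` variables so that each addition is rounded to double precision and cannot be regrouped by the compiler.

```c
#include <stdio.h>

/* Left-to-right accumulation: (a + b) + c, as written in the source loop. */
double sum_left(double a, double b, double c)
{
   volatile double t = a + b;   /* force rounding to double */
   volatile double r = t + c;
   return r;
}

/* Regrouped accumulation: a + (b + c), as in the optimized loop. */
double sum_regrouped(double a, double b, double c)
{
   volatile double t = b + c;   /* partial sum formed first */
   volatile double r = a + t;
   return r;
}

/* Print both groupings for one set of inputs. */
void show_grouping(double a, double b, double c)
{
   printf("(a + b) + c = %.17g\n", sum_left(a, b, c));
   printf("a + (b + c) = %.17g\n", sum_regrouped(a, b, c));
}
```

Assuming IEEE 754 doubles with round-to-nearest, calling show_grouping(1.0, 1e-16, 1e-16) gives exactly 1 for the first grouping and a value slightly above 1 for the second, because each 1e-16 on its own is below half an ulp of 1.0 while their sum is not.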

Am I misunderstanding the effect of -fp-model precise or am I missing some other flags?
My goal is that optimized and non-optimized code produces the same numbers.

It seems the reordering disappears if I drop the -fno-alias but I don't see how this could be related.

SeshaP_Intel · ‎03-03-2023

Hi,

Thank you for posting in Intel Communities.

Could you please provide us with a complete sample reproducer so that we can investigate the issue from our end?

Could you please try the latest Intel oneAPI C++ Compiler (icx) and let us know the results?

Could you please also provide the results you get from both the optimized and the non-optimized build?

By default, the compiler uses -fp-model=fast, which enables more aggressive optimizations on floating-point calculations.

Thanks and Regards,

Pendyala Sesha Srinivas

djungl · ‎03-03-2023

Hello,

I have attached object files and disassembled version of the object files.

Right now I cannot try oneAPI and I also have no easy way to extract the data from our application that shows the problem. However, I don't think this is necessary since IMO the issue is obvious in the assembly code. In loop.optimized.s you can see this code:

  cf:   c5 f9 57 c0             vxorpd %xmm0,%xmm0,%xmm0
 ...
  eb:   c5 f9 28 d0             vmovapd %xmm0,%xmm2
 ...
 11b:   c5 eb 58 d7             vaddsd %xmm7,%xmm2,%xmm2
 11f:   c4 a1 7b 11 2c d8       vmovsd %xmm5,(%rax,%r11,8)
 125:   4c 3b cb                cmp    %rbx,%r9
 128:   72 d0                   jb     fa <function.R+0xaa>
 12a:   c5 eb 58 4c d0 08       vaddsd 0x8(%rax,%rdx,8),%xmm2,%xmm1
 130:   c5 fb 11 4c d0 08       vmovsd %xmm1,0x8(%rax,%rdx,8)
 136:   48 ff c2                inc    %rdx
 139:   49 ff c0                inc    %r8
 13c:   49 3b d6                cmp    %r14,%rdx
 13f:   72 92                   jb     d3 <function.R+0x83>

As far as I understand, the instruction at address eb sets xmm2=0. Then at address 11b you can see how x[qidx[j]] * qnzs[j] is accumulated in xmm2 rather than in ax[i]. Finally, at address 12a/130 the results of xmm2 are added to ax[i] after the inner loop.

Some (fake) input where things may go wrong is this:

qpnt[] = { -1, 0, 2, 4 }
qidx[] = { 2, 3, 1, 3 }

Then for i = 1 the inner loop sets (only considering the first statement in the loop)

ax[2] += x[1] * qnzs[0] // ax[qidx[j]] += x[i] * qnzs[j] for j=0

ax[3] += x[1] * qnzs[1] // ax[qidx[j]] += x[i] * qnzs[j] for j=1

Now for i = 2, the second statement in the loop will do

ax[2] += x[1] * qnzs[2] // ax[i] += x[qidx[j]] * qnzs[j] for j=2

ax[2] += x[3] * qnzs[3] // ax[i] += x[qidx[j]] * qnzs[j] for j=3

in non-optimized mode. In other words, it computes (ax[2] + x[1] * qnzs[2]) + x[3] * qnzs[3].

On the other hand, in optimized mode the code will compute ax[2] + (x[1] * qnzs[2] + x[3] * qnzs[3]), which is not guaranteed to yield the same value, so it is not a value-safe optimization. This optimization would only be safe if ax[2]=0 when i=2. However, that is not the case, since for i=1 the outer loop already changed ax[2] from 0.
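
The two evaluation orders can be compared directly with a small sketch. The first function below is the nested loop as written in the source; the second is a hand-written model of what the optimized assembly appears to do. It is not compiler output. The index arrays are the fake input from above, while the numeric values of x and qnzs and the helper name run_fake_input are made up for this illustration.

```c
/* Nested loop as written in the source: ax[i] is updated in memory
 * on every inner iteration. */
void accumulate_in_memory(int n, const int *qpnt, const int *qidx,
                          const double *qnzs, const double *x, double *ax)
{
   int i, j;
   for (i = 1; i <= n; ++i) {
      for (j = qpnt[i]; j < qpnt[i + 1]; ++j) {
         ax[qidx[j]] += x[i] * qnzs[j];
         ax[i] += x[qidx[j]] * qnzs[j];
      }
   }
}

/* Model of the optimized loop: the contributions to ax[i] are summed
 * in a temporary (xmm2 in the assembly) and added to ax[i] afterwards. */
void accumulate_in_register(int n, const int *qpnt, const int *qidx,
                            const double *qnzs, const double *x, double *ax)
{
   int i, j;
   for (i = 1; i <= n; ++i) {
      double acc = 0.0;
      for (j = qpnt[i]; j < qpnt[i + 1]; ++j) {
         ax[qidx[j]] += x[i] * qnzs[j];
         acc += x[qidx[j]] * qnzs[j];
      }
      ax[i] += acc;
   }
}

/* Run one variant on the fake index data and return ax[2].
 * The values in qnzs and x are hypothetical, picked so that ax[2] is
 * already 1.0 when the iteration i = 2 starts. */
double run_fake_input(int use_register)
{
   const int qpnt[] = { -1, 0, 2, 4 };
   const int qidx[] = { 2, 3, 1, 3 };
   const double qnzs[] = { 1.0, 0.0, 1e-16, 1e-16 };
   const double x[] = { 0.0, 1.0, 0.0, 1.0 };
   double ax[] = { 0.0, 0.0, 0.0, 0.0 };

   if (use_register)
      accumulate_in_register(2, qpnt, qidx, qnzs, x, ax);
   else
      accumulate_in_memory(2, qpnt, qidx, qnzs, x, ax);
   return ax[2];
}
```

Assuming IEEE 754 doubles with round-to-nearest and a compiler that does not reassociate, run_fake_input(0) returns exactly 1.0, whereas run_fake_input(1) returns the next representable double above 1.0.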

I don't understand your comment about -fp-model fast. I explicitly specify -fp-model precise. Are you saying that this somehow gets overridden?

SeshaP_Intel · ‎03-08-2023

Hi,

It would be very helpful if you could provide the sample reproducer source files (.c/.cpp) so that we can investigate this issue further from our end.

Thanks and Regards,

Pendyala Sesha Srinivas

djungl · ‎03-08-2023

I'll try to get something but it will not be easy.

In the meantime, what is your take on the generated assembly code? It looks plain wrong to me. Or am I making wrong assumptions?

Also, again, what was your comment about -fp-model fast supposed to mean? I am worried that you were saying that the compiler would override an explicit -fp-model precise.

djungl · ‎03-10-2023

I now have a way to reproduce the different answers (depending on whether -fno-alias is set or not).

The code is attached. `make all` prints this for me:

rm -f loop-alias loop-no-alias
icc -O2 -align -fp-model precise -axAVX -prec_div -o loop-alias loop.c loop_m.c -lm
echo "alias"
alias
./loop-alias
Result:       1.000000e-12
icc -O2 -align -fp-model precise -axAVX -prec_div -fno-alias -o loop-no-alias loop.c loop_m.c -lm
echo "fno-alias"
fno-alias
./loop-no-alias
Result:       0.000000e+00

As you can see, the results differ even though `-fp-model precise` is specified. With that option, we would expect only value-safe optimizations and thus the same results in either case.
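
Since a six-digit `%e` format can hide small differences, a bit-level comparison is a stricter way to check whether two builds agree. The following is a minimal sketch, assuming 64-bit IEEE 754 doubles. The helper names double_bits and same_bits are hypothetical and are not part of the attached reproducer.

```c
#include <stdint.h>
#include <string.h>

/* Return the raw 64-bit pattern of a double (assumes 64-bit IEEE 754). */
uint64_t double_bits(double v)
{
   uint64_t bits;
   memcpy(&bits, &v, sizeof bits);   /* well-defined way to reinterpret */
   return bits;
}

/* Return 1 if two arrays of n doubles are bit-for-bit identical, else 0. */
int same_bits(const double *a, const double *b, size_t n)
{
   size_t k;
   for (k = 0; k < n; ++k) {
      if (double_bits(a[k]) != double_bits(b[k])) return 0;
   }
   return 1;
}
```

Note that such a comparison is deliberately strict: it also treats +0.0 and -0.0 as different, and it distinguishes NaNs with different payloads.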

SeshaP_Intel · ‎03-16-2023

Hi,

In your case, because you specify -fp-model precise explicitly, the compiler honors it. This option tells the compiler to strictly adhere to value-safe optimizations when implementing floating-point calculations.

The compiler falls back to -fp-model=fast=1 only if you do not specify any floating-point option.

Could you please try with the latest Intel oneAPI C++ Compiler(icx)?

The Intel(R) C++ Compiler Classic (ICC) is deprecated and will be removed from product releases in the second half of 2023, so we recommend that you use the Intel(R) oneAPI DPC++/C++ Compiler (ICX).

Could you please try making the following changes in your Makefile and let us know the results?

CC = icx
CFLAGS = -O2 -align -fp-model precise -axAVX

We are getting the same result in both cases. Please let us know if you still face any issues.

Thanks and Regards,

Pendyala Sesha Srinivas

djungl · ‎03-16-2023

Hello,

I could not reproduce the problem with icx, more precisely with "Intel(R) oneAPI DPC++/C++ Compiler 2022.2.1 (2022.2.1.20221020)".

However, I would still like you to confirm whether this is a bug in the compiler or not? And if so, which versions of the compiler are affected, which are fixed? Or is this a misuse of the `-fno-alias` flag on our side (if so, could you please explain why)?

Thanks a lot,

Daniel

SeshaP_Intel · ‎03-20-2023

Hi,

We have tried with Intel(R) oneAPI DPC++/C++ Compiler 2023.0.0 (2023.0.0.20221201). We are getting the same result in both cases.

Please try both cases and observe the results. Please let us know if you face any issues with the latest Intel(R) oneAPI DPC++/C++ Compiler 2023.0.0 (2023.0.0.20221201).

Thanks and Regards,

Pendyala Sesha Srinivas

SeshaP_Intel · ‎03-24-2023

Hi,

Has the information provided above helped? If yes, could you please confirm whether we can close this thread from our end?

Thanks and Regards,

Pendyala Sesha Srinivas

djungl · ‎03-26-2023

No, sorry, your replies did not answer my question. Like I said, with oneAPI the issues are gone but we still don't know whether this is due to "random luck" or whether some problem was fixed when moving from icc to icx.

Can you confirm that the behavior we saw from icc was a bug in the old compiler and that our use of -fno-alias is not supposed to wreak havoc in this case?

SeshaP_Intel · ‎03-31-2023

Hi,

We are working on this issue internally. We will get back to you soon.

Thanks and Regards,

Pendyala Sesha Srinivas

Frank_R_1 · ‎04-06-2023

Hi,

We had similar problems. We use the following options to get bit-identical results on Linux and Windows, regardless of whether it is a debug or a release build:

windows icl
-fp:consistent -Qimf-arch-consistency:true

linux icc
-fp-model consistent -imf-arch-consistency=true

linux icx
-fp-model=precise -fimf-arch-consistency=true

windows icx
-fp=precise -Qimf-arch-consistency:true

Best regards

Frank

djungl · ‎04-11-2023

Thanks a lot for this tip. I'll have to check whether/how this affects the performance of our application. Just to be sure: does "-fp-model consistent" imply "-fp-model precise"? This is not clear to me from the documentation.

Frank_R_1 · ‎04-12-2023

Hi,

For icc/icl we used -fp-model precise for C/C++/Fortran on Windows/Linux with older Intel compilers. Then -fp-model consistent was introduced, which guarantees bitwise reproducible results regardless of the processor or platform.

For icx we had to change this to the above, because icx does not know about -fp-model consistent.

We also use the CBWR settings for MKL and for MPI to get bit-identical results.

Best regards

Frank

djungl · ‎04-12-2023

Thank you! I will give this a try.

SeshaP_Intel · ‎04-07-2023

Hi,

Thanks for your patience.

This was not a bug. In this case, ICC performed additional optimizations because "-fno-alias" was specified. We recommend that you do not use -fno-alias where there is a possibility that the indexes can overlap.

Thanks and Regards,

Pendyala Sesha Srinivas

djungl · ‎04-11-2023

Thank you for getting to the bottom of this. I'll replace "-fno-alias" by "-ansi-alias".

SeshaP_Intel · ‎04-12-2023

Hi,

Thanks for accepting our solution. If you need any additional information, please post a new question as this thread will no longer be monitored by Intel.

Thanks and Regards,

Pendyala Sesha Srinivas