Different results on different PCs.

eliopoulos · ‎11-21-2023

Hi,

I have installed Windows 11, Visual Studio 2019 Community and Intel oneAPI 2021 Base and HPC Toolkit. When I run my FORTRAN program on a different PC with Windows 10, I get different results. I have tried not only the same compiler, but the same executable file as well. Is there anything I can do to solve this issue?

Best regards,

Dr. Elias N. Eliopoulos

Steve_Lionel · ‎11-21-2023

Take a look at Improving Numerical Reproducibility in C/C++/Fortran to get some ideas. Your expectations may need to be tempered, but a bit of effort on your part may get you where you want to go. You can try instrumenting the program to output intermediate results to a file and compare to see where they start to differ.
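One idea in that direction, sketched here as an assumption about the cause rather than a diagnosis: the same executable can take different code paths inside MKL on CPUs with different instruction sets, and MKL's Conditional Numerical Reproducibility setting asks it to use one fixed path. The program name and the error handling below are illustrative; `mkl_cbwr_set`, `MKL_CBWR_COMPATIBLE` and `MKL_CBWR_SUCCESS` come from `mkl.fi`, and the program must be linked against MKL.

```fortran
program cnr_demo
   implicit none
   include 'mkl.fi'
   integer :: status

   ! Request the most conservative MKL code path so that results do not
   ! depend on the instruction set of the CPU. This call has to come
   ! before any other MKL routine is used.
   status = mkl_cbwr_set(MKL_CBWR_COMPATIBLE)
   if (status /= MKL_CBWR_SUCCESS) then
      print *, 'mkl_cbwr_set failed, status = ', status
      stop 1
   end if

   ! ... the existing calls to the MKL routines would follow here ...
   print *, 'MKL reproducibility mode set'
end program cnr_demo
```

This only addresses differences that originate inside MKL, and it may cost some performance; differences coming from the user's own code need the compiler options discussed in the article.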

eliopoulos · ‎11-26-2023

Actually, I don't know what exactly to do. I need instructions...

JohnNichols · ‎11-26-2023

We are not mind readers, or gurus.

1. Providing a sample of the program is the best way.

2. Show us the results as screenshots from the two computers.

3. Describe what you are doing numerically, which algorithms you are using, and which results are different.

4. If you do not make it easy for us, how can we help?

5. Are the differences of real interest, or is it a case of 9.999999 really being 10, etc.?

andrew_4619 · ‎11-26-2023

How different: small differences or huge differences? The former is down to many things and is expected; the latter is normally caused by bugs in your code, such as uninitialised variables. Compiling with as many checking options as possible is advised, to see what that throws up.

eliopoulos · ‎11-26-2023

My program is complicated; it calculates the fatigue life of wind turbine rotor blades, and I would not like to share it. I use the MKL library, specifically the dgetr routines. The results are obtained by applying cyclic load increments and progressive damage techniques until a strain limit is reached. In some cases, the differences are substantial. For example, in one case I get approximately 4100000 cycles on the first PC and 5200000 on the other, with the same input. Sorry for not sharing the program. I think that more details of it would not be of interest to you. The debug version of the program in Visual Studio finds nothing wrong. Sorry if I am not helping enough.

jimdempseyatthecove · ‎11-26-2023

In cases where convergence is involved and results vary greatly, a common cause for the discrepancy is a poorly written convergence routine. For example, using a fixed (delta) limit as opposed to a relative limit. e.g. (abs(A - B) .lt. 1E-6)

The problem with that type of limit is that the precision is not relative to the numbers being compared. Consider using:

D = min(abs(A), abs(B)) * 1E-6 ! or use max should one of the control variables reach/cross 0.0.

IF (abs(A - B) < D) ...

Or construct a limit based upon the use of EPSILON for the type times the magnitude of the lesser number times the ln2 of the number of bits of desired precision.

Also keep in mind that scalar totals can vary from vector totals due to the difference in the sequence of summation and accompanying roundoff errors.
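The effect of summation order can be seen in a small sketch such as the following. It is not taken from the program under discussion; the array contents and size are arbitrary choices for illustration. It adds the same single-precision numbers forwards and backwards, and the two totals typically differ in the trailing digits.

```fortran
program sum_order
   implicit none
   integer, parameter :: n = 1000000
   real, allocatable :: x(:)
   real :: fwd, bwd
   integer :: i

   allocate(x(n))

   ! Terms of very different magnitude make roundoff visible.
   do i = 1, n
      x(i) = 1.0 / real(i)
   end do

   ! Sum from the largest term to the smallest.
   fwd = 0.0
   do i = 1, n
      fwd = fwd + x(i)
   end do

   ! Sum the same terms from the smallest to the largest.
   bwd = 0.0
   do i = n, 1, -1
      bwd = bwd + x(i)
   end do

   print *, 'forward  sum:', fwd
   print *, 'backward sum:', bwd
   print *, 'difference  :', fwd - bwd
end program sum_order
```

A vectorised loop adds the terms in yet another order, which is why scalar and vector builds, or different CPUs, can disagree slightly on the same data.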

Also, this thread may be of interest.

Jim Dempsey

eliopoulos · ‎11-26-2023

My convergence routine is:

if (dabs(ex0i(n))>=elim) then

stop

endif

and I have set elim=0.1d0. I don't think that this is the problem. I assume that there is a precision difference in the dgetr routines or elsewhere and the small differences accumulate during the many millions of load steps used and produce differences in the results.

JohnNichols · ‎11-26-2023

If I measure the vibration of the turbine blades on day 1 and day n, I will get a change in the frequency of the blades; it is a simple exercise. The change is likely to be linear over, say, a year.

If the frequency changes, as it will, then you get a rate of change of the stiffness, and from that you can calibrate your fracture model.

Your blades are turning at, say, once every 8 seconds, so you have about 10,000 cycles per day and hence about 3 million per year, so your

4100000 cycles on the first PC and 5200000 on the other,

represents roughly 1.4 to 1.7 years at that rate; I would have expected, say, 20. Does this match the measured data?

You could do the same calculation using, say, the routines from Numerical Recipes in Fortran; you can get the 77 version for free, I think.

JohnNichols · ‎11-26-2023

!     DGETRF (F07ADF) Example Program Text
!     Mark 15 Release. NAG Copyright 1991.
!
!*******************************************
!                                          *
! Modified by Intel Corporation, July 2017 *
!                                          *
!*******************************************
!
!     .. Parameters ..
      INTEGER          NIN, NOUT
      PARAMETER        (NIN=5,NOUT=6)
      INTEGER          MMAX, NMAX, LDA
      PARAMETER        (MMAX=8,NMAX=8,LDA=MMAX)
!     .. Local Scalars ..
      INTEGER          I, INFO, J, M, N
!     .. Local Arrays ..
      DOUBLE PRECISION A(LDA,NMAX)
      INTEGER          IPIV(NMAX)
!     .. External Subroutines ..
      EXTERNAL         PRINT_MATRIX
      EXTERNAL         DGETRF
!     .. Intrinsic Functions ..
      INTRINSIC        MIN
!     .. Executable Statements ..
      WRITE (NOUT,*) 'DGETRF Example Program Results'
!     Open the data file and the results file
      open(nin, file="a.in")
      open(nout, file="a.out",status="UNKNOWN")
!     Skip heading in data file
      READ (NIN,*)
      READ (NIN,*) M, N
      IF (M.LE.MMAX .AND. N.LE.NMAX) THEN
!
!        Read A from data file
!
         READ (NIN,*) ((A(I,J),J=1,N),I=1,M)
!
!        Factorize A
!
         CALL DGETRF(M,N,A,LDA,IPIV,INFO)
!
!        Print details of factorization
!
         WRITE (NOUT,*)
         CALL PRINT_MATRIX( 'Details of factorization', M, N, A, LDA )

!        Print pivot indices

         WRITE (NOUT,*)
         WRITE (NOUT,*) 'IPIV'
         WRITE (NOUT,99999) (IPIV(I),I=1,MIN(M,N))
!
         IF (INFO.NE.0) WRITE (NOUT,*) 'The factor U is singular'
!
      END IF
!
99999 FORMAT ((3X,7I11))
!
      STOP
      END
!
!     End of DGETRF Example
!
!  =============================================================================
!
!     Auxiliary routine: printing a matrix.
!
      SUBROUTINE PRINT_MATRIX( DESC, M, N, A, LDA )
      CHARACTER*(*)    DESC
      INTEGER          M, N, LDA
      DOUBLE PRECISION A( LDA, * )
!
      INTEGER          I, J
!
      WRITE(*,*) DESC
      WRITE(*, 9999) ( J, J = 1, N)
      DO I = 1, M
         WRITE(*, 9998) I, ( A( I, J ), J = 1, N )
      END DO
!
 9998 FORMAT( I2, ' ', 11(:,1X,F10.4) )
 9999 FORMAT( '   ', 11(:,1X,I10) )
!
      RETURN
      END

Is this the procedure you are using?

eliopoulos · ‎11-26-2023

For now, I just test the program. I have not used actual load data. The program I use is all mine and I just call the dgetr routines.

JohnNichols · ‎11-26-2023

No one is interested in your code or what you wrote, we assume here it is yours and it is always better not to publish it and breach all sorts of rights.

But this is the dgetrf routine in MKL. I am asking: is this the one you used?

And if so how big is your input matrix as a test case?

JohnNichols · ‎11-26-2023

Provide a sample of the input matrix in 1,2,4,5, etc form as a text file -- and then we can compare your answers to others

see input file for sample.

JohnNichols · ‎11-27-2023

@Barbara

1. Your name tag does not pop up when one types "@Bar" it is quite interesting.

2. The program in the posts above is from the MKL samples; it will not run as written in modern F90. I have changed it so it can read the input file and run as an F90 program. I have no great desire to get into the MKL forum just to say this works; can you pass it along, or do I have to do the hard work?

Thanks

John (extremely lazy and somewhat busy!)

eliopoulos · ‎11-27-2023

It is not that simple. First of all, during the loop, many millions of matrices are produced and I have to decide which one to post. Secondly, I have to understand how to print the matrix. I need some time.
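Printing a matrix need not be complicated. A minimal sketch, assuming the matrix is a double precision rank-2 array: the module name, the subroutine name `dump_matrix_text` and the file name `matrix_step.txt` are made up for this example and are not part of MKL or of the program under discussion.

```fortran
module matrix_dump_mod
   implicit none
contains
   ! Write a matrix to a text file: the dimensions on the first line,
   ! then one entry per line as "row column value".
   subroutine dump_matrix_text(filename, a)
      character(len=*), intent(in) :: filename
      double precision, intent(in) :: a(:,:)
      integer :: iun, i, j

      open(newunit=iun, file=filename, form='formatted', &
           status='replace', action='write')
      write(iun, '(i0,1x,i0)') size(a, 1), size(a, 2)
      do j = 1, size(a, 2)
         do i = 1, size(a, 1)
            ! 17 significant digits are enough to recover a double
            ! precision value exactly when it is read back.
            write(iun, '(i0,1x,i0,1x,es24.16e3)') i, j, a(i, j)
         end do
      end do
      close(iun)
   end subroutine dump_matrix_text
end module matrix_dump_mod

program demo_dump
   use matrix_dump_mod
   implicit none
   double precision :: a(3,3)
   integer :: i, j

   ! Fill a small test matrix (a 3x3 Hilbert matrix) and dump it.
   do j = 1, 3
      do i = 1, 3
         a(i, j) = 1.0d0 / dble(i + j - 1)
      end do
   end do
   call dump_matrix_text('matrix_step.txt', a)
end program demo_dump
```

To choose which of the millions of matrices to dump, one option is to call the routine only at a few chosen load steps, for example the first step and every millionth step after it.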

andrew_4619 · ‎11-27-2023

I would just dump the matrix to an unformatted file.

You can then read and compare matrices from different runs on different computers easily.

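program dump_compare
   implicit none
   integer :: iun, l1, l2
   real :: mymat1(50,100), mymat2(50,100)
   real, parameter :: tolerance = 1.0e-6   ! example value; choose one that suits your data
   mymat1 = 1.0
   ! dump the matrix to an unformatted file (the file name is just an example)
   open(newunit=iun, file='mymat.bin', form='unformatted', status='replace')
   write(iun) mymat1
   close(iun)
   ! read it back; in practice this would be the file written on the other computer
   open(newunit=iun, file='mymat.bin', form='unformatted', status='old')
   read(iun) mymat2
   close(iun)
   do l1 = 1, size(mymat1, dim=2)
      do l2 = 1, size(mymat1, dim=1)
         if (abs(mymat1(l2,l1) - mymat2(l2,l1)) > tolerance) then
            ! do something, e.g. report where the matrices differ
            print *, 'difference at ', l2, l1, mymat1(l2,l1), mymat2(l2,l1)
         endif
      enddo
   enddo
end program dump_compare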

andrew_4619 · ‎11-27-2023

Dump some strategic matrices from a few points in the run on the two computers, then compare them to see where it starts to go wrong...

Then home in to find the root cause.

eliopoulos · ‎11-27-2023

The program is heavy. It will take many days to do so.

jimdempseyatthecove · ‎11-27-2023

>>The program is heavy. It will take many days to do so.

Yes, this happens. Be smart how you narrow in on the cause. This can save you a lot of time.

One recommendation I have is to instrument your code to write a trace log file on each computer and then compare the log files.

Be sure to provide for a tolerance, as there may be roundoff differences.

A good tolerance is tricky to provide. Generally, something like this:

if(expected_value == 0.0) then
  tolerance = epsilon(expected_value) * 2.0 ! two bits from 0.0
else
  tolerance = abs(expected_value) * epsilon(expected_value) * 2.0 ! two bits, from expected_value; abs keeps the tolerance positive
endif

The above code does not handle NaNs, nor huge or tiny values.
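A minimal sketch of how this tolerance could be wrapped in a reusable comparison that also rejects NaNs. The module name, the function name `within_tolerance` and the `nbits` argument are hypothetical, introduced only for this example; huge and tiny values are still not treated specially.

```fortran
module compare_mod
   use, intrinsic :: ieee_arithmetic, only: ieee_is_nan
   implicit none
contains
   ! Return .true. when actual_value agrees with expected_value to within
   ! 2**nbits units of relative machine precision.
   logical function within_tolerance(expected_value, actual_value, nbits) result(ok)
      double precision, intent(in) :: expected_value, actual_value
      integer, intent(in) :: nbits
      double precision :: tolerance

      ! A NaN never compares as "within tolerance".
      if (ieee_is_nan(expected_value) .or. ieee_is_nan(actual_value)) then
         ok = .false.
         return
      end if

      if (expected_value == 0.0d0) then
         tolerance = epsilon(expected_value) * 2.0d0**nbits
      else
         tolerance = abs(expected_value) * epsilon(expected_value) * 2.0d0**nbits
      end if
      ok = abs(actual_value - expected_value) <= tolerance
   end function within_tolerance
end module compare_mod

program demo_tolerance
   use compare_mod
   implicit none
   double precision :: x

   ! 0.1 + 0.2 differs from 0.3 only by roundoff.
   x = 0.1d0 + 0.2d0
   print *, 'roundoff only   :', within_tolerance(0.3d0, x, 2)
   ! A difference in the fourth decimal place is a real difference.
   print *, 'real difference :', within_tolerance(0.3d0, 0.3001d0, 2)
end program demo_tolerance
```

When comparing values after millions of accumulated load steps, a larger `nbits` is probably needed than for a single operation, since roundoff differences grow along the way.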

Once you have the general location of where/when the divergence occurs, you narrow the search with additional points to log and compare

Jim Dempsey

JohnNichols · ‎11-28-2023

Start slow. Are all the numbers in the two output matrices from the different computers different, or just some?

If it is all, are they different by the same amount, do they scale relative to one another, is there some feature that pops out?

I would use Jim's idea: populate the code with write statements to a log file and then trace some number through the system, looking for a change between the two computers. A million lines of code with 100 write statements breaks down into blocks of 10,000 lines.

You are essentially using bisection: this half of the code is OK, this half is not; then halve the half that is not, and repeat until you find the error, usually quickly.
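The trace-and-halve approach could be instrumented roughly as follows. This is a sketch only: the module `trace_mod`, its routines, the labels and the file name `trace_pc1.log` are all invented for illustration. Each trace point writes a label, the step number and a checksum of a matrix, both in decimal and as raw hexadecimal bits, so the log files from the two computers can be compared with any text diff tool.

```fortran
module trace_mod
   use, intrinsic :: iso_fortran_env, only: int64, real64
   implicit none
   integer :: trace_unit = -1
contains
   subroutine trace_open(filename)
      character(len=*), intent(in) :: filename
      open(newunit=trace_unit, file=filename, status='replace', action='write')
   end subroutine trace_open

   ! Log one line per call: label, step, checksum, checksum bit pattern.
   subroutine trace_point(label, step, a)
      character(len=*), intent(in) :: label
      integer, intent(in) :: step
      real(real64), intent(in) :: a(:,:)
      real(real64) :: s

      ! The checksum is only an indicator: it can itself pick up
      ! roundoff differences, so small mismatches need a closer look.
      s = sum(abs(a))
      write(trace_unit, '(a,1x,i0,1x,es24.16e3,1x,z16.16)') &
         label, step, s, transfer(s, 0_int64)
   end subroutine trace_point

   subroutine trace_close()
      close(trace_unit)
   end subroutine trace_close
end module trace_mod

program demo_trace
   use trace_mod
   implicit none
   real(real64) :: a(4,4)
   integer :: step

   a = 1.0_real64
   call trace_open('trace_pc1.log')
   do step = 1, 5
      call trace_point('before_update', step, a)
      a = a * 1.1_real64          ! stand-in for the real computation
      call trace_point('after_update', step, a)
   end do
   call trace_close()
end program demo_trace
```

The first line at which the two logs differ brackets the divergence; more trace points can then be added inside that bracket and the run repeated.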

If you are sending a matrix to the subroutine, dump it on both computers as a simple text file (one line per entry is fine), then dump the returned value, and post them; then we have a chance to look at them on different computers.

You will then give Intel and these fine lads and lassies a chance to look for the problem. If it is a subroutine problem, which given the age of the code is not a likely failure mode, then Intel will be interested; otherwise it is a long slog of pain.

eliopoulos · ‎11-28-2023

I am thinking of providing the program if it would help, but I would not like to do it in public for anyone to get. Is there an e-mail address I could send it to?