chdthanh
Beginner
177 Views

PARDISO results are not identical across repeated runs

Dear all,
I am using the PARDISO solver of MKL 10.0. I set default parameters and mtype = -2. Each time I run my program, PARDISO gives slightly different results, while I think the results should be identical. For example:

8.007956138556888E-4 vs.
8.007956138556876E-4

or

6.178758439680843E-4 vs.
6.178758439680833E-4

Could you give me reasons why this slight difference occurs?

Thank you in advance.
Thanh
0 Kudos
29 Replies
Andrew_Smith
New Contributor II
144 Views

I was told it was because the multi-threaded solution may complete its work in a different order each time we solve. Seems like a lame excuse to me, and I still suspect some other issue.

In my application it makes PARDISO difficult to use. The changes in results are within the expected solution tolerance, but since PARDISO cannot tell us the expected tolerance, we have to use another solver to check! Hence PARDISO is redundant. I am still stuck using the old IMSL sparse solver.

Shane_S_Intel
Employee
144 Views

While identical results from run to run are nice, array alignment and how the algorithm is parallelized may influence whether the library function returns the same result run to run. Dynamic parallelism makes that even more of a challenge: your tasks may be executed in a different order from run to run, and that different order of operations influences the final result. Take the simple example of adding the double-precision numbers 1 + (-1) + 2^-56 in different orders. (1 + (-1)) + 2^-56 = 2^-56, while 1 + ((-1) + 2^-56) = 0 due to the IEEE rounding which occurs with each floating-point operation. 2^-56 is very different from 0, but from a computer-arithmetic point of view both are equally correct.

The online MKL user's manual has a section where this topic is discussed and which describes actions you can take to better assure identical results run to run. It sometimes even requires running serially (on a single thread), which may negatively impact performance. A relative error in the range of 10^(-13) or 10^(-14) may be completely reasonable for the problem you are running, the number of threads you are executing on, and the alignment of your input data.
chdthanh
Beginner
144 Views

To Andrew Smith: In your experience, is the old IMSL sparse solver faster than the PARDISO solver? Thanks
chdthanh
Beginner
144 Views

To Shane Story: Thanks for your explanation. Could you please point me to the section of the MKL user's manual you mention?
Andrew_Smith
New Contributor II
144 Views

It varies with the problem, but I would say PARDISO is faster even when run single-threaded.
Andrew_Smith
New Contributor II
144 Views

To clarify, I was comparing PARDISO with DLSLXG.
Shane_S_Intel
Employee
144 Views

See the online Intel MKL Linux User's Guide (PDF format), section 8-1.
abhimodak
New Contributor I
144 Views

Hi

We were burnt by this issue (not by PARDISO but by other solvers in MKL) and had to resort to making sure that the array partitions are aligned correctly.

We have found the MKL documentation not always helpful; however, the array alignment issue in particular is discussed in chapter 6 of the current (MKL with compiler 11.1.065) user's guide.

On a different note, I think that a difference on the order of the absolute tolerance (for example, by rule of thumb 1e-14 when using Real(8) numbers) should not affect the results from an engineering point of view. If it does, it indicates that the system of equations is badly scaled, or that you need to rewrite the equations, or that you may want to introduce stringent flags, such as fp:precise, for floating-point operations. We did find such places in our programs and have been more watchful since. I suspect dynamic parallelization will place even more demands on writing programs.

Abhi
Andrew_Smith
New Contributor II
144 Views

So we have to supply PARDISO with a 16-byte-aligned workspace.

How do we supply a 16-byte-aligned workspace to PARDISO?
abhimodak
New Contributor I
144 Views

I think there is mkl_malloc, which may be used.

However, we used a brute-force approach, i.e. we allocated N+1 elements and pass the section 1:N or 2:N+1 as required. We use the intrinsic function LOC and MOD to see whether the 16-byte alignment is there or not.

Abhi
Steven_L_Intel1
Employee
144 Views

In Fortran, you can use !DEC$ ATTRIBUTES ALIGN to specify alignment for a variable.
abhimodak
New Contributor I
144 Views

Hi Steve

I think you meant "in Intel Fortran". We were suspecting/fearful of similar issues with libraries on the other "odd" platforms we support and hence did not use any directives.

Abhi
Andrew_Smith
New Contributor II
144 Views

My inputs to PARDISO are components of a derived type. I specified /align:rec16byte for the file containing the derived-type definition, but I still got random result changes from run to run. I will try LOC to see if alignment is achieved.
Andrew_Smith
New Contributor II
144 Views

I see from using LOC that the derived-type variables are not aligned on 16-byte boundaries when using /align:rec16byte.

Then I tried creating local copies of the data using mkl_malloc, but the array descriptor of my local variable still said it had only one element, and assigning the copy failed, e.g.

integer p
real*8 d(1)
pointer (p, d)
p = mkl_malloc(1000, 16)
d = myComponent%d

! Now d is only assigned the first value
Steven_L_Intel1
Employee
144 Views

/align:rec16byte controls the relative offset of fields in a derived type (or RECORD) from the base - it does not specify where the whole variable is allocated. In other words, /align:rec16byte may insert padding inside the structure to realign fields, but that does not help if the structure base itself is not 16-byte aligned.

In Intel Fortran, one uses

!DEC$ ATTRIBUTES ALIGN:16 :: varname

to specify the base alignment of individual variables. These can be local variables or ALLOCATABLE. For example, to declare an array of REAL(8) so that the base is aligned 16-bytes, one could use:

REAL(8), ALLOCATABLE :: X(:)
!DEC$ ATTRIBUTES ALIGN:16 :: X
ALLOCATE (X(10000))

This assumes you use Fortran ALLOCATE - variables allocated by other methods, such as mkl_malloc, are not affected by the directive.

Andrew, in the source you provided, there is no way for the compiler to know how big d is. You declared it as one element (it would probably be better to use (*)). Assuming you had done so, you could perhaps have written:

d(1:1000) = myComponent%d

but it would be far better if you made d an ALLOCATABLE array and allocated it to the proper size. Then everything works out for size and, if you choose, alignment.
Andrew_Smith
New Contributor II
144 Views

I tried using local variables with !DEC$ ATTRIBUTES ALIGN:16 :: varname as you suggested, Steve.

I did this for all real and integer variables. I used stack variables rather than allocatables, but I confirmed 16-byte alignment.

I did not do alignment for the control variable: type(MKL_PARDISO_HANDLE) :: mklMemoryPointers(64)

I still do not get the same answers run to run from PARDISO with multithreading.

I then also copied mklMemoryPointers to a local aligned variable, and still no luck.

Anybody else got this to work?

I could develop a test program with input data files if that would help.
basel
Beginner
144 Views

Andrew,

The MKL version of PARDISO is not able to compute identical answers with multithreading. Our new version, PARDISO 4.0, from www.pardiso-project.org is the only parallel sparse direct solver that has this kind of functionality. Please take a look at the webpage. This option is currently only available for symmetric indefinite matrices, and we hope to add this feature for all matrix types.

Olaf Schenk
Shane_S_Intel
Employee
144 Views

Hi Olaf, how does this impact performance when comparing it to dynamically scheduled parallel tasks? And are you calling only sequential MKL BLAS from within your algorithm?
basel
Beginner
144 Views

>Hi Olaf, how does this impact performance when comparing it to dynamically scheduled parallel tasks?

It is a complete redesign of the scheduling in the numerical factorization. The aim was to improve threading scalability compared to PARDISO 3.3 and, at the same time, to get identical results on multicores.

>And are you calling only sequential MKL BLAS from within your algorithm?
We are using sequential MKL BLAS in each thread because the multi-threaded MKL BLAS cannot compute identical results.

barragan_villanueva_
Valued Contributor I
73 Views

Hi,

Here is just a quote related to IEEE 754:

Floating point arithmetic is not associative. This means that in general, for floating point numbers x, y, and z:

(x + y) + z != x + (y + z)

(x * y) * z != x * (y * z)

Floating point arithmetic is also not distributive. This means that in general:

x * (y + z) != (x * y) + (x * z)

In short, the order in which operations are carried out can change the output of a floating point calculation. This is important in numerical analysis since two mathematically equivalent formulas may not produce the same numerical output, and one may be substantially more accurate than the other.

Thus, bit-to-bit identical results can be computed if and only if all operations are executed in a predefined order. It means that the compiler must not change this order through any optimizations.

Also, it's important to say that multithreaded calculations cannot guarantee the order of operations, because they are mostly dynamically scheduled, even though each individual thread performs its own calculations in order. BTW, OpenMP cannot guarantee that a parallel section will use the same team of threads on a second run.

Finally, I'd add the following statement from "What Every Computer Scientist Should Know About Floating-Point Arithmetic" by David Goldberg, published in the March 1991 issue of Computing Surveys:

Many programmers like to believe that they can understand the behavior of a program and prove that it will work correctly without reference to the compiler that compiles it or the computer that runs it. In many ways, supporting this belief is a worthwhile goal for the designers of computer systems and programming languages.

In my opinion, instead of requiring identical results from every calculation, each application should try to prove that the algorithm used is stable, and that the computed deviation has been validated and estimated, to be sure that the results are correct.
