MKL - different result in each processor?

holysword · 05-01-2013

Hello,

I am having problems with an MPI code using Intel MKL and ifort (Composer version: 13.1.0.146). Each processor has exactly the same matrix, and each one performs some sequential operations on it. Each processor is expected to obtain exactly the same values, since they all use the same binaries and the same libraries, and the nodes are in fact identical (2 Sandy Bridge EP E5-2670 processors in each node). However, routines such as CGEMM and CGESVD produce slightly different values on each processor, a variation of the order of 1e-6~1e-8. This does not always happen, and it seems to depend on the number of processors being used.

Is this behaviour expected at all? The difference is below the machine precision (considering single precision), but aren't the individual cores supposed to perform the roundoffs in the same manner? If this behaviour is not expected, I could provide some example matrices.

Thanks in advance

TimP · 05-01-2013

Identical results would require either the same data alignment (modulo 32 bytes) on every process, or the use of the slower consistency option.

Zhang_Z_Intel · 05-01-2013

holysword wrote:

Is this behaviour expected at all? The difference is below the machine precision (considering single precision), but aren't the individual cores supposed to perform the roundoffs in the same manner?

Thank you for asking this question! MKL does have a way to guarantee identical results as long as some preconditions are met. We call this feature "Conditional Numerical Reproducibility". See here for a complete discussion on how to use this feature: http://software.intel.com/en-us/articles/conditional-numerical-reproducibility-cnr-in-intel-mkl-110

SergeyKostrov · 05-02-2013

>>...However, routines such as CGEMM and CGESVD produce slightly different values on each processor, a variation of the order of 1e-6~1e-8...

Please verify which MKL DLLs are used on both computers. For example, it is possible that mkl_def.dll is used on Computer A and mkl_avx.dll on Computer B. To get identical results, you should always verify which set of CPU-dependent DLLs (also known as Waterfall DLLs) is used on the different computers.

holysword · 05-03-2013

Thank you very much TimP, Zhang Z and Sergey Kostrov.

Setting KMP_DETERMINISTIC_REDUCTION=yes and MKL_CBWR=SSE4_2 solves the issue with no noticeable slowdown. I still compile with the same optimization flags (including -xAVX). I also tried MKL_CBWR=AVX, but that didn't work, and I wonder why: all the processors are the same (they are all EP E5-2670), and all the DLLs and libraries are the same too.

TimP · 05-03-2013

You didn't say whether you took care to set all local data passed to MKL on 32-byte boundaries (16-byte may be sufficient if you avoid AVX, but 32 may improve performance, even with SSE).

The variations you quote are consistent with single-precision vector sum reduction on arrays of differing alignment. You could check each address passed to MKL (address % 16) for consistency. If you succeed in using the non-deterministic AVX path, it may not give the same result as the "deterministic" one.

KMP_DETERMINISTIC_REDUCTION may not permit the use of AVX-256, as that could require a different blocking that is incompatible with consistent results.

SergeyKostrov · 05-03-2013

>>...You didn't say whether you took care to set all local data passed to MKL on 32-byte boundaries (16-byte may be sufficient if you avoid AVX, but 32 may improve performance, even with SSE)...

I recently ran a set of tests with the CRT malloc function (default alignment) and the MKL mkl_malloc function (which lets you set different alignments), and I didn't see any performance gain when calculating the product of two matrices using the MKL sgemm and dgemm functions.

TimP · 05-03-2013

In my tests, 32-byte alignment was of more benefit on early Core i7 CPUs, so I agree the gain may not appear on the latest CPUs.

holysword · 05-04-2013

TimP (Intel) wrote:
You didn't say whether you took care to set all local data passed to MKL on 32-byte boundaries (16-byte may be sufficient if you avoid AVX, but 32 may improve performance, even with SSE).

I am sorry, what do you mean by 32-byte boundaries? All variables are defined with the default kind (that is, just REAL, COMPLEX and INTEGER; no DOUBLE PRECISION, KIND declarations or anything of that sort).

SergeyKostrov · 05-06-2013

For example, for arrays you could try a command-line option such as: ifort /align:array32byte...