
MKL reproducibility

Matthias_L_
Beginner

Is there any way to get deterministic results from MKL sgemm/dgemm (even if that is much slower)?

What I mean is the following: when I run many dgemm or sgemm calls on the same input data, I tend to see minor numerical differences. While small in themselves, they can become quite significant when back-propagating through a very deep neural network (>20 layers). And they are significantly larger than with competing linear algebra packages.

Let me show you what I mean. I instantiated my network twice and initialized both instances with the same parameters. The following tables list the differences between the gradients derived from these two instances. (Each number in the table represents the gradients for an entire parameter bucket.)

Parameters (MKL)

MKL_NUM_THREADS=1
OMP_NUM_THREADS=1
MKL_DYNAMIC=FALSE
OMP_DYNAMIC=FALSE
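
For reference, the same single-threaded setup can also be forced programmatically through MKL's service functions. This is only a minimal sketch under the assumption that mkl.h is available; the matrices and sizes are placeholders, and the environment variables above achieve the same effect:

#include <mkl.h>

int main(void)
{
    mkl_set_num_threads(1);   /* equivalent to MKL_NUM_THREADS=1 */
    mkl_set_dynamic(0);       /* equivalent to MKL_DYNAMIC=FALSE */

    /* Minimal dgemm: C = 1.0 * A * B + 0.0 * C, 2x2 row-major. */
    double A[4] = {1, 2, 3, 4}, B[4] = {5, 6, 7, 8}, C[4] = {0};
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2, 1.0, A, 2, B, 2, 0.0, C, 2);
    return 0;
}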

Results (MKL; confirmed single-threaded using MKL_VERBOSE=1)

min-diff: (0,5) -> -2.43985e-07, (0,10) -> -6.88851e-07, (0,15) -> -1.08151e-06, (0,20) -> -2.29150e-07, (0,25) -> -7.78865e-06, (0,30) -> -2.22526e-07, (0,35) -> -2.00457e-05, (0,40) -> -6.31442e-07, (0,45) -> -3.53903e-08, (0,50) -> -1.33878e-09, (0,55) -> -3.72529e-09, (0,60) -> -4.65661e-10, (0,65) -> -1.86265e-09, (0,70) -> -2.32831e-09, (0,75) -> -1.16415e-10, (0,80) -> -1.86265e-08
max-diff: (0,5) ->  3.52116e-07, (0,10) ->  6.34780e-07, (0,15) ->  9.27335e-07, (0,20) ->  2.05655e-07, (0,25) ->  6.20843e-06, (0,30) ->  2.58158e-07, (0,35) ->  2.12293e-05, (0,40) ->  6.60219e-07, (0,45) ->  2.79397e-08, (0,50) ->  1.16415e-09, (0,55) ->  5.87897e-09, (0,60) ->  5.23869e-10, (0,65) ->  1.86265e-09, (0,70) ->  2.56114e-09, (0,75) ->  1.16415e-10, (0,80) ->  1.86265e-08
rel-diff: (0,5) ->  1.70455e-03, (0,10) ->  2.38793e-03, (0,15) ->  1.39107e-03, (0,20) ->  2.02584e-03, (0,25) ->  6.83717e-04, (0,30) ->  9.16173e-04, (0,35) ->  1.73014e-04, (0,40) ->  1.49317e-04, (0,45) ->  2.10977e-07, (0,50) ->  2.14790e-07, (0,55) ->  6.37089e-08, (0,60) ->  8.91096e-08, (0,65) ->  7.81675e-09, (0,70) ->  1.67285e-07, (0,75) ->  3.78540e-10, (0,80) ->  1.72134e-07

min-diff =  min(A - B)

max-diff =  max(A - B)

rel-diff = norm(A - B) / norm(A + B)
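
For clarity, here is a minimal sketch of how these metrics could be computed over two gradient buffers; the helper name and the use of the Euclidean norm are my assumptions, since the post does not show its actual comparison code:

#include <math.h>
#include <stddef.h>

/* Compares two gradient buffers A and B of length n (n >= 1):
 *   min-diff = min(A - B), max-diff = max(A - B),
 *   rel-diff = norm(A - B) / norm(A + B)  (Euclidean norm assumed). */
static void grad_diff(const float *A, const float *B, size_t n,
                      float *min_diff, float *max_diff, float *rel_diff)
{
    double norm_sub = 0.0, norm_add = 0.0;
    *min_diff = *max_diff = A[0] - B[0];
    for (size_t i = 0; i < n; ++i) {
        float d = A[i] - B[i];
        if (d < *min_diff) *min_diff = d;
        if (d > *max_diff) *max_diff = d;
        norm_sub += (double)d * d;
        double s = (double)A[i] + (double)B[i];
        norm_add += s * s;
    }
    *rel_diff = (float)(sqrt(norm_sub) / sqrt(norm_add));
}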

If I link exactly the same application against the current stable OpenBLAS release, compiled for single threading, I get the following:

Parameters (OpenBLAS)

make BINARY=64 TARGET=SANDYBRIDGE USE_THREAD=0 MAX_STACK_ALLOC=2048

Results OpenBLAS (single threaded)

min-diff: (0,5) -> 0.00000e+00, (0,10) -> 0.00000e+00, (0,15) -> 0.00000e+00, (0,20) -> 0.00000e+00, (0,25) -> 0.00000e+00, (0,30) -> 0.00000e+00, (0,35) -> 0.00000e+00, (0,40) -> 0.00000e+00, (0,45) -> 0.00000e+00, (0,50) -> 0.00000e+00, (0,55) -> 0.00000e+00, (0,60) -> 0.00000e+00, (0,65) -> 0.00000e+00, (0,70) -> 0.00000e+00, (0,75) -> 0.00000e+00, (0,80) -> 0.00000e+00
max-diff: (0,5) -> 0.00000e+00, (0,10) -> 0.00000e+00, (0,15) -> 0.00000e+00, (0,20) -> 0.00000e+00, (0,25) -> 0.00000e+00, (0,30) -> 0.00000e+00, (0,35) -> 0.00000e+00, (0,40) -> 0.00000e+00, (0,45) -> 0.00000e+00, (0,50) -> 0.00000e+00, (0,55) -> 0.00000e+00, (0,60) -> 0.00000e+00, (0,65) -> 0.00000e+00, (0,70) -> 0.00000e+00, (0,75) -> 0.00000e+00, (0,80) -> 0.00000e+00
rel-diff: (0,5) -> 0.00000e+00, (0,10) -> 0.00000e+00, (0,15) -> 0.00000e+00, (0,20) -> 0.00000e+00, (0,25) -> 0.00000e+00, (0,30) -> 0.00000e+00, (0,35) -> 0.00000e+00, (0,40) -> 0.00000e+00, (0,45) -> 0.00000e+00, (0,50) -> 0.00000e+00, (0,55) -> 0.00000e+00, (0,60) -> 0.00000e+00, (0,65) -> 0.00000e+00, (0,70) -> 0.00000e+00, (0,75) -> 0.00000e+00, (0,80) -> 0.00000e+00

This is exactly what I would expect: since there is no multi-threading, the same operations should happen in the same order given the same data.

Now, just for fun and because my software can do it, I replaced the BLAS calls with matching modules built on cuDNN and cuBLAS (NVIDIA CUDA). Please note that, unlike OpenBLAS and MKL, this does not exercise the same code path.

Results cuDNN + cuBLAS (multi-threaded)

min-diff: (0,5) -> -3.63798e-11, (0,10) -> -1.45519e-10, (0,15) -> -1.96451e-10, (0,20) -> -4.36557e-11, (0,25) -> -1.39698e-09, (0,30) -> -8.00355e-11, (0,35) -> -3.25963e-09, (0,40) -> -2.32831e-10, (0,45) -> -3.72529e-09, (0,50) -> -2.32831e-10, (0,55) -> 0.00000e+00, (0,60) -> 0.00000e+00, (0,65) -> 0.00000e+00, (0,70) -> 0.00000e+00, (0,75) -> 0.00000e+00, (0,80) -> 0.00000e+00
max-diff: (0,5) ->  2.91038e-11, (0,10) ->  1.40062e-10, (0,15) ->  2.18279e-10, (0,20) ->  4.72937e-11, (0,25) ->  9.31323e-10, (0,30) ->  1.01863e-10, (0,35) ->  2.79397e-09, (0,40) ->  2.32831e-10, (0,45) ->  1.86265e-09, (0,50) ->  2.91038e-10, (0,55) -> 0.00000e+00, (0,60) -> 0.00000e+00, (0,65) -> 0.00000e+00, (0,70) -> 0.00000e+00, (0,75) -> 0.00000e+00, (0,80) -> 0.00000e+00
rel-diff: (0,5) ->  2.06397e-07, (0,10) ->  5.70014e-07, (0,15) ->  2.27241e-07, (0,20) ->  3.68574e-07, (0,25) ->  1.57384e-07, (0,30) ->  2.85175e-07, (0,35) ->  8.01234e-08, (0,40) ->  1.15262e-07, (0,45) ->  1.21201e-08, (0,50) ->  1.45475e-08, (0,55) -> 0.00000e+00, (0,60) -> 0.00000e+00, (0,65) -> 0.00000e+00, (0,70) -> 0.00000e+00, (0,75) -> 0.00000e+00, (0,80) -> 0.00000e+00

As you can see, I get reproducible results for the last layers (right-hand side; fully connected layers = large matrix multiplications). For the other layers (left-hand side; convolution layers = many small matrix multiplications) we see small differences (the CUDA documentation suggests this is to be expected, given how work is scheduled in the underlying multi-threading implementation). Even so, the differences are of much smaller magnitude than with MKL on the CPU.

Question:

Given that I want reproducibility, how can I configure MKL to produce identical, or at least more similar, results when it is invoked several times with the same data?

Gennady_F_Intel
Moderator

Yes, it is possible to get bit-to-bit identical results with MKL. Please have a look at the CNR (Conditional Numerical Reproducibility) feature in the MKL reference manual; this KB article is also a useful introduction for beginners: https://software.intel.com/en-us/articles/introduction-to-the-conditional-numerical-reproducibility-cnr

--Gennady  
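
For anyone finding this thread later: CNR can be enabled either via the MKL_CBWR environment variable (e.g. MKL_CBWR=AUTO) or programmatically with mkl_cbwr_set(), which must be called before any other MKL function. A minimal sketch, assuming MKL 11.0 or later:

#include <stdio.h>
#include <mkl.h>

int main(void)
{
    /* Must be the first MKL call in the process; returns
     * MKL_CBWR_SUCCESS on success. */
    if (mkl_cbwr_set(MKL_CBWR_AUTO) != MKL_CBWR_SUCCESS) {
        fprintf(stderr, "could not enable CNR\n");
        return 1;
    }
    /* All subsequent sgemm/dgemm calls now take a fixed,
     * reproducible code path on this CPU. */
    return 0;
}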

Matthias_L_
Beginner

Gennady Fedorov (Intel) wrote:

Yes, it is possible to get bit-to-bit identical results with MKL. Please have a look at the CNR (Conditional Numerical Reproducibility) feature in the MKL reference manual; this KB article is also a useful introduction for beginners: https://software.intel.com/en-us/articles/introduction-to-the-conditiona...

--Gennady  

 

Thanks for the quick reply. Setting the environment variable described in the linked article does the trick; differences between two consecutive calls are now zero-point-zero. However, may I ask something else? Am I correct in assuming that MKL_CBWR=AUTO forces the code path that generally fits my CPU best, without considering properties (e.g. matrix size) of the individual dgemm calls?

Hence, on a Sandy/Ivy Bridge CPU this will lock onto the AVX code path, while on a Nehalem CPU it will lock onto an SSE4 path, regardless of whether some internal runtime assessment would otherwise have decided (for whatever reason) that SSE2 is more suitable for an individual matrix multiplication. Right?
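
One way to check which branch AUTO actually resolves to on a given machine would be mkl_cbwr_get_auto_branch(); a minimal sketch (the returned id matches one of the MKL_CBWR_* constants in mkl.h):

#include <stdio.h>
#include <mkl.h>

int main(void)
{
    /* Reports the code branch that MKL_CBWR=AUTO would lock onto
     * for the CPU this runs on. */
    int branch = mkl_cbwr_get_auto_branch();
    printf("AUTO branch id: %d (MKL_CBWR_AVX = %d, MKL_CBWR_SSE4_2 = %d)\n",
           branch, MKL_CBWR_AVX, MKL_CBWR_SSE4_2);
    return 0;
}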

TimP
Honored Contributor III

I don't see what you have in mind for the possibility that SSE2 might be "more suitable" than SSE4, although it might perform better than "expected" if the CPU supports AVX.

One might think that MKL could use SSE3 instructions in some cases of complex arithmetic where they might perform better than AVX, although the Intel compilers don't prefer SSE3 in such cases, even with the advantage of better estimation of problem size.

Surely you could test interesting settings of MKL_CBWR to see if any have performance advantages for you.
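
A sketch of such a test: select the CNR branch from the command line, set it before the first MKL call, and time a representative dgemm. The matrix size and iteration count here are placeholders; dsecnd() is MKL's wall-clock timer:

#include <stdio.h>
#include <string.h>
#include <mkl.h>

int main(int argc, char **argv)
{
    /* Choose the CNR branch before any other MKL call. */
    int branch = MKL_CBWR_AUTO;
    if (argc > 1 && strcmp(argv[1], "compatible") == 0)
        branch = MKL_CBWR_COMPATIBLE;
    else if (argc > 1 && strcmp(argv[1], "avx") == 0)
        branch = MKL_CBWR_AVX;
    mkl_cbwr_set(branch);

    const int n = 1024;                      /* placeholder size */
    double *A = (double *)mkl_malloc(sizeof(double) * n * n, 64);
    double *B = (double *)mkl_malloc(sizeof(double) * n * n, 64);
    double *C = (double *)mkl_malloc(sizeof(double) * n * n, 64);
    for (int i = 0; i < n * n; ++i) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

    double t0 = dsecnd();                    /* MKL wall-clock timer */
    for (int it = 0; it < 10; ++it)
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, A, n, B, n, 0.0, C, n);
    printf("%s: %.3f s\n", argc > 1 ? argv[1] : "auto", dsecnd() - t0);

    mkl_free(A); mkl_free(B); mkl_free(C);
    return 0;
}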

Matthias_L_
Beginner

Tim Prince wrote:

I don't see what you have in mind for the possibility that SSE2 might be "more suitable" than SSE4, although it might perform better than "expected" if the CPU supports AVX.

One might think that MKL could use SSE3 instructions in some cases of complex arithmetic where they might perform better than AVX, although the Intel compilers don't prefer SSE3 in such cases, even with the advantage of better estimation of problem size.

Surely you could test interesting settings of MKL_CBWR to see if any have performance advantages for you.

That makes sense; silly me for not thinking my statement through. The only legitimate use of a setting other than AUTO, then, seems to be compatibility with past computation results, especially since AUTO already gives you more or less the best setting for your current configuration anyway. Of course, if one runs MKL-enabled software on an AMD CPU, this might be a completely different story; I would not be surprised if AUTO selected COMPATIBLE by default on such a setup. But I doubt that this is a problem MKL users face frequently.
