
Eric_L_2 · ‎04-25-2016

I have found a bug in Parallel Studio 16.0.2: computing the SVD with GESDD through the Python package SciPy raises an error. It can be reproduced on an MKL-built SciPy with this array, which is finite (contains no NaN or inf):

>>> import numpy as np
>>> from scipy import linalg
>>> linalg.svd(np.load('fail.npy'), full_matrices=False)
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/larsoner/.local/lib/python2.7/site-packages/scipy/linalg/decomp_svd.py", line 119, in svd
    raise LinAlgError("SVD did not converge")
numpy.linalg.linalg.LinAlgError: SVD did not converge

I am curious if anyone has insight into why this fails, or can reproduce it themselves. I do have access to older MKL routines so if it's helpful I could see if I get the error elsewhere, too.

I have tried this with MKL-enabled Anaconda, and it does not fail, although I do experience similar failures with other arrays with the Anaconda version, which seem to only happen on systems with SSE4.2 but no AVX extensions.

I recently worked on SciPy's SVD routines to add a wrapper for a GESVD backend (to complement the existing GESDD routine) here, and this command passes on bleeding-edge SciPy, so it does seem to be a problem with the GESDD implementation specifically:

>>> linalg.svd(np.load('fail.npy'), full_matrices=False, lapack_driver='gesvd')
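For anyone who needs a stopgap while GESDD misbehaves, the GESVD fallback can be wrapped in a small helper. This is only a sketch: `svd_with_fallback` is a made-up name, it assumes a SciPy version that accepts `lapack_driver`, and the random matrix merely stands in for a real input such as the arrays attached to this thread.

```python
import numpy as np
from scipy import linalg


def svd_with_fallback(a):
    """Try the GESDD driver first, then retry with GESVD if it fails.

    Hypothetical helper for illustration; not part of SciPy.
    """
    a = np.asarray(a)
    # Rule out the trivial cause of a failed SVD before blaming the driver.
    if not np.isfinite(a).all():
        raise ValueError("input contains NaN or inf")
    try:
        return linalg.svd(a, full_matrices=False, lapack_driver="gesdd")
    except linalg.LinAlgError:
        # GESDD reported non-convergence; retry with the GESVD driver.
        return linalg.svd(a, full_matrices=False, lapack_driver="gesvd")


# Stand-in input (assumption); replace with np.load('fail.npy') to test.
rng = np.random.default_rng(0)
a = rng.standard_normal((20, 10))
u, s, vt = svd_with_fallback(a)
```

The fallback only helps when GESDD raises `LinAlgError`; it would not catch a crash such as a segmentation fault inside the library.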

Eric_L_2 · ‎05-12-2016

Found another matrix that fails on my system with SSE4.2 only (not AVX). Oddly, it does not fail on Anaconda, but it does fail on my self-built NumPy/SciPy stack with the latest release of Parallel Studio and MKL. One difference between Anaconda and my build is that Anaconda uses gfortran rather than ifort (if that matters). Uploaded as bad_sse42_2.npy.

Eric_L_2 · ‎09-07-2016

I can confirm that with the latest release "icc (ICC) 17.0.0 20160721" I still have failures on my SSE42 machine with:

fail.npy
bad_sse42_3.npy

But it now works with:

bad.npy
bad_sse42_2.npy

So that's some progress at least. It's not easy for me to re-test the AVX failure (bad_avx.npy) so I'm not sure what the status is like there.
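A pass/fail summary like the one above can be produced with a short loop over the saved arrays. The following is a rough sketch, not the script actually used here; `check_arrays` is a hypothetical name and the file names must exist locally.

```python
import numpy as np
from scipy import linalg


def check_arrays(paths, lapack_driver="gesdd"):
    """Return {path: bool}, where True means the SVD converged.

    Hypothetical helper for illustration only.
    """
    results = {}
    for path in paths:
        a = np.load(path)
        try:
            linalg.svd(a, full_matrices=False, lapack_driver=lapack_driver)
            results[path] = True
        except linalg.LinAlgError:
            # SciPy raises LinAlgError("SVD did not converge") here.
            results[path] = False
    return results
```

With the attachments downloaded, something like `check_arrays(['fail.npy', 'bad_sse42_3.npy', 'bad.npy', 'bad_sse42_2.npy'])` would report which ones still fail, and passing `lapack_driver='gesvd'` repeats the check with the other driver.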

Eric_L_2 · ‎09-21-2016

I can confirm the same behavior is happening on the 2017.0.098 MKL update, namely that SVD with "bad.npy" and "bad_sse42_2.npy" still fail on my system.

However, I noticed in testing that it only fails most of the time. In maybe 1/5 or 1/10 cases, it will actually pass. This suggests to me there is possibly some memory problem going on, where the code is overwriting some memory that it shouldn't, and it only causes problems some of the time. But I should stress that even though I'm compiling my own numpy/scipy stack, most of the other folks I know who hit this issue use Anaconda Python, which gets compiled elsewhere -- so I don't think it's an issue specific to only my setup.
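To put a number on "most of the time", the same call can be repeated and the failures counted. This is a minimal sketch with a made-up helper name, `count_svd_failures`. It repeats the call inside one interpreter, which is an assumption: the pass rate quoted above may have come from separate runs of Python, and the two need not behave the same.

```python
import numpy as np
from scipy import linalg


def count_svd_failures(a, n_trials=20):
    """Run the default (GESDD) SVD n_trials times and count LinAlgErrors.

    Hypothetical helper for illustration only.
    """
    failures = 0
    for _ in range(n_trials):
        try:
            # Pass a fresh copy each time so every trial starts from
            # identical input data in newly allocated memory.
            linalg.svd(a.copy(), full_matrices=False)
        except linalg.LinAlgError:
            failures += 1
    return failures
```

For example, `count_svd_failures(np.load('bad.npy'), n_trials=50)` returns how many of 50 attempts failed to converge.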

Eric_L_2 · ‎11-18-2016

Just wanted to mention that the problems persist with the latest Parallel Studio-compiled version (17.0.1 20161005), but now it only happens with `fail.npy` and `bad_sse42_2.npy`, and failures no longer seem random but instead consistent.

Ying_H_Intel · ‎11-20-2016

Hi Eric,

Thanks for the update.

I just updated another gesdd issue, https://software.intel.com/en-us/forums/intel-math-kernel-library/topic/675058, so I am updating here too.

The zgesdd routine will cause an access violation (segmentation fault) for certain matrix sizes.

There is a bug in gesdd (insufficient size of the rwork array). The fix is targeted for the next version (expected to be 2017 update 2).

Let's wait and see whether 2017 update 2 works.

Thanks

Ying

Eric_L_2 · ‎11-21-2016

Sure, I'll try it as soon as it comes out.

Eric_L_2 · ‎02-24-2017

I just updated to 2017 update 2 (icc --version 17.0.2 20170213), and unfortunately the same behavior exists (fail.npy and bad_sse42_2.npy both fail to converge).

Eric_L_2 · ‎03-09-2017

My motherboard died, so I replaced my CPU / motherboard combo with an i7-7700K Kaby Lake. I rebuilt NumPy and SciPy to take advantage of newer extensions, and all of the old matrices passed on this architecture.

However, I quickly found a new example that fails, which I have uploaded as bad_kabylake.npy. Does anyone have this CPU to try to replicate? I'm on the latest version (2017 update 2), compiled NumPy and SciPy from source, and did:

scipy.linalg.svd(np.load('bad_kabylake.npy'), full_matrices=False)

I also just tested this on Anaconda Python and get the same failure, so that should hopefully further reduce the difficulty of testing / replicating.

Yaroslav_B_ · ‎05-01-2017

I'm having similar looking failures, using Anaconda Python SVD (linked against MKL).

Here's a self-contained reproducible example:

https://github.com/yaroslavvb/stuff/blob/master/svd_noconverge.py

It fails on our Xeon V3 machines, passes on Xeon V4

Xeon V3 info

processor : 31
vendor_id : GenuineIntel
cpu family : 6
model : 63
model name : Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
stepping : 2
microcode : 0x36
cpu MHz : 1214.906
cache size : 20480 KB
physical id : 1

Oleksandr_P_Intel · ‎05-02-2017

Dear Yaroslav,

Please use

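import ctypes
import numpy as np

def mklVersion():
    # Byte buffer that MKL fills with its version string.
    ver = np.zeros(199, dtype=np.uint8)
    mkl = ctypes.cdll.LoadLibrary("libmkl_rt.so")
    mkl.MKL_Get_Version_String(ver.ctypes.data_as(ctypes.c_char_p), 198)
    # Drop unused zero bytes; tobytes() replaces the deprecated tostring().
    return ver[ver != 0].tobytes()

# mklVersion()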

to find out the version of MKL installed on both the machine where it fails and where it passes.

Also please provide outputs of `conda list --explicit` in these environments.
I was not able to reproduce the failure on the slightly newer Xeon v3:

[08:18:59 linmachine tmp]$ head /proc/cpuinfo | grep 'model name'
model name      : Intel(R) Xeon(R) CPU E5-2698 v3 @ 2.30GHz

Using Intel Distribution for Python 2017 update 2, which has scipy 0.18.1 and numpy 1.11.2.

I was also unable to reproduce the problem on the same machine using Intel Distribution for Python 2017 update 1.

Thank you,
Oleksandr

Yaroslav_B_ · ‎05-02-2017

I've updated the file to print out this info: https://github.com/yaroslavvb/stuff/blob/master/svd_noconverge.py

Will try with Intel Distribution Update 2 and update this

Yaroslav_B_ · ‎05-02-2017

So, I've tried running with intelpython from Distribution Update 2, same problem.

I've tried on several different Xeon V3s and see the same issue, but I only have access to V4 and Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz machines, so I am not able to test on a newer V3.

Oleksandr_P_Intel · ‎05-02-2017

Dear Yaroslav,

I was able to reproduce the problem you are experiencing, and would like to thank you for your time to bring it to our attention.

The issue is under investigation by the MKL team. As a work-around, please try lowering the number of threads used by MKL.

Running on the same hardware, E5 2630 v3, I saw SVD converge using MKL_NUM_THREADS=15

$ MKL_NUM_THREADS=15 python svd_nonconverge.py
Read 2458624 bytes
Success

Best regards,
Oleksandr

Eric_L_2 · ‎05-02-2017

Although Yaroslav's array works fine on my machine (Kabylake i7-7700K), I can confirm that my previously reported (see post above from March 9) failure file (bad_kabylake.npy) fails with MKL_NUM_THREADS=4 but passes with MKL_NUM_THREADS=3.
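The thread-count workaround can also be applied from inside a script rather than on the command line. A minimal sketch, assuming the variable is set before NumPy/SciPy (and therefore MKL) are first imported; the value "1" is an arbitrary conservative choice, not one taken from the reports above, and the random matrix is only a placeholder.

```python
import os

# Assumption: this runs before MKL is loaded, so it must come before the
# first import of numpy or scipy. "1" is an arbitrary conservative value.
os.environ["MKL_NUM_THREADS"] = "1"

import numpy as np
from scipy import linalg

# Placeholder input; substitute np.load('bad_kabylake.npy') to test.
a = np.random.default_rng(0).standard_normal((50, 30))
s = linalg.svd(a, full_matrices=False, compute_uv=False)
```

Since the failures reported here depend on the thread count and the CPU, the value that avoids the problem may differ per machine. On a build that is not linked against MKL the variable has no effect.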

sergio_r_ · ‎05-03-2017

Each of the following calls:
linalg.svd(np.load(arrayname), full_matrices=False)
linalg.svd(np.load(arrayname), full_matrices=False, lapack_driver='gesvd')
svd(np.load(arrayname), full_matrices=False)
with arraynames = ['fail.npy', 'bad_avx.npy', 'bad.npy', 'bad_sse42_2.npy'] and Yaroslav's array runs without error on my setup:

   CPU info
   --------
model name   : Intel(R) Core(TM) i7-4700MQ CPU @ 2.40GHz
   MKL info
   --------
b'Intel(R) Math Kernel Library Version 2018.0.0 Beta Build 20170316 for Intel(R) 64 architecture applications'

Python distribution details:
3.5.2 |Intel Corporation| (default, Mar 27 2017, 10:34:52)
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)]

Installed Python Version is: 3.5.2
Installed Numpy version is: 1.11.3
Installed Scipy version is: 0.18.1

Salut,

Sergio

Yaroslav_B_ · ‎05-04-2017

Sergio R -- you need a Xeon E5 2630 v3 to reproduce it; it runs fine even on other Xeons.

Eric_L_2 · ‎05-11-2017

The failure persists on MKL (and Parallel Studio) 2017 Update 4.

Eric_L_2 · ‎11-15-2017

I still have the failure on my KabyLake system with the latest version (icc --version 18.0.1 20171018).

Ribes__Albert · ‎02-05-2019

I could not reproduce this with any of the ndarrays posted in this thread, but I came across another one which fails on my laptop. It is this one:

problematic_array.zip

When I ran it with

OMP_NUM_THREADS=1 python

it worked OK, and it also worked when I switched to OpenBLAS.

Bug in GESDD (but not GESVD)