Intel® Distribution for Python*

Bug in GESDD (but not GESVD)

Eric_L_2
Beginner

I have found a bug with Parallel Studio 16.0.2: computing the SVD with GESDD through the Python package SciPy raises an error. It can be reproduced on an MKL-built SciPy with this array (fail.npy), which is finite (contains no NaN or inf), as follows:

>>> import numpy as np
>>> from scipy import linalg
>>> linalg.svd(np.load('fail.npy'), full_matrices=False)
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/larsoner/.local/lib/python2.7/site-packages/scipy/linalg/decomp_svd.py", line 119, in svd
    raise LinAlgError("SVD did not converge")
numpy.linalg.linalg.LinAlgError: SVD did not converge
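
The finiteness claim above can be checked before the SVD is attempted. This is a minimal sketch, not part of the original report; `summarize_array` is a hypothetical helper name, and the keys it returns are only an assumption about what is useful in a bug report.

```python
import numpy as np


def summarize_array(a):
    """Collect basic facts about an SVD input for a bug report."""
    a = np.asarray(a)
    return {
        "shape": a.shape,
        "dtype": str(a.dtype),
        # True only when the array has no NaN and no +/-inf entries.
        "all_finite": bool(np.isfinite(a).all()),
    }
```

For the array in this post the call would be `summarize_array(np.load('fail.npy'))`.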

I am curious if anyone has insight into why this fails, or can reproduce it themselves. I do have access to older MKL routines so if it's helpful I could see if I get the error elsewhere, too.

I have tried this with MKL-enabled Anaconda, and it does not fail, although I do experience similar failures with other arrays with the Anaconda version, which seem to only happen on systems with SSE4.2 but no AVX extensions.
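
Since the failures seem to depend on which SIMD extensions the CPU offers, it can help to record them alongside each report. The following is a rough sketch for Linux only, assuming the usual `/proc/cpuinfo` layout with a `flags` line; `simd_flags` is a hypothetical helper, not something from SciPy or MKL.

```python
def simd_flags(cpuinfo_text, wanted=("sse4_2", "avx", "avx2", "avx512f")):
    """Return which of the wanted SIMD flags appear in /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            # Everything after the colon is a space-separated flag list.
            present = set(line.split(":", 1)[1].split())
            return {flag: flag in present for flag in wanted}
    # No flags line found (for example, not an x86 Linux system).
    return {flag: False for flag in wanted}
```

On a Linux machine it could be called as `simd_flags(open('/proc/cpuinfo').read())`.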

I recently worked on SciPy's SVD routines to add a wrapper for a GESVD backend (to complement the existing GESDD routine). The following command passes on bleeding-edge SciPy, so the problem does seem to be specific to the GESDD implementation:

>>> linalg.svd(np.load('fail.npy'), full_matrices=False, lapack_driver='gesvd')
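
Until the GESDD problem is resolved, one possible workaround is to retry with GESVD when GESDD reports non-convergence. This is only a sketch: `svd_with_fallback` is a hypothetical wrapper name, and it assumes a SciPy version that already has the `lapack_driver` argument.

```python
import numpy as np
from scipy import linalg


def svd_with_fallback(a, **kwargs):
    """Try the GESDD driver first, then retry with GESVD if it fails."""
    try:
        return linalg.svd(a, lapack_driver="gesdd", **kwargs)
    except linalg.LinAlgError:
        # GESDD did not converge; GESVD is typically slower but uses a
        # different algorithm, and it passed on fail.npy above.
        return linalg.svd(a, lapack_driver="gesvd", **kwargs)
```

It takes the same keyword arguments as `linalg.svd`, for example `svd_with_fallback(np.load('fail.npy'), full_matrices=False)`.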

 

Eric_L_2
Beginner

Found another matrix that fails on my system with SSE4.2 only (not AVX). Oddly, it does not fail on Anaconda, but it does fail on my self-built NumPy/SciPy stack with the latest release of Parallel Studio and MKL. One difference between Anaconda and my build is that Anaconda doesn't use ifort; it uses gfortran (if that matters). Uploaded as bad_sse42_2.npy.

Eric_L_2
Beginner

I can confirm that with the latest release "icc (ICC) 17.0.0 20160721" I still have failures on my SSE42 machine with:

fail.npy
bad_sse42_3.npy

But it now works with:

bad.npy
bad_sse42_2.npy

So that's some progress at least. It's not easy for me to re-test the AVX failure (bad_avx.npy) so I'm not sure what the status is like there.

Eric_L_2
Beginner

I can confirm the same behavior is happening on the 2017.0.098 MKL update, namely that SVD with "bad.npy" and "bad_sse42_2.npy" still fail on my system.

However, I noticed in testing that it only fails most of the time. In maybe 1/5 or 1/10 cases, it will actually pass. This suggests to me there is possibly some memory problem going on, where the code is overwriting some memory that it shouldn't, and it only causes problems some of the time. But I should stress that even though I'm compiling my own numpy/scipy stack, most of the other folks I know who hit this issue use Anaconda Python, which gets compiled elsewhere -- so I don't think it's an issue specific to only my setup.
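
To put a number on "fails most of the time", the same call can be repeated and the failures counted. This is a minimal sketch; `count_failures` is a hypothetical helper and the default of 20 trials is arbitrary.

```python
import numpy as np
from scipy import linalg


def count_failures(a, n_trials=20):
    """Run the default (GESDD) SVD repeatedly and count non-convergence errors."""
    failures = 0
    for _ in range(n_trials):
        try:
            linalg.svd(a, full_matrices=False)
        except linalg.LinAlgError:
            failures += 1
    return failures
```

For example, `count_failures(np.load('bad.npy'), n_trials=50)` would give the number of failed runs out of 50 on a given machine.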

Eric_L_2
Beginner

Just wanted to mention that the problems persist with the latest Parallel Studio-compiled version (17.0.1 20161005), but now it only happens with `fail.npy` and `bad_sse42_2.npy`, and failures no longer seem random but instead consistent.

Ying_H_Intel
Employee

Hi Eric, 

Thanks for the update.

I just updated another gesdd issue, https://software.intel.com/en-us/forums/intel-math-kernel-library/topic/675058, so I am updating here too.

The zgesdd routine causes an access violation (segmentation fault) for certain matrix sizes.

There is a bug in gesdd (insufficient size of the rwork array). The fix is targeted for the next version (expected to be 2017 Update 2).

Let's wait and see whether 2017 Update 2 works.

Thanks

Ying

 

Eric_L_2
Beginner

Sure, I'll try it as soon as it comes out.

Eric_L_2
Beginner

I just updated to 2017 update 2 (icc --version 17.0.2 20170213), and unfortunately the same behavior exists (fail.npy and bad_sse42_2.npy both fail to converge).

Eric_L_2
Beginner

My motherboard died, so I replaced my CPU / motherboard combo with an i7-7700K Kaby Lake. I rebuilt NumPy and SciPy to take advantage of newer extensions, and all of the old matrices passed on this architecture.

However, I quickly found a new example that fails, which I have uploaded as bad_kabylake.npy. Does anyone have this CPU to try to replicate? I'm on the latest version (2017 update 2), compiled NumPy and SciPy from source, and did:

scipy.linalg.svd(np.load('bad_kabylake.npy'), full_matrices=False)

I also just tested this on Anaconda Python and get the same failure, so that should hopefully further reduce the difficulty of testing / replicating.

Yaroslav_B_
Beginner

I'm having similar looking failures, using Anaconda Python SVD (linked against MKL).

Here's a self-contained reproducible example:

https://github.com/yaroslavvb/stuff/blob/master/svd_noconverge.py

It fails on our Xeon V3 machines, passes on Xeon V4

Xeon V3 info

processor       : 31
vendor_id       : GenuineIntel
cpu family      : 6
model           : 63
model name      : Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
stepping        : 2
microcode       : 0x36
cpu MHz         : 1214.906
cache size      : 20480 KB
physical id     : 1

Oleksandr_P_Intel

Dear Yaroslav, 

Please use 

import ctypes
import numpy as np

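def mklVersion():
    # Buffer that MKL fills with its version string (NUL-padded).
    ver = np.zeros(199, dtype=np.uint8)
    # Load the MKL runtime library; it must be on the loader path.
    mkl = ctypes.cdll.LoadLibrary("libmkl_rt.so")
    mkl.MKL_Get_Version_String(ver.ctypes.data_as(ctypes.c_char_p), 198)
    # Drop the NUL padding; tobytes() replaces the deprecated tostring().
    return ver[ver != 0].tobytes()

# mklVersion()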

to find out the version of MKL installed on both the machine where it fails and where it passes.

Also please provide outputs of `conda list --explicit` in these environments.
I was not able to reproduce the failure on the slightly newer Xeon v3: 

[08:18:59 linmachine tmp]$ head /proc/cpuinfo | grep 'model name'
model name      : Intel(R) Xeon(R) CPU E5-2698 v3 @ 2.30GHz

Using Intel Distribution for Python 2017 update 2, which has scipy 0.18.1 and numpy 1.11.2. 


I was also unable to reproduce the problem on the same machine using Intel Distribution for Python 2017 update 1.
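
The remaining details asked for in this thread (interpreter, NumPy/SciPy versions, CPU model) could be gathered in one place. The following is a rough sketch, assuming a Linux-style `/proc/cpuinfo`; `environment_report` and its dictionary keys are hypothetical names chosen for illustration.

```python
import platform
import sys

import numpy as np
import scipy


def environment_report(cpuinfo_path="/proc/cpuinfo"):
    """Gather interpreter, library, and CPU details for a bug report."""
    report = {
        "python": sys.version.split()[0],
        "numpy": np.__version__,
        "scipy": scipy.__version__,
        "machine": platform.machine(),
        "cpu_model": None,
    }
    try:
        with open(cpuinfo_path) as f:
            for line in f:
                if line.startswith("model name"):
                    report["cpu_model"] = line.split(":", 1)[1].strip()
                    break
    except OSError:
        pass  # Not on Linux, or the file is unreadable; cpu_model stays None.
    return report
```

Printing `environment_report()` next to the MKL version string and the `conda list --explicit` output should cover most of what is needed to compare two machines.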

Thank you,
Oleksandr

 

Yaroslav_B_
Beginner

I've updated the file to print out this info: https://github.com/yaroslavvb/stuff/blob/master/svd_noconverge.py

Will try with Intel Distribution Update 2 and update this

Read 2458624 bytes from $url
SVD failure
SVD did not converge
--------------------------------------------------------------------------------
MKL version
b'Intel(R) Math Kernel Library Version 2017.0.1 Product Build 20161005 for Intel(R) 64 architecture applications'
--------------------------------------------------------------------------------
Conda version
# This file may be used to create an environment using:
# $ conda create --name <env> --file <this file>
# platform: linux-64
@EXPLICIT
https://repo.continuum.io/pkgs/free/linux-64/libgfortran-3.0.0-1.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/mkl-2017.0.1-0.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/numpy-1.12.1-py35_0.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/openssl-1.0.2k-1.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/pip-9.0.1-py35_1.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/python-3.5.3-1.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/readline-6.2-2.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/scipy-0.19.0-np112py35_0.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/setuptools-27.2.0-py35_0.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/sqlite-3.13.0-0.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/tk-8.5.18-0.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/wheel-0.29.0-py35_0.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/xz-5.2.2-1.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/zlib-1.2.8-3.tar.bz2
--------------------------------------------------------------------------------
CPU version
model name      : Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz

Yaroslav_B_
Beginner

I tried running with intelpython from Distribution Update 2 and hit the same problem.

I tried several different Xeon v3 machines with the same result, but I only have access to v4 and Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz machines, so I am not able to test on a newer v3.

(intel) yaroslav@2:~/temp$ intelpython ~/git0/stuff/svd_noconverge.py
Read 2458624 bytes from $url
SVD failure
SVD did not converge
--------------------------------------------------------------------------------
MKL version
b'Intel(R) Math Kernel Library Version 2017.0.2 Product Build 20170126 for Intel(R) 64 architecture applications'
--------------------------------------------------------------------------------
Conda version
# This file may be used to create an environment using:
# $ conda create --name <env> --file <this file>
# platform: linux-64
@EXPLICIT
https://repo.continuum.io/pkgs/free/linux-64/libgfortran-3.0.0-1.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/mkl-2017.0.1-0.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/numpy-1.12.1-py35_0.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/openssl-1.0.2k-1.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/pip-9.0.1-py35_1.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/python-3.5.3-1.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/readline-6.2-2.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/scipy-0.19.0-np112py35_0.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/setuptools-27.2.0-py35_0.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/sqlite-3.13.0-0.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/tk-8.5.18-0.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/wheel-0.29.0-py35_0.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/xz-5.2.2-1.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/zlib-1.2.8-3.tar.bz2
--------------------------------------------------------------------------------
CPU version
model name      : Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz

Oleksandr_P_Intel

Dear Yaroslav,

I was able to reproduce the problem you are experiencing, and would like to thank you for your time to bring it to our attention.

The issue is under investigation by the MKL team. As a work-around, please try lowering the number of threads used by MKL. 

Running on the same hardware, E5 2630 v3, I saw SVD converge using MKL_NUM_THREADS=15

$ MKL_NUM_THREADS=15 python svd_nonconverge.py
Read 2458624 bytes
Success
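
The same limit can also be applied from inside a script rather than on the command line. This is a minimal sketch, assuming the variable is set before NumPy or SciPy (and therefore MKL) is first imported; the random matrix is only a placeholder for the real input.

```python
import os

# Set the limit before NumPy/SciPy are imported, since the threading
# settings are generally read when the library is initialised.
os.environ["MKL_NUM_THREADS"] = "15"

import numpy as np
from scipy import linalg

# Placeholder input; substitute the array that fails to converge.
a = np.random.RandomState(0).randn(50, 30)
u, s, vt = linalg.svd(a, full_matrices=False)
```

If NumPy has already been imported elsewhere in the process, setting the variable this late may have no effect, so the command-line form above is the safer choice.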

Best regards,
Oleksandr

Eric_L_2
Beginner

Although Yaroslav's array works fine on my machine (Kabylake i7-7700K), I can confirm that my previously reported (see post above from March 9) failure file (bad_kabylake.npy) fails with MKL_NUM_THREADS=4 but passes with MKL_NUM_THREADS=3.
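
Because the outcome depends on the thread count, it may be worth scanning several values in fresh processes. The following is a sketch under the assumption that each child process picks up `MKL_NUM_THREADS` from its environment; `scan_thread_counts` and `_CHILD` are hypothetical names.

```python
import os
import subprocess
import sys

# Child program: load the array named on the command line and run the SVD.
# It exits with status 1 when the SVD does not converge.
_CHILD = """
import sys
import numpy as np
from scipy import linalg
try:
    linalg.svd(np.load(sys.argv[1]), full_matrices=False)
except linalg.LinAlgError:
    sys.exit(1)
"""


def scan_thread_counts(npy_path, counts):
    """Return {n_threads: True if the SVD converged} for each count."""
    results = {}
    for n in counts:
        # Copy the current environment and override only the thread limit.
        env = dict(os.environ, MKL_NUM_THREADS=str(n))
        proc = subprocess.run([sys.executable, "-c", _CHILD, npy_path], env=env)
        results[n] = proc.returncode == 0
    return results
```

For the case described here the call would be `scan_thread_counts('bad_kabylake.npy', [1, 2, 3, 4])`.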

 

sergio_r_
Novice

 

Each of the following calls:
  linalg.svd(np.load(arrayname), full_matrices=False)
  linalg.svd(np.load(arrayname), full_matrices=False, lapack_driver='gesvd')
  svd(np.load(arrayname), full_matrices=False)
with arrayname taken from ['fail.npy', 'bad_avx.npy', 'bad.npy', 'bad_sse42_2.npy'], as well as with Yaroslav's array, runs successfully in my setup:

     CPU info
     --------
model name    : Intel(R) Core(TM) i7-4700MQ CPU @ 2.40GHz
     MKL info
     --------
b'Intel(R) Math Kernel Library Version 2018.0.0 Beta Build 20170316 for Intel(R) 64 architecture applications'


Python distribution details:
3.5.2 |Intel Corporation| (default, Mar 27 2017, 10:34:52)
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)]


Installed Python Version is: 3.5.2
Installed Numpy version is: 1.11.3
Installed Scipy version is: 0.18.1
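
The calls listed at the top of this post can be looped over files and drivers to get a pass/fail table. This is only a sketch: `check_drivers` is a hypothetical helper, and it covers just the two `scipy.linalg.svd` drivers.

```python
import numpy as np
from scipy import linalg


def check_drivers(paths, drivers=("gesdd", "gesvd")):
    """Map (path, driver) to True when the SVD converged, False otherwise."""
    results = {}
    for path in paths:
        a = np.load(path)
        for driver in drivers:
            try:
                linalg.svd(a, full_matrices=False, lapack_driver=driver)
                results[(path, driver)] = True
            except linalg.LinAlgError:
                results[(path, driver)] = False
    return results
```

For example, `check_drivers(['fail.npy', 'bad_avx.npy', 'bad.npy', 'bad_sse42_2.npy'])` returns one entry per file and driver.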


Salut,

Sergio
Enhance your #MachineLearning and #BigData skills via #Python #SciPy
1) https://www.packtpub.com/big-data-and-business-intelligence/numerical-and-scientific-computing-scipy-video
2) https://www.packtpub.com/big-data-and-business-intelligence/learning-scipy-numerical-and-scientific-computing-second-edition

 

Yaroslav_B_
Beginner

Sergio R -- you need a Xeon E5-2630 v3 to reproduce it; it runs fine even on other Xeons.

Eric_L_2
Beginner

The failure persists on MKL (and Parallel Studio) 2017 Update 4.

Eric_L_2
Beginner

I still have the failure on my KabyLake system with the latest version (icc --version 18.0.1 20171018).

Ribes__Albert
Beginner

I could not reproduce the failure with any of the ndarrays posted in this thread, but I came across another one which fails on my laptop. It is this one:

problematic_array.zip

When I ran it with

OMP_NUM_THREADS=1 python

it worked OK, and it also worked when I switched to OpenBLAS.
