Intel® Distribution for Python*
Engage in discussions with community peers related to Python* applications and core computational packages.

Bug in GESDD (but not GESVD)

Eric_L_2
Beginner
8,649 Views

I have found a bug in Parallel Studio 16.0.2: computing the SVD with GESDD through the Python package SciPy raises an error. It can be reproduced on an MKL-built SciPy with this array, which is finite (contains no NaN or inf):

>>> import numpy as np
>>> from scipy import linalg
>>> linalg.svd(np.load('fail.npy'), full_matrices=False)
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/larsoner/.local/lib/python2.7/site-packages/scipy/linalg/decomp_svd.py", line 119, in svd
    raise LinAlgError("SVD did not converge")
numpy.linalg.linalg.LinAlgError: SVD did not converge

I am curious if anyone has insight into why this fails, or can reproduce it themselves. I do have access to older MKL routines so if it's helpful I could see if I get the error elsewhere, too.

I have tried this with MKL-enabled Anaconda, and it does not fail, although I do experience similar failures with other arrays with the Anaconda version, which seem to only happen on systems with SSE4.2 but no AVX extensions.
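
Since the failures seem tied to which SIMD extensions the CPU offers, it can help to report them alongside a bug report. Below is a minimal sketch, assuming a Linux system where `/proc/cpuinfo` has a `flags` line; the helper name `simd_flags` is hypothetical and not part of any library.

```python
def simd_flags(cpuinfo_text, wanted=("sse4_2", "avx", "avx2")):
    """Report which of the wanted CPU flags appear in /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            # The first "flags" line is enough; all cores normally match.
            present = set(value.split())
            return {name: name in present for name in wanted}
    # No "flags" line found (e.g. not a Linux x86 cpuinfo dump).
    return {name: False for name in wanted}
```

To use it, read the file first, for example `simd_flags(open("/proc/cpuinfo").read())`.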

I recently worked on SciPy's SVD routines to add a wrapper for a GESVD backend (to complement the existing GESDD routine) here, and this command passes on bleeding-edge SciPy, so it does seem to be a problem with the GESDD implementation specifically:

>>> linalg.svd(np.load('fail.npy'), full_matrices=False, lapack_driver='gesvd')
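
Until the GESDD problem is resolved, one possible workaround is to try GESDD first and fall back to GESVD only when it reports non-convergence. This is a minimal sketch, assuming a SciPy version whose `linalg.svd` accepts `lapack_driver`; the wrapper name `robust_svd` is hypothetical.

```python
import numpy as np
from scipy import linalg


def robust_svd(a, **kwargs):
    """Try the GESDD driver first, then fall back to GESVD."""
    a = np.asarray(a)
    # Rule out the obvious cause of trouble before blaming the driver.
    if not np.isfinite(a).all():
        raise ValueError("input contains NaN or inf")
    try:
        return linalg.svd(a, lapack_driver="gesdd", **kwargs)
    except np.linalg.LinAlgError:
        # GESDD did not converge; GESVD is usually slower but uses a
        # different algorithm.
        return linalg.svd(a, lapack_driver="gesvd", **kwargs)
```

It is called the same way as `linalg.svd`, for example `robust_svd(np.load('fail.npy'), full_matrices=False)`.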

 

39 Replies
Eric_L_2
Beginner

I found another matrix that fails on my system, which has SSE4.2 only (not AVX). Oddly, it does not fail on Anaconda, but it does fail on my self-built NumPy/SciPy stack with the latest release of Parallel Studio and MKL. One difference between Anaconda and my build is that Anaconda uses gfortran rather than ifort (if that matters). Uploaded as bad_sse42_2.npy.

Eric_L_2
Beginner

I can confirm that with the latest release "icc (ICC) 17.0.0 20160721" I still have failures on my SSE4.2 machine with:

fail.npy
bad_sse42_3.npy

But it now works with:

bad.npy
bad_sse42_2.npy

So that's some progress at least. It's not easy for me to re-test the AVX failure (bad_avx.npy) so I'm not sure what the status is like there.

Eric_L_2
Beginner

I can confirm the same behavior is happening on the 2017.0.098 MKL update, namely that SVD with "bad.npy" and "bad_sse42_2.npy" still fail on my system.

However, I noticed in testing that it only fails most of the time. In maybe 1/5 or 1/10 cases, it will actually pass. This suggests to me there is possibly some memory problem going on, where the code is overwriting some memory that it shouldn't, and it only causes problems some of the time. But I should stress that even though I'm compiling my own numpy/scipy stack, most of the other folks I know who hit this issue use Anaconda Python, which gets compiled elsewhere -- so I don't think it's an issue specific to only my setup.
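
To put a number on how often it fails, the same decomposition can be repeated and the failures counted. This is a minimal sketch; the function name `count_svd_failures` and the default of 20 trials are arbitrary choices for illustration.

```python
import numpy as np
from scipy import linalg


def count_svd_failures(a, n_trials=20, lapack_driver="gesdd"):
    """Run the same SVD repeatedly and count non-convergence errors."""
    failures = 0
    for _ in range(n_trials):
        try:
            linalg.svd(a, full_matrices=False, lapack_driver=lapack_driver)
        except np.linalg.LinAlgError:
            failures += 1
    return failures
```

A result strictly between 0 and `n_trials` for a fixed input would be consistent with the nondeterministic behavior described above.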

Eric_L_2
Beginner

Just wanted to mention that the problems persist with the latest Parallel Studio-compiled version (17.0.1 20161005), but now it only happens with `fail.npy` and `bad_sse42_2.npy`, and failures no longer seem random but instead consistent.

Ying_H_Intel
Moderator

Hi Eric, 

Thanks for the update.

I just updated another gesdd issue, https://software.intel.com/en-us/forums/intel-math-kernel-library/topic/675058, so I am updating here too.

For certain matrix sizes, the zgesdd routine causes an access violation (segmentation fault).

There is a bug in gesdd (insufficient size of the rwork array). The fix is targeted for the next version (expected to be 2017 Update 2).

Let's wait and see whether 2017 Update 2 works.

Thanks

Ying

 

Eric_L_2
Beginner

Sure, I'll try it as soon as it comes out.

Eric_L_2
Beginner

I just updated to 2017 update 2 (icc --version 17.0.2 20170213), and unfortunately the same behavior exists (fail.npy and bad_sse42_2.npy both fail to converge).

Eric_L_2
Beginner

My motherboard died, so I replaced my CPU / motherboard combo with an i7-7700K Kaby Lake. I rebuilt NumPy and SciPy to take advantage of newer extensions, and all of the old matrices passed on this architecture.

However, I quickly found a new example that fails, which I have uploaded as bad_kabylake.npy. Does anyone have this CPU to try to replicate? I'm on the latest version (2017 update 2), compiled NumPy and SciPy from source, and did:

scipy.linalg.svd(np.load('bad_kabylake.npy'), full_matrices=False)

I also just tested this on Anaconda Python and get the same failure, so that should hopefully make testing and replicating easier.

Yaroslav_B_
Beginner

I'm having similar looking failures, using Anaconda Python SVD (linked against MKL).

Here's a self-contained reproducible example:

https://github.com/yaroslavvb/stuff/blob/master/svd_noconverge.py

It fails on our Xeon V3 machines but passes on Xeon V4.

Xeon V3 info

processor       : 31
vendor_id       : GenuineIntel
cpu family      : 6
model           : 63
model name      : Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
stepping        : 2
microcode       : 0x36
cpu MHz         : 1214.906
cache size      : 20480 KB
physical id     : 1

Oleksandr_P_Intel
Employee

Dear Yaroslav, 

Please use 

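import ctypes
import numpy as np

def mklVersion():
    # Zero-initialized byte buffer that MKL fills with its version string.
    ver = np.zeros(199, dtype=np.uint8)
    # Requires libmkl_rt.so to be on the library search path.
    mkl = ctypes.cdll.LoadLibrary("libmkl_rt.so")
    mkl.MKL_Get_Version_String(ver.ctypes.data_as(ctypes.c_char_p), 198)
    # Drop the remaining zero bytes; tobytes() replaces the deprecated tostring().
    return ver[ver != 0].tobytes()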

# mklVersion()

to find out the version of MKL installed on both the machine where it fails and where it passes.

Also please provide outputs of `conda list --explicit` in these environments.
I was not able to reproduce the failure on the slightly newer Xeon v3: 

[08:18:59 linmachine tmp]$ head /proc/cpuinfo | grep 'model name'
model name      : Intel(R) Xeon(R) CPU E5-2698 v3 @ 2.30GHz

Using Intel Distribution for Python 2017 update 2, which has scipy 0.18.1 and numpy 1.11.2. 


I was also unable to reproduce the problem on the same machine using Intel Distribution for Python 2017 update 1.

Thank you,
Oleksandr

 

Yaroslav_B_
Beginner

I've updated the file to print out this info: https://github.com/yaroslavvb/stuff/blob/master/svd_noconverge.py

Will try with Intel Distribution Update 2 and update this

Read 2458624 bytes from $url
SVD failure
SVD did not converge
--------------------------------------------------------------------------------
MKL version
b'Intel(R) Math Kernel Library Version 2017.0.1 Product Build 20161005 for Intel(R) 64 architecture applications'
--------------------------------------------------------------------------------
Conda version
# This file may be used to create an environment using:
# $ conda create --name <env> --file <this file>
# platform: linux-64
@EXPLICIT
https://repo.continuum.io/pkgs/free/linux-64/libgfortran-3.0.0-1.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/mkl-2017.0.1-0.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/numpy-1.12.1-py35_0.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/openssl-1.0.2k-1.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/pip-9.0.1-py35_1.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/python-3.5.3-1.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/readline-6.2-2.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/scipy-0.19.0-np112py35_0.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/setuptools-27.2.0-py35_0.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/sqlite-3.13.0-0.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/tk-8.5.18-0.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/wheel-0.29.0-py35_0.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/xz-5.2.2-1.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/zlib-1.2.8-3.tar.bz2
--------------------------------------------------------------------------------
CPU version
model name      : Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz

Yaroslav_B_
Beginner

So, I've tried running with intelpython from Distribution Update 2, same problem.

I've tried on several different Xeon V3s and see the same issue, but I only have access to V4 and Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz machines, so I am not able to test on a newer V3.

(intel) yaroslav@2:~/temp$ intelpython ~/git0/stuff/svd_noconverge.py
Read 2458624 bytes from $url
SVD failure
SVD did not converge
--------------------------------------------------------------------------------
MKL version
b'Intel(R) Math Kernel Library Version 2017.0.2 Product Build 20170126 for Intel(R) 64 architecture applications'
--------------------------------------------------------------------------------
Conda version
# This file may be used to create an environment using:
# $ conda create --name <env> --file <this file>
# platform: linux-64
@EXPLICIT
https://repo.continuum.io/pkgs/free/linux-64/libgfortran-3.0.0-1.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/mkl-2017.0.1-0.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/numpy-1.12.1-py35_0.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/openssl-1.0.2k-1.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/pip-9.0.1-py35_1.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/python-3.5.3-1.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/readline-6.2-2.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/scipy-0.19.0-np112py35_0.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/setuptools-27.2.0-py35_0.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/sqlite-3.13.0-0.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/tk-8.5.18-0.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/wheel-0.29.0-py35_0.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/xz-5.2.2-1.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/zlib-1.2.8-3.tar.bz2
--------------------------------------------------------------------------------
CPU version
model name      : Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz

Oleksandr_P_Intel
Employee

Dear Yaroslav,

I was able to reproduce the problem you are experiencing, and would like to thank you for your time to bring it to our attention.

The issue is under investigation by the MKL team. As a work-around, please try lowering the number of threads used by MKL. 

Running on the same hardware, E5 2630 v3, I saw SVD converge using MKL_NUM_THREADS=15

$ MKL_NUM_THREADS=15 python svd_nonconverge.py
Read 2458624 bytes
Success

Best regards,
Oleksandr

Eric_L_2
Beginner

Although Yaroslav's array works fine on my machine (Kabylake i7-7700K), I can confirm that my previously reported (see post above from March 9) failure file (bad_kabylake.npy) fails with MKL_NUM_THREADS=4 but passes with MKL_NUM_THREADS=3.
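
Because the outcome depends on the thread count, a sweep over `MKL_NUM_THREADS` can show where it flips. This is a rough sketch: it launches a fresh interpreter per setting, since the variable is normally read when MKL initializes, and it assumes the reproduction script exits with a non-zero status when the SVD fails (for example via an uncaught `LinAlgError`). The function names are hypothetical.

```python
import os
import subprocess
import sys


def run_with_threads(script_args, n_threads):
    """Run a Python script in a fresh interpreter with MKL_NUM_THREADS set."""
    env = dict(os.environ, MKL_NUM_THREADS=str(n_threads))
    result = subprocess.run(
        [sys.executable, *script_args],
        env=env,
        capture_output=True,
        text=True,
    )
    # Assumption: exit status 0 means the SVD converged.
    return result.returncode == 0


def sweep_threads(script_args, max_threads):
    """Map each thread count from 1 to max_threads to pass (True) or fail (False)."""
    return {n: run_with_threads(script_args, n) for n in range(1, max_threads + 1)}
```

For instance, `sweep_threads(["repro.py"], 4)` would run a hypothetical `repro.py` with one to four threads.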

 

sergio_r_
Novice

Each of the following calls:
  linalg.svd(np.load(arrayname), full_matrices=False)
  linalg.svd(np.load(arrayname), full_matrices=False, lapack_driver='gesvd')
  svd(np.load(arrayname), full_matrices=False)
with arraynames = ['fail.npy', 'bad_avx.npy', 'bad.npy', 'bad_sse42_2.npy'], as well as Yaroslav's array, runs without error on my setup:

     CPU info
     --------
model name    : Intel(R) Core(TM) i7-4700MQ CPU @ 2.40GHz
     MKL info
     --------
b'Intel(R) Math Kernel Library Version 2018.0.0 Beta Build 20170316 for Intel(R) 64 architecture applications'


Python distribution details:
3.5.2 |Intel Corporation| (default, Mar 27 2017, 10:34:52)
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)]


Installed Python Version is: 3.5.2
Installed Numpy version is: 1.11.3
Installed Scipy version is: 0.18.1
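
The battery of calls listed above can be sketched as a loop over files and LAPACK drivers. This is only an illustration; the function name `check_arrays` is hypothetical, and the `.npy` files are assumed to be the attachments from this thread saved in the working directory.

```python
import numpy as np
from scipy import linalg


def check_arrays(paths, drivers=("gesdd", "gesvd")):
    """Run the SVD on each saved array with each driver and record the outcome."""
    results = {}
    for path in paths:
        a = np.load(path)
        for driver in drivers:
            try:
                linalg.svd(a, full_matrices=False, lapack_driver=driver)
                results[(path, driver)] = "ok"
            except np.linalg.LinAlgError as exc:
                results[(path, driver)] = "failed: %s" % exc
    return results
```

For example, `check_arrays(['fail.npy', 'bad_avx.npy', 'bad.npy', 'bad_sse42_2.npy'])` returns a dictionary keyed by `(path, driver)`.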


Salut,

Sergio

Yaroslav_B_
Beginner

Sergio R -- you need a Xeon E5 2630 v3 to reproduce it; it runs fine even on other Xeons.

Eric_L_2
Beginner

The failure persists on MKL (and Parallel Studio) 2017 Update 4.

Eric_L_2
Beginner

I still have the failure on my KabyLake system with the latest version (icc --version 18.0.1 20171018).

Ribes__Albert
Beginner

I could not reproduce the failure with any of the ndarrays posted in this thread, but I came across another one that fails on my laptop. It is this one:

problematic_array.zip

When I ran it with 

OMP_NUM_THREADS=1 python

It worked OK, and it also worked when I switched to OpenBLAS.
