I have found a bug in Parallel Studio 16.0.2: I get an error when computing the SVD with GESDD from the Python package SciPy. It can be reproduced on an MKL-built SciPy with this array, which is finite (it contains no NaN or inf):
>>> import numpy as np
>>> from scipy import linalg
>>> linalg.svd(np.load('fail.npy'), full_matrices=False)
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/larsoner/.local/lib/python2.7/site-packages/scipy/linalg/decomp_svd.py", line 119, in svd
    raise LinAlgError("SVD did not converge")
numpy.linalg.linalg.LinAlgError: SVD did not converge
I am curious whether anyone has insight into why this fails, or can reproduce it themselves. I also have access to older MKL versions, so if it's helpful I could check whether I get the error with those, too.
I have tried this with MKL-enabled Anaconda and it does not fail, although I do see similar failures with other arrays on the Anaconda version; these seem to happen only on systems with SSE4.2 but no AVX extensions.
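Since the failures seem to track which instruction-set extensions a CPU has, it can help to record them alongside each report. The following is a minimal sketch that assumes Linux, where `/proc/cpuinfo` contains a `flags` line; `cpu_simd_flags` is a hypothetical helper, not part of NumPy, SciPy, or MKL.

```python
def cpu_simd_flags(path="/proc/cpuinfo"):
    """Return {'sse4_2': bool, 'avx': bool, 'avx2': bool} for the first CPU listed.

    Hypothetical helper; assumes a Linux-style /proc/cpuinfo with a 'flags'
    line. Returns None if the file is unreadable or has no such line.
    """
    wanted = ("sse4_2", "avx", "avx2")
    try:
        with open(path) as f:
            for line in f:
                if line.startswith("flags"):
                    # Line format: "flags\t\t: fpu vme ... sse4_2 ... avx ..."
                    flags = set(line.split(":", 1)[1].split())
                    return {name: name in flags for name in wanted}
    except OSError:
        pass
    return None
```

On an SSE4.2-only machine of the kind described above, this would be expected to report `sse4_2` as True and both AVX entries as False.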
I recently worked on SciPy's SVD routines to add a wrapper for a GESVD backend (to complement the existing GESDD routine) here, and this command passes on bleeding-edge SciPy, so it does seem to be a problem with the GESDD implementation specifically:
>>> linalg.svd(np.load('fail.npy'), full_matrices=False, lapack_driver='gesvd')
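Given that the GESVD driver succeeds where GESDD does not, one pragmatic workaround is to try GESDD first and fall back to GESVD only when it fails to converge. A minimal sketch, assuming SciPy 0.18 or later (where `scipy.linalg.svd` accepts `lapack_driver`); `robust_svd` is a hypothetical name:

```python
import numpy as np
from scipy import linalg


def robust_svd(a, full_matrices=False):
    """SVD that tries the GESDD driver first and falls back to GESVD.

    Hypothetical helper; assumes SciPy >= 0.18 for the lapack_driver argument.
    """
    try:
        # Default driver: divide-and-conquer (GESDD).
        return linalg.svd(a, full_matrices=full_matrices, lapack_driver="gesdd")
    except linalg.LinAlgError:
        # GESDD reported non-convergence; retry with the QR-iteration
        # based GESVD driver, which uses a different algorithm.
        return linalg.svd(a, full_matrices=full_matrices, lapack_driver="gesvd")
```

For example, `u, s, vt = robust_svd(np.load('fail.npy'))`. GESVD is typically slower than GESDD on large matrices, so only paying for it on failure keeps the common case fast.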
Found another matrix that fails on my system with SSE4.2 only (not AVX). Oddly, it does not fail on Anaconda, but it does fail on my self-built NumPy/SciPy stack with the latest release of Parallel Studio and MKL. One difference between Anaconda and my build is that Anaconda uses gfortran rather than ifort (if that matters). Uploaded as bad_sse42_2.npy.
I can confirm that with the latest release, "icc (ICC) 17.0.0 20160721", I still have failures on my SSE4.2 machine with:
fail.npy
bad_sse42_3.npy
But it now works with:
bad.npy
bad_sse42_2.npy
So that's some progress at least. It's not easy for me to re-test the AVX failure (bad_avx.npy) so I'm not sure what the status is like there.
I can confirm the same behavior on the 2017.0.098 MKL update, namely that SVD of "bad.npy" and "bad_sse42_2.npy" still fails on my system.
However, I noticed in testing that it does not fail every time: in maybe 1 of every 5 to 10 runs it actually passes. This suggests to me that there may be a memory problem, where the code overwrites memory that it shouldn't and this only causes problems some of the time. I should stress, though, that even though I compile my own NumPy/SciPy stack, most of the other people I know who hit this issue use Anaconda Python, which is compiled elsewhere, so I don't think the issue is specific to my setup.
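To put a number on how often a given input fails, the same array can be decomposed repeatedly and the failures counted. A minimal sketch, assuming SciPy 0.18+ for `lapack_driver`; `gesdd_failure_rate` is a hypothetical diagnostic helper, not a SciPy function:

```python
import numpy as np
from scipy import linalg


def gesdd_failure_rate(a, trials=20):
    """Fraction of `trials` GESDD-based SVDs of `a` that raise LinAlgError.

    Hypothetical helper. A rate strictly between 0 and 1 on identical input
    suggests nondeterministic behavior (e.g. threading or memory issues)
    rather than a property of the data alone.
    """
    a = np.asarray(a)
    failures = 0
    for _ in range(trials):
        try:
            linalg.svd(a, full_matrices=False, lapack_driver="gesdd")
        except linalg.LinAlgError:
            failures += 1
    return failures / trials
```

For example, `gesdd_failure_rate(np.load('bad.npy'), trials=50)` would give a rough pass/fail ratio for the observation above.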
Just wanted to mention that the problems persist with the latest Parallel Studio-compiled version (17.0.1 20161005), but now it only happens with `fail.npy` and `bad_sse42_2.npy`, and failures no longer seem random but instead consistent.
Hi Eric,
Thanks for the update.
I just posted an update on another gesdd issue, https://software.intel.com/en-us/forums/intel-math-kernel-library/topic/675058, so I'm updating here too:
The zgesdd routine can cause an access violation (segmentation fault) for certain matrix sizes.
There is a bug in gesdd (insufficient size of the rwork array). The fix is targeted for the next version (expected to be 2017 Update 2).
Let's wait and see whether 2017 Update 2 works.
Thanks
Ying
Sure, I'll try it as soon as it comes out.
I just updated to 2017 update 2 (icc --version 17.0.2 20170213), and unfortunately the same behavior exists (fail.npy and bad_sse42_2.npy both fail to converge).
My motherboard died, so I replaced my CPU/motherboard combo with an i7-7700K (Kaby Lake). I rebuilt NumPy and SciPy to take advantage of the newer extensions, and all of the old matrices passed on this architecture.
However, I quickly found a new example that fails, which I have uploaded as bad_kabylake.npy. Does anyone have this CPU to try to replicate? I'm on the latest version (2017 update 2), compiled NumPy and SciPy from source, and did:
scipy.linalg.svd(np.load('bad_kabylake.npy'), full_matrices=False)
I also just tested this on Anaconda Python and get the same failure, which should hopefully make it easier to test and replicate.
I'm having similar looking failures, using Anaconda Python SVD (linked against MKL).
Here's a self-contained reproducible example:
https://github.com/yaroslavvb/stuff/blob/master/svd_noconverge.py
It fails on our Xeon V3 machines, passes on Xeon V4
Xeon V3 info
processor : 31
vendor_id : GenuineIntel
cpu family : 6
model : 63
model name : Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
stepping : 2
microcode : 0x36
cpu MHz : 1214.906
cache size : 20480 KB
physical id : 1
Dear Yaroslav,
Please use
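import ctypes
import numpy as np

def mklVersion():
    # MKL_Get_Version_String fills a caller-supplied char buffer
    ver = np.zeros(199, dtype=np.uint8)
    mkl = ctypes.cdll.LoadLibrary("libmkl_rt.so")
    mkl.MKL_Get_Version_String(ver.ctypes.data_as(ctypes.c_char_p), 198)
    # drop the zero padding; tobytes() replaces the deprecated tostring()
    return ver[ver != 0].tobytes()

print(mklVersion())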
to find out the version of MKL installed on both the machine where it fails and where it passes.
Also please provide outputs of `conda list --explicit` in these environments.
I was not able to reproduce the failure on the slightly newer Xeon v3:
[08:18:59 linmachine tmp]$ head /proc/cpuinfo | grep 'model name'
model name : Intel(R) Xeon(R) CPU E5-2698 v3 @ 2.30GHz
Using Intel Distribution for Python 2017 update 2, which has scipy 0.18.1 and numpy 1.11.2.
I was also unable to reproduce the problem on the same machine using Intel Distribution for Python 2017 update 1.
Thank you,
Oleksandr
I've updated the file to print out this info: https://github.com/yaroslavvb/stuff/blob/master/svd_noconverge.py
Will try with Intel Distribution Update 2 and update this
Read 2458624 bytes from $url
SVD failure
SVD did not converge
--------------------------------------------------------------------------------
MKL version
b'Intel(R) Math Kernel Library Version 2017.0.1 Product Build 20161005 for Intel(R) 64 architecture applications'
--------------------------------------------------------------------------------
Conda version
# This file may be used to create an environment using:
# $ conda create --name <env> --file <this file>
# platform: linux-64
@EXPLICIT
https://repo.continuum.io/pkgs/free/linux-64/libgfortran-3.0.0-1.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/mkl-2017.0.1-0.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/numpy-1.12.1-py35_0.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/openssl-1.0.2k-1.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/pip-9.0.1-py35_1.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/python-3.5.3-1.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/readline-6.2-2.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/scipy-0.19.0-np112py35_0.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/setuptools-27.2.0-py35_0.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/sqlite-3.13.0-0.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/tk-8.5.18-0.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/wheel-0.29.0-py35_0.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/xz-5.2.2-1.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/zlib-1.2.8-3.tar.bz2
--------------------------------------------------------------------------------
CPU version
model name : Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
So, I've tried running with intelpython from Intel Distribution for Python 2017 Update 2: same problem.
I've tried several different Xeon v3 machines with the same result, but I only have access to v4 and Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz machines, so I'm not able to test on a newer v3.
(intel) yaroslav@2:~/temp$ intelpython ~/git0/stuff/svd_noconverge.py
Read 2458624 bytes from $url
SVD failure
SVD did not converge
--------------------------------------------------------------------------------
MKL version
b'Intel(R) Math Kernel Library Version 2017.0.2 Product Build 20170126 for Intel(R) 64 architecture applications'
--------------------------------------------------------------------------------
Conda version
# This file may be used to create an environment using:
# $ conda create --name <env> --file <this file>
# platform: linux-64
@EXPLICIT
https://repo.continuum.io/pkgs/free/linux-64/libgfortran-3.0.0-1.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/mkl-2017.0.1-0.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/numpy-1.12.1-py35_0.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/openssl-1.0.2k-1.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/pip-9.0.1-py35_1.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/python-3.5.3-1.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/readline-6.2-2.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/scipy-0.19.0-np112py35_0.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/setuptools-27.2.0-py35_0.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/sqlite-3.13.0-0.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/tk-8.5.18-0.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/wheel-0.29.0-py35_0.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/xz-5.2.2-1.tar.bz2
https://repo.continuum.io/pkgs/free/linux-64/zlib-1.2.8-3.tar.bz2
--------------------------------------------------------------------------------
CPU version
model name : Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
Dear Yaroslav,
I was able to reproduce the problem you are experiencing, and I would like to thank you for taking the time to bring it to our attention.
The issue is under investigation by the MKL team. As a work-around, please try lowering the number of threads used by MKL.
Running on the same hardware (E5-2630 v3), I saw the SVD converge with MKL_NUM_THREADS=15:
$ MKL_NUM_THREADS=15 python svd_nonconverge.py
Read 2458624 bytes
Success
Best regards,
Oleksandr
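The same work-around can be applied from inside a script rather than on the command line. A minimal sketch, assuming MKL reads its thread count from the environment when it is first loaded, so the variables must be set before NumPy or SciPy is imported; `limit_blas_threads` is a hypothetical helper, and 15 is simply the value from the report above:

```python
import os


def limit_blas_threads(n):
    """Cap the threads used by MKL/OpenMP; must run before importing NumPy/SciPy.

    MKL_NUM_THREADS is MKL-specific and takes precedence; OMP_NUM_THREADS is
    the generic OpenMP setting, which MKL falls back to if the former is unset.
    """
    for var in ("MKL_NUM_THREADS", "OMP_NUM_THREADS"):
        os.environ[var] = str(n)


limit_blas_threads(15)

import numpy as np  # noqa: E402  (imported after the thread limit on purpose)
from scipy import linalg  # noqa: E402
```

If NumPy has already been imported earlier in the process, changing these variables typically has no effect on the MKL that is already loaded; the `mkl-service` package shipped with Anaconda provides `mkl.set_num_threads()` for changing the count at runtime.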
Although Yaroslav's array works fine on my machine (Kaby Lake i7-7700K), I can confirm that my previously reported (see post above from March 9) failure file (bad_kabylake.npy) fails with MKL_NUM_THREADS=4 but passes with MKL_NUM_THREADS=3.
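Because the thread count is fixed when MKL loads, checking several thread counts is easiest with one fresh interpreter per setting. A minimal sketch, assuming the array is stored in a `.npy` file and that the child interpreter picks up the same NumPy/SciPy/MKL as the parent; `svd_passes` and `sweep_threads` are hypothetical helpers:

```python
import os
import subprocess
import sys

# Code run in each child process: load the array and call SciPy's default
# (GESDD-based) SVD. An uncaught LinAlgError gives a nonzero exit status.
_CHILD = (
    "import sys\n"
    "import numpy as np\n"
    "from scipy import linalg\n"
    "linalg.svd(np.load(sys.argv[1]), full_matrices=False)\n"
)


def svd_passes(path, n_threads):
    """Return True if the SVD of the array in `path` succeeds with n_threads."""
    env = dict(os.environ, MKL_NUM_THREADS=str(n_threads))
    result = subprocess.run(
        [sys.executable, "-c", _CHILD, path],
        env=env,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def sweep_threads(path, max_threads=None):
    """Map each thread count 1..max_threads to whether the SVD passed."""
    max_threads = max_threads or os.cpu_count() or 1
    return {n: svd_passes(path, n) for n in range(1, max_threads + 1)}
```

For example, `sweep_threads('bad_kabylake.npy', max_threads=4)` would show whether the pass/fail boundary lies between 3 and 4 threads on a given machine.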
Each of the following calls:
linalg.svd(np.load(arrayname), full_matrices=False)
linalg.svd(np.load(arrayname), full_matrices=False, lapack_driver='gesvd')
svd(np.load(arrayname), full_matrices=False)
with arrayname in ['fail.npy', 'bad_avx.npy', 'bad.npy', 'bad_sse42_2.npy'], as well as Yaroslav's array, runs without error on my setup:
CPU info
--------
model name : Intel(R) Core(TM) i7-4700MQ CPU @ 2.40GHz
MKL info
--------
b'Intel(R) Math Kernel Library Version 2018.0.0 Beta Build 20170316 for Intel(R) 64 architecture applications'
Python distribution details:
3.5.2 |Intel Corporation| (default, Mar 27 2017, 10:34:52)
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)]
Installed Python Version is: 3.5.2
Installed Numpy version is: 1.11.3
Installed Scipy version is: 0.18.1
Salut,
Sergio
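Reports like the one above are easier to compare when every driver is tried on every file and the outcome tabulated. A minimal sketch: the file names are the ones posted in this thread, treating the unqualified `svd(...)` call above as `numpy.linalg.svd` is an assumption, and `check_all` is a hypothetical helper.

```python
import numpy as np
from scipy import linalg

# Candidate SVD calls; "numpy" assumes the bare svd() above is numpy.linalg.svd.
DRIVERS = {
    "scipy-gesdd": lambda a: linalg.svd(a, full_matrices=False, lapack_driver="gesdd"),
    "scipy-gesvd": lambda a: linalg.svd(a, full_matrices=False, lapack_driver="gesvd"),
    "numpy": lambda a: np.linalg.svd(a, full_matrices=False),
}


def check_all(paths):
    """Return {(path, driver): 'ok' or 'no convergence'} for every combination."""
    results = {}
    for path in paths:
        a = np.load(path)
        for name, svd in DRIVERS.items():
            try:
                svd(a)
                results[(path, name)] = "ok"
            except np.linalg.LinAlgError:  # SciPy raises the same exception class
                results[(path, name)] = "no convergence"
    return results
```

For example, `check_all(['fail.npy', 'bad_avx.npy', 'bad.npy', 'bad_sse42_2.npy'])` gives one entry per file and driver, which can be posted alongside the CPU and MKL version details.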
Sergio R: you need a Xeon E5-2630 v3 to reproduce it; it runs fine even on other Xeons.
The failure persists on MKL (and Parallel Studio) 2017 Update 4.
I still have the failure on my KabyLake system with the latest version (icc --version 18.0.1 20171018).
I could not reproduce the failure with any of the arrays posted in this issue, but I came across another one that fails on my laptop. It is this one:
When I ran it with
OMP_NUM_THREADS=1 python
it worked OK, and it also worked when I switched to OpenBLAS.