Use numba in intel python environment, but get more running time and memory error

I am using Anaconda with Python 3.6. First, I have the code below to create a 3D matrix. It basically evaluates a phase term over the matrix, where the phase is determined by the location within the matrix.

import numpy as np

g = lambda p, m, n: np.exp(2j*np.pi*(kz*p + kx*m + ky*n)).astype(np.complex64)
shift = np.fromfunction(g, (2*nz, 2*nx, 2*ny))
shift = np.roll(np.roll(np.roll(shift, nz, 0), nx, 1), ny, 2)

kz, kx, ky are known float parameters; nz, nx, ny set the size of the matrix.
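For reference, a self-contained toy-sized version of the snippet above (the sizes and phase constants here are placeholders, not the real values):

```python
import numpy as np

# placeholder values, only to make the snippet runnable at toy size
kz, kx, ky = 0.0, -0.2, -0.2
nz, nx, ny = 2, 3, 4

g = lambda p, m, n: np.exp(2j*np.pi*(kz*p + kx*m + ky*n)).astype(np.complex64)
shift = np.fromfunction(g, (2*nz, 2*nx, 2*ny))
shift = np.roll(np.roll(np.roll(shift, nz, 0), nx, 1), ny, 2)
print(shift.shape, shift.dtype)  # (4, 6, 8) complex64
```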

Then I have a faster version of the same computation based on numba.jit.

import numpy as np
from numba import jit

x = np.arange(2*nx)
y = np.arange(2*ny)
z = np.arange(2*nz)
zv, yv, xv = np.meshgrid(z, y, x)

@jit(nopython=True, parallel=True)
def f(z, x, y):
    # include the kz term so this matches the first snippet (kz may be nonzero)
    a = np.exp(2j*np.pi*(kz*z + kx*x + ky*y)).astype(np.complex64)
    return a

a = f(zv, xv, yv)
a = a.swapaxes(0, 1)
a = np.roll(np.roll(np.roll(a, nz, 0), nx, 1), ny, 2)

This version is faster than the first one, but I am hoping for an even faster one.

I installed the Intel Python distribution, but I find that the two snippets above no longer work. In the IPD environment, the first one raises a MemoryError, and the second one runs for a very long time until the kernel finally dies.

Can anyone point out why?

Could you please indicate the amount of memory on the machine where you experience the issue in question, as well as ball-park values for nx, ny, nz.

Essentially, we need to be able to reproduce the issue with a reasonable effort to provide an answer.

Please also indicate which version of Intel Distribution for Python you are using. You can find this out by including the output of `conda list intelpython`. 

Thank you

kx = -0.21445783132530122
ky = -0.21445783132530122
kz = 0

nx = 512
ny = 512
nz = 390

 

I am using Anaconda3-5.0.0-Windows-x86_64 and Intel w_python3_p_2018.0.018; both are the latest versions available online.

The computer has an Intel Core i7-7700HQ quad-core CPU and 16 GB of RAM.

OK, the array `shift` has shape `(2*nz, 2*nx, 2*ny)` = `(780, 1024, 1024)` of complex singles, which amounts to about 6 GB of memory, so you should be able to fit it, but any code that creates many intermediate arrays may raise a `MemoryError` exception.

`np.fromfunction` passes the function `g` three arrays of doubles, `p`, `m`, `n`, each of shape `(2*nz, 2*nx, 2*ny)`, hence each taking about 6 GB of memory.

Computing `kz*p + kx*m + ky*n` verbatim creates an intermediate array for each sub-operation: the scalar multiplications and additions each allocate a temporary, the multiplication by `2j*np.pi` produces a complex temporary, the call to `np.exp` creates another intermediate array of complex doubles, and that result is finally cast into a newly allocated array of complex singles.
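A quick back-of-the-envelope check of those figures, using the sizes reported in this thread (nz = 390, nx = ny = 512); note that a complex single and a double are both 8 bytes, so the complex64 result and each double index array have the same ~6 GB footprint:

```python
import numpy as np

nz, nx, ny = 390, 512, 512              # sizes reported in the thread
shape = (2*nz, 2*nx, 2*ny)              # (780, 1024, 1024)
n_elems = int(np.prod(shape))
gib_single = n_elems * np.dtype(np.complex64).itemsize / 1024**3
gib_double = n_elems * np.dtype(np.float64).itemsize / 1024**3
print(shape, round(gib_single, 2), round(gib_double, 2))
# both come out to roughly 6.1 GiB per array
```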

Using Anaconda 5, and Intel Distribution for Python, on Intel(R) Xeon(R) CPU E5-2698 v3 @ 2.30GHz with 64GB of RAM, I get

# for Anaconda 5
In[3]: %time shift = np.fromfunction(g,(2*nz,2*nx,2*ny))
CPU times: user 1min 14s, sys: 31.6 s, total: 1min 46s
Wall time: 1min 48s

# for IDP 2018.0.0
In[3]: %time shift = np.fromfunction(g,(2*nz,2*nx,2*ny))
CPU times: user 2min, sys: 26.1 s, total: 2min 26s
Wall time: 1min 31s

However, writing those steps individually, I am able to achieve better performance and use less memory:

def g2(p, m, n):
    tmp = kz*p                  # single scratch buffer, reused throughout
    ph = tmp.copy()             # accumulator for the phase
    np.copyto(tmp, m)
    tmp *= kx
    ph += tmp                   # ph = kz*p + kx*m
    np.copyto(tmp, n)
    tmp *= ky
    ph += tmp                   # ph = kz*p + kx*m + ky*n
    ph *= 2*np.pi
    np.cos(ph, out=tmp)         # real part, computed in place
    r = np.empty(tmp.shape, np.singlecomplex)
    r.real[:] = tmp
    np.sin(ph, out=tmp)         # imaginary part
    del ph                      # release the phase array early
    r.imag[:] = tmp
    return r

Now, running 

# Anaconda 5
In[4]: %time shift2 = np.fromfunction(g2,(2*nz,2*nx,2*ny))
CPU times: user 54.2 s, sys: 18.5 s, total: 1min 12s
Wall time: 1min 12s


# IDP 2018.0.0
In[4]: %time shift2 = np.fromfunction(g2,(2*nz,2*nx,2*ny))
CPU times: user 1min 24s, sys: 17.6 s, total: 1min 42s
Wall time: 16.3 s


Similarly, to be mindful of intermediate arrays, you should apply each `np.roll` on its own line, reassigning the result, so that only one temporary is alive at a time.
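For example (a sketch on a toy-sized array): rebinding the name after each roll keeps at most one temporary alive, and the result equals the chained expression:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((6, 8, 10)).astype(np.complex64)
nz, nx, ny = 3, 4, 5

# chained form: all three temporaries can be alive at once
chained = np.roll(np.roll(np.roll(a, nz, 0), nx, 1), ny, 2)

# one roll per statement: each temporary is freed before the next is made
b = np.roll(a, nz, axis=0)
b = np.roll(b, nx, axis=1)
b = np.roll(b, ny, axis=2)

print(np.array_equal(b, chained))  # True
```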

Furthermore, it is probably best to create the `shift` array in blocks and combine the results.
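A minimal sketch of that block-wise idea at toy size, assuming blocks along the first axis (`g2_block` here is a hypothetical, simplified stand-in for `g2` above):

```python
import numpy as np

# placeholder constants and toy sizes, not the real values
kz, kx, ky = 0.0, -0.2, -0.2
nz, nx, ny = 3, 4, 5

def g2_block(p, m, n):
    # same phase computation as g2, applied to one block at a time
    ph = 2*np.pi*(kz*p + kx*m + ky*n)
    r = np.empty(ph.shape, np.complex64)
    r.real[:] = np.cos(ph)
    r.imag[:] = np.sin(ph)
    return r

shape = (2*nz, 2*nx, 2*ny)
out = np.empty(shape, np.complex64)
# index grids for the last two axes, reused by every block
m, n = np.meshgrid(np.arange(2*nx), np.arange(2*ny), indexing='ij')
block = 2  # first-axis rows per block; only one block is in flight at a time
for z0 in range(0, shape[0], block):
    p = np.arange(z0, min(z0 + block, shape[0]), dtype=float)[:, None, None]
    out[z0:z0 + block] = g2_block(p, m[None], n[None])
```

Only one block-sized set of temporaries exists at any moment, which is what keeps the peak memory use down.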

I used a smaller machine and got a MemoryError even for your NumPy code:

In [1]: import numpy as np
   ...: (kx, ky, kz, nx, ny, nz) = (-0.21445783132530122, -0.21445783132530122, 0, 512, 512, 390)
   ...:

In [2]: g = lambda p,m,n: np.exp(2j*np.pi*(kz*p+kx*m+ky*n)).astype(np.complex64)

In [3]: %time shift = np.fromfunction(g,(2*nz,2*nx,2*ny))
---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<timed exec> in <module>()

/home/miniconda3/envs/intel3/lib/python3.6/site-packages/numpy/core/numeric.py in fromfunction(function, shape, **kwargs)
   2130     dtype = kwargs.pop('dtype', float)
   2131     args = indices(shape, dtype=dtype)
-> 2132     return function(*args, **kwargs)
   2133
   2134

<ipython-input-2-89b938ec44bd> in <lambda>(p, m, n)
----> 1 g = lambda p,m,n: np.exp(2j*np.pi*(kz*p+kx*m+ky*n)).astype(np.complex64)

MemoryError:

Then I used Dask to split the array into chunks:

In [10]: import dask
    ...: import dask.array as da

In [11]: %time shift = da.fromfunction(g, shape=(2*nz,2*nx,2*ny), dtype=np.double, chunks=(32,32,32)).compute(get=dask.local.get_sync)
CPU times: user 1min 7s, sys: 24.1 s, total: 1min 31s
Wall time: 1min 34s

It works!

As for Numba in IDP being slower, we cannot reproduce that on a machine with sufficient memory for the computation.