I've been using the Intel Distribution for Python 2.7 for a while and noticed a performance issue I'd like to understand.
In the most performance-critical part of my application I need to generate an array of complex exponentials. I used the numpy.exp function for this purpose but noticed a performance regression there. On my laptop I compared stock Python 2.7 with Intel's Python 2.7. To isolate the problem, I came up with a separate test file:
import numpy as np
import timeit

def setup_arg():
    global arg_real
    global arg_comp
    arg_real = -2 * np.pi * np.arange(0.0, 58000 * 20, dtype=np.float32)
    arg_comp = -2j * np.pi * np.arange(0.0, 58000 * 20, dtype=np.float32)

def test_exp():
    df_exp = np.exp(arg_comp)

def test_sincos():
    df_exp = np.cos(arg_real) + 1j * np.sin(arg_real)

print('Exp: ' + str(timeit.timeit(test_exp, setup=setup_arg, number=100)))
print('sin_cos: ' + str(timeit.timeit(test_sincos, setup=setup_arg, number=100)))
As you can see, I'm comparing two implementations of the exponential generation: the exp function versus the sum of cos and sin. The input is a fairly large array whose generation is excluded from the timing (it happens in the setup function).
Here is what I get using regular Python:
Sine and cosine were optimized in the Intel Distribution for Python to use MKL, which threads the evaluation of real-valued elementary functions over array elements; this explains why sin/cos is faster. That gain can be improved further at the expense of readability: the extra speedup comes from avoiding the creation of intermediate temporary arrays and needless copying (casting).
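To illustrate what "avoiding intermediate temporaries" means here, a minimal sketch (variable names are my own): the straightforward `np.cos(x) + 1j * np.sin(x)` allocates complex temporaries for the multiplication and the addition, while writing the real-valued sin/cos results directly into the real and imaginary views of a preallocated complex array avoids them.

```python
import numpy as np

x = np.linspace(0.0, 2.0 * np.pi, 1000, dtype=np.float32)

# Straightforward version: `1j * np.sin(x)` allocates a complex temporary,
# and the subsequent addition allocates another one.
naive = np.cos(x) + 1j * np.sin(x)

# Temporary-avoiding sketch: reuse a single real buffer and write the
# results straight into the real/imag views of a preallocated array.
result = np.empty(x.shape, dtype=np.csingle)
buf = np.cos(x)             # one real-valued buffer
result.real[:] = buf        # copy cos(x) into the real part
np.sin(x, out=buf)          # reuse the same buffer for sin(x)
result.imag[:] = buf        # copy sin(x) into the imaginary part

assert np.allclose(naive, result)
```

Both versions compute the same values; only the number of allocations differs.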
Before answering why the complex exponential got a little slower relative to stock Python, let me first explain why test_sincos is faster than direct exponentiation even in stock Python. The complex exponential has to handle both the real and imaginary parts of its input, so it misses the opportunity to save work by knowing that the real part of the argument is always zero.
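The identity being exploited is Euler's formula, exp(i·x) = cos(x) + i·sin(x) for real x; when the argument is purely imaginary, the exp(Re(z)) scaling factor is identically 1 and need not be computed. A quick sanity check (the sample values are my own choice):

```python
import numpy as np

x = np.linspace(-10.0, 10.0, 1001)

# General complex exponential: internally must evaluate
# exp(Re(z)) * (cos(Im(z)) + 1j * sin(Im(z))) for every element.
via_exp = np.exp(1j * x)

# Specialized form exploiting Re(argument) == 0, so exp(0) == 1.
via_sincos = np.cos(x) + 1j * np.sin(x)

assert np.allclose(via_exp, via_sincos)
```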
The NumPy in the Intel Distribution for Python is compiled with the Intel C Compiler, while the PyPI NumPy is compiled with GCC. Due to the lack of C99 support across the supported platforms and compilers, NumPy does not use C99 complex types; instead it rolls its own data type and implements its own operations on it. It turns out that the Intel C Compiler generates slightly less optimal code for these structures than GCC does. The Intel C Compiler developers have been notified of the discrepancy.
Thank you for taking the time to bring the issue to our attention.
In : import numpy as np
...: def test_sincos(x):
...:     return (np.cos(x) + 1j * np.sin(x))
...:

In : def test_sincos2(x):
...:     df_exp = np.empty(x.shape, dtype=np.csingle)
...:     trig_buf = np.cos(x)
...:     df_exp.real[:] = trig_buf
...:     np.sin(x, out=trig_buf)
...:     df_exp.imag[:] = trig_buf
...:     return df_exp
...:

In : x = -2 * np.pi/(58000*20) * np.arange(0, 58000 * 20, dtype=np.float32)

In : %timeit test_sincos(x)
24.7 ms ± 2.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In : %timeit np.exp(1j*x)
62.1 ms ± 1.73 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In : %timeit test_sincos2(x)
16 ms ± 1.35 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)