I've been using the Intel Distribution for Python 2.7 for a while and noticed a performance issue I'd like to understand.
In the most performance-critical part of my application I need to generate an array of complex exponentials. I used the numpy.exp function for this purpose but noticed a performance regression there. On my laptop I compared stock Python 2.7 with Intel's Python 2.7. To isolate the problem, I came up with a separate test file:
import numpy as np
import timeit

def setup_arg():
    global arg_real
    global arg_comp
    arg_real = -2 * np.pi * np.arange(0.0, 58000 * 20, dtype=np.float32)
    arg_comp = -2j * np.pi * np.arange(0.0, 58000 * 20, dtype=np.float32)

def test_exp():
    df_exp = np.exp(arg_comp)

def test_sincos():
    df_exp = np.cos(arg_real) + 1j * np.sin(arg_real)

print('Exp: ' + str(timeit.timeit(test_exp, setup=setup_arg, number=100)))
print('sin_cos: ' + str(timeit.timeit(test_sincos, setup=setup_arg, number=100)))
As you can see, I'm comparing two implementations of the exponential generation: the exp function versus the sum of cos and sin. The input is a fairly large array whose generation is excluded from the timing (it happens in the setup function).
Here is what I get using regular Python:
Sine and cosine were optimized in the Intel Distribution for Python to use MKL, which threads the evaluation of real-valued elementary functions over array elements; this explains why sin/cos is faster. That gain can be improved further at the expense of readability: the extra speedup comes from avoiding the creation of intermediate temporary arrays and needless copying (casting).
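To illustrate what "avoiding intermediate temporaries" means here, a minimal sketch (variable names are my own): the straightforward `np.cos(x) + 1j * np.sin(x)` allocates complex temporaries for the multiplication and the addition, while writing the real-valued sin/cos results directly into the real and imaginary views of a preallocated complex array avoids them.

```python
import numpy as np

x = np.linspace(0.0, 2.0 * np.pi, 1000, dtype=np.float32)

# Straightforward version: `1j * np.sin(x)` allocates a complex temporary,
# and the subsequent addition allocates another one.
naive = np.cos(x) + 1j * np.sin(x)

# Temporary-avoiding sketch: reuse a single real buffer and write the
# results straight into the real/imag views of a preallocated array.
result = np.empty(x.shape, dtype=np.csingle)
buf = np.cos(x)             # one real-valued buffer
result.real[:] = buf        # copy cos(x) into the real part
np.sin(x, out=buf)          # reuse the same buffer for sin(x)
result.imag[:] = buf        # copy sin(x) into the imaginary part

assert np.allclose(naive, result)
```

Both versions compute the same values; only the number of allocations differs.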
Before answering why the complex exponential got a little slower relative to stock Python, let me first explain why test_sincos is faster than direct exponentiation even in stock Python. The complex exponential has to handle both the real and imaginary parts of its input, so it misses the opportunity to save work by knowing that the real part of the argument is always zero.
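The identity being exploited is Euler's formula, exp(i·x) = cos(x) + i·sin(x) for real x; when the argument is purely imaginary, the exp(Re(z)) scaling factor is identically 1 and need not be computed. A quick sanity check (the sample values are my own choice):

```python
import numpy as np

x = np.linspace(-10.0, 10.0, 1001)

# General complex exponential: internally must evaluate
# exp(Re(z)) * (cos(Im(z)) + 1j * sin(Im(z))) for every element.
via_exp = np.exp(1j * x)

# Specialized form exploiting Re(argument) == 0, so exp(0) == 1.
via_sincos = np.cos(x) + 1j * np.sin(x)

assert np.allclose(via_exp, via_sincos)
```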
The NumPy in the Intel Distribution for Python is compiled with the Intel C Compiler, while the PyPI NumPy is compiled with GCC. Due to the lack of C99 support across the supported platforms and compilers, NumPy does not use C99 complex types; instead it rolls its own data type and implements its own operations on it. It turns out that the Intel C Compiler generates slightly less optimal code for these structures than GCC does. The Intel C Compiler developers have been notified of the discrepancy.
Thank you for taking the time to bring the issue to our attention.
In : import numpy as np
...: def test_sincos(x):
...:     return (np.cos(x) + 1j * np.sin(x))
...:

In : def test_sincos2(x):
...:     df_exp = np.empty(x.shape, dtype=np.csingle)
...:     trig_buf = np.cos(x)
...:     df_exp.real[:] = trig_buf
...:     np.sin(x, out=trig_buf)
...:     df_exp.imag[:] = trig_buf
...:     return df_exp
...:

In : x = -2 * np.pi/(58000*20) * np.arange(0, 58000 * 20, dtype=np.float32)

In : %timeit test_sincos(x)
24.7 ms ± 2.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In : %timeit np.exp(1j*x)
62.1 ms ± 1.73 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In : %timeit test_sincos2(x)
16 ms ± 1.35 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)