
Intel® Distribution for Python 2017 Update 2 accelerates five key areas for impressive performance gains

Sergey_M_Intel2
Employee

Intel Corporation is pleased to announce the release of Intel® Distribution for Python* 2017 Update 2, which offers both performance improvements and new features. 


Update 2 offers great performance improvements for NumPy*, SciPy*, and Scikit-learn* that you can see across a range of Intel processors, from Intel® Core™ CPUs to Intel® Xeon® and Intel® Xeon Phi™ processors. 


Benchmarks for all these accelerations will be published soon. This post provides a preview of their nature and extent, and of what they mean for your applications.

Fast Fourier Transforms
In addition to the initial Fast Fourier Transform (FFT) optimizations offered in previous releases, Update 2 brings widespread optimizations to the NumPy and SciPy FFT functions. It offers a layered interface to the Intel® Math Kernel Library (Intel® MKL) that allows efficient access to native FFT optimizations from a range of NumPy and SciPy functions. The optimizations cover real and complex data types, in both single and double precision, and both 1D and multidimensional data, in place and out of place. As a result, performance may improve by up to 60x over Update 1 and is now close to that of native C/Intel MKL.
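
To illustrate, here is a minimal sketch (the array sizes are only illustrative) of the NumPy and SciPy FFT calls covered by these optimizations; existing code picks up the MKL-backed implementations without any changes:

```python
import numpy as np
from scipy import fftpack

# 1D complex-to-complex FFT, double precision
x = np.random.rand(2**20) + 1j * np.random.rand(2**20)
X = np.fft.fft(x)

# Multidimensional real-to-complex FFT, single precision
a = np.random.rand(512, 512).astype(np.float32)
A = np.fft.rfft2(a)

# SciPy's FFT functions are routed through the same optimizations
y = fftpack.fft(x)
```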


Arithmetic and transcendental expressions
NumPy is designed for high-performance basic arithmetic and transcendental operations on ndarrays. Some umath primitives are optimized to benefit from the SSE, AVX, and (more recently) AVX2 instruction sets, but not from AVX-512. The original NumPy functions also did not take advantage of multiple cores. Update 2 makes substantial changes to the internals of NumPy to incorporate the Intel MKL Vector Math Library (VML) in the respective umath primitives, enabling support for all available cores on a system and for all CPU instruction sets.


The logic in the Update 2 NumPy umath works as follows:
•    For short NumPy arrays, the overhead of distributing work across multiple threads is high relative to the amount of computation. In such cases, Update 2 uses the Intel MKL Short Vector Math Library (SVML), which is optimized for good performance on short vectors across a range of Intel CPUs.
•    For large arrays, threading overheads are low compared to the amount of computation, and Update 2 uses the Intel MKL VML, which is optimized for multiple cores and a range of Intel CPUs.
NumPy arithmetic and transcendental operations on vector-vector and vector-scalar inputs are accelerated by up to 400x on Intel® Xeon Phi™ processors.
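
Below is a minimal sketch of code that exercises both paths; the array sizes are only illustrative, since the actual SVML/VML cut-over threshold is internal to the implementation:

```python
import numpy as np

small = np.random.rand(100)      # short array: dispatched to MKL SVML
large = np.random.rand(10**7)    # large array: dispatched to multithreaded MKL VML

# Transcendental umath primitives such as exp, log, and sin now map to MKL
y_small = np.exp(small)
y_large = np.log(large + 1.0)    # shift to keep log well defined near zero

# Vector-vector and vector-scalar arithmetic benefits in the same way
z = large * 2.0 + np.sqrt(large)
```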

Memory management optimizations
Update 2 introduces widespread optimizations in NumPy memory management operations. As a dynamic language, Python manages memory for the user. Memory operations, such as allocation, de-allocation, copy, and move, affect performance of essentially all Python programs. 

Specifically, Update 2 ensures that NumPy allocates arrays that are properly aligned in memory on Linux, so that NumPy and SciPy compute functions can benefit from the aligned versions of SIMD memory-access instructions. This is especially relevant for Intel® Xeon Phi™ processors.
The most significant memory-management improvement in Update 2 comes from replacing the original memory copy and move operations with optimized implementations from Intel MKL. The result is improved performance, because these Intel MKL routines are optimized both for a range of Intel CPUs and for multiple CPU cores.
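
A quick way to observe the alignment of a NumPy buffer is shown below; the 64-byte target here is an assumption suited to 512-bit SIMD registers, since the distribution does not expose its alignment policy as an API:

```python
import numpy as np

a = np.empty(10**6, dtype=np.float64)

# Inspect the base address of the underlying buffer; 64-byte alignment
# allows aligned AVX-512 SIMD loads and stores on Intel Xeon Phi processors.
address = a.ctypes.data
print("64-byte aligned:", address % 64 == 0)
```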

Faster Machine Learning with Scikit-learn
Scikit-learn is among the most popular Python machine learning packages. The initial release of the Intel Distribution for Python provided Scikit-learn optimizations via the respective NumPy and SciPy functions accelerated by Intel MKL. Update 2 optimizes select key machine learning algorithms in Scikit-learn, accelerating them with the Intel® Data Analytics Acceleration Library (Intel® DAAL).

Specifically, Update 2 optimizes Principal Component Analysis (PCA), Linear and Ridge Regressions, Correlation and Cosine Distances, and K-Means. Speedups may range from 1.5x to 160x.
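
Because the DAAL acceleration is transparent, existing Scikit-learn code benefits unchanged. A minimal sketch, using random data for illustration, with two of the optimized algorithms:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X = np.random.rand(100000, 50)

# PCA and K-Means are among the algorithms now backed by Intel DAAL
pca = PCA(n_components=10).fit(X)
labels = KMeans(n_clusters=8, random_state=0).fit_predict(X)
```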

Intel-optimized Deep Learning
Deep learning is becoming an essential tool for knowledge discovery. Intel engineers put much effort into optimizing the most popular Deep Learning frameworks. Update 2 incorporates two Intel-optimized Deep Learning frameworks, Caffe* and Theano*, into the distribution so Python users can take advantage of these optimizations out of the box.

Neural network enhancements for pyDAAL
Intel DAAL:
•    Introduces a number of neural network extensions, such as a transposed convolution layer and a reshape layer.
•    Now supports input tensors of arbitrary dimension in the softmax cross-entropy loss layer and the sigmoid cross-entropy criterion, and adds a truncated Gaussian initializer for tensors.
•    Extends support for distributed computing by adding an objective function with pre-computed characteristics.
pyDAAL comes with improved performance for the neural network layers used in topologies such as AlexNet.

Summary
The Intel Distribution for Python is powered by the Anaconda* and conda build infrastructures, which give all Python users the benefit of interoperability between these two environments and access to the optimized packages through a simple conda install command.
Intel Distribution for Python 2017 Update 2 delivers significant performance optimizations for many core algorithms and Python packages while maintaining ease of download and installation.


Update 2 is available for free download at the Intel Distribution for Python website or through the Intel channel at Anaconda.org. 
The Python team at Intel invites you to try it out and email us your feedback.

 

Jacquemier__Jean
Beginner

Dear Sergey,

Regarding the memory management optimizations introduced in Update 2: how do they deal with the different vector sizes of the processor extensions (SSE4, AVX, AVX2, ...)?
Are NumPy arrays systematically aligned on 64 bytes?
Or is the extension checked at run time, so that the data alignment differs for different CPU extensions?
