I decided to try out the Intel Python version. I installed Conda, all the libraries, etc. However, when I run my code it is considerably slower than the Python 3.5 that ships with Ubuntu 16.04 together with NumPy installed via pip. I use joblib to perform cross-validation, and with Intel Python it takes almost 100% more time using 2 cores (2 jobs). Without joblib, Intel Python takes roughly 1.4-1.5x as long as the OS Python. I am running on an Intel® Xeon® CPU E5-1603 0 @ 2.80GHz × 4 with 8GB of memory.
Well, I use an Intel Xeon CPU and I have no idea whether I am using the SSEx instructions, so I am not sure this applies. All I am doing is massive amounts of matrix multiplications.
It is likely you are running into issues of over-subscription. joblib dispatches parts of the work to different processes, each of which calls MKL's GEMM function, which is itself multi-threaded.
By default, MKL uses as many threads as there are cores on your machine. Hence each extra concurrent process spawned by joblib makes your computation create more threads than the processor can service; they contend for resources, and slow-down ensues.
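To make the arithmetic concrete, here is a back-of-the-envelope sketch. The core and job counts are taken from the question above; the MKL default of one thread per core is the documented behavior:

```python
# Thread count for the setup in the question:
# a 4-core Xeon E5-1603, joblib with 2 jobs, MKL at its default thread count.
physical_cores = 4
n_jobs = 2                                # joblib worker processes
mkl_threads_per_process = physical_cores  # MKL default: one thread per core

# Each worker process spins up its own full-width MKL thread pool.
total_threads = n_jobs * mkl_threads_per_process
print(total_threads)  # 8 software threads contending for 4 cores
```

Eight compute-bound threads on four cores means the OS has to time-slice them, and the context switching and cache contention can easily cost more than the parallelism gains.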
There are 3 possible ways to approach the problem while using the Intel Distribution for Python.
1. Disable application-level parallelism, i.e. do not use joblib. This is suboptimal if your application has significant sequential segments (work outside MKL calls that joblib would otherwise run in parallel).
2. Use MKL in sequential mode, by running ``env MKL_THREADING_LAYER=sequential python your_script.py``. This is suboptimal if your application has serial regions (not running under joblib) which use NumPy/MKL, since those will no longer be multi-threaded.
3. Use the TBB package included in the Intel Distribution for Python: ``python -m tbb --ipc your_script.py``. This should achieve the best of both worlds. Alternatively, you could try the SMP package (``conda install -c intel smp``) and run ``python -m smp your_script.py``, which should also mitigate the oversubscription.
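A per-process variant of option 2 is to cap each worker's MKL pool via the ``MKL_NUM_THREADS`` environment variable (read by MKL at startup, and inherited by child processes), so the total thread count equals the number of jobs. A minimal standard-library sketch; ``fold`` here is a hypothetical stand-in for one cross-validation fold that would, in the real application, call into NumPy/MKL:

```python
import os
from concurrent.futures import ProcessPoolExecutor

# Hypothetical stand-in for one cross-validation fold; in the real code
# this would perform the matrix multiplications via NumPy/MKL.
def fold(i):
    return i * i

def main():
    n_jobs = 2
    # Cap each worker's MKL pool to one thread *before* the workers start,
    # so the run uses n_jobs threads total instead of n_jobs * cores.
    os.environ["MKL_NUM_THREADS"] = "1"
    with ProcessPoolExecutor(max_workers=n_jobs) as pool:
        return list(pool.map(fold, range(4)))

if __name__ == "__main__":
    print(main())  # [0, 1, 4, 9]
```

This trades MKL's internal parallelism for joblib-style process parallelism, which tends to pay off when the folds themselves are the dominant source of concurrency.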
Please let us know if you run into further issues.