Intel® oneAPI Math Kernel Library

Sum along specific matrix axis

Zhang__Hao — Fri, 09 Nov 2018 16:42:59 GMT

I am working on a project where I want to accelerate NumPy element-wise multiplication and summation.

My approach is to pass the NumPy array to C as a pointer and call MKL functions on it (through Cython).

For element-wise multiplication I found the vdMul function. However, for summation I could not find a suitable MKL function that sums a matrix along a specific axis and returns a smaller matrix.

Example:

input: matrix A, shape is [100,200,300]

B = sum(A, axis = 0)

B shape is [200,300]

Could anyone give some advice? Thank you very much!
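For reference, the reduction in the example above can be sketched in plain C. This is a minimal sketch, not an MKL routine: `sum_axis0` is a hypothetical helper name, and it assumes the NumPy array is C-contiguous (row-major) with dtype float64, so that the pointer handed over from Cython addresses `n0 * n1 * n2` doubles laid out slice after slice.

```c
#include <assert.h>
#include <stddef.h>

/* Sum a C-contiguous array of shape [n0, n1, n2] along axis 0.
 * The result b holds n1 * n2 elements, i.e. shape [n1, n2].
 * sum_axis0 is a hypothetical helper name, not part of MKL. */
void sum_axis0(const double *a, size_t n0, size_t n1, size_t n2, double *b)
{
    assert(a != NULL && b != NULL);
    const size_t m = n1 * n2;   /* number of elements in one axis-0 slice */

    for (size_t j = 0; j < m; ++j)
        b[j] = 0.0;

    /* Outer loop over slices, inner loop over contiguous memory,
     * so every access in the hot loop is stride 1. */
    for (size_t i = 0; i < n0; ++i) {
        const double *slice = a + i * m;
        for (size_t j = 0; j < m; ++j)
            b[j] += slice[j];
    }
}
```

For the shapes in the example, the call would be `sum_axis0(A, 100, 200, 300, B)` with `B` pointing at 200 * 300 doubles.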


Gennady_F_Intel — Sat, 10 Nov 2018 04:09:37 GMT

It may make sense to try IDP (Intel Distribution for Python), which will probably help you see performance benefits without changing the original Python code.


Zhang__Hao — Sun, 25 Nov 2018 22:52:57 GMT

Gennady F. (Intel) wrote:
It may make sense to try IDP (Intel Distribution for Python), which will probably help you see performance benefits without changing the original Python code.

I have tested IDP and found that numpy sum runs at almost the same speed as in the original Python distribution. In my tests, both are single-threaded. By contrast, for numpy multiply, which performs element-wise multiplication, the IDP version uses 4 threads on my PC (i7-6700HQ), while the original Python uses only 1 thread.

Since numpy sum is single-threaded, my goal is to fully optimise it with multithreading. Do you have any other recommendations? Thanks very much!


TimP — Mon, 26 Nov 2018 09:56:43 GMT

MKL doesn't include plain sum functions because, in the usual cases, there is no way to improve on the performance of optimized compiled C or Fortran code. Multi-threading would improve performance only where you have multiple memory controllers (a multi-CPU platform) and have taken care to avoid remote memory access, by summing only along the stride-1 extent of the matrix and keeping the largest-stride extents consistently local to a single memory controller (CPU). This is probably not a sufficiently practical use case to justify support in MKL, and it would be no more difficult to handle in your own compiled C or Fortran code than it would be with an MKL function.
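As an illustration of what such compiled code could look like, here is a hedged sketch of a threaded axis-0 sum using OpenMP. The function name `sum_axis0_blocked` and the `BLOCK` size are assumptions made for this example and are untuned. The array is treated as `n0` contiguous slices of `m` doubles each (for shape [100,200,300], `n0 = 100` and `m = 200 * 300`). Each thread owns a contiguous block of the output, so no two threads write the same element, and the innermost loop stays stride 1. Whether this runs faster than the serial loop depends on memory bandwidth and data placement, as described above.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK 4096  /* output elements per work item; arbitrary, untuned */

/* Threaded sum along axis 0 of a C-contiguous array viewed as [n0, m].
 * sum_axis0_blocked is a hypothetical helper name, not part of MKL.
 * The pragma is ignored (serial execution) unless OpenMP is enabled,
 * e.g. with -fopenmp on GCC or Clang. */
void sum_axis0_blocked(const double *a, size_t n0, size_t m, double *b)
{
    assert(a != NULL && b != NULL);
    const long nblocks = (long)((m + BLOCK - 1) / BLOCK);

    #pragma omp parallel for schedule(static)
    for (long blk = 0; blk < nblocks; ++blk) {
        const size_t lo = (size_t)blk * BLOCK;
        const size_t hi = (lo + BLOCK < m) ? lo + BLOCK : m;

        /* This block of b is written by exactly one thread. */
        for (size_t j = lo; j < hi; ++j)
            b[j] = 0.0;

        for (size_t i = 0; i < n0; ++i) {
            const double *row = a + i * m;
            for (size_t j = lo; j < hi; ++j)   /* stride-1 access */
                b[j] += row[j];
        }
    }
}
```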