I am working on a project where I want to accelerate NumPy element-wise multiplication and summation.
My approach is to pass the NumPy array to C as a pointer and call MKL functions on it (through Cython).
For element-wise multiplication I found the vdMul function. However, for summation I could not find a suitable function
in MKL that sums a matrix along a specific axis and returns a smaller matrix.
Example:
input: matrix A, shape is [100,200,300]
B = sum(A, axis = 0)
B shape is [200,300]
Could anyone give some advice? Thank you very much!
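
For reference, the reduction described above can be sketched in NumPy as follows. The first function is the plain `sum` call from the example. The second is only one possible workaround, not something suggested in this thread: it rewrites the axis-0 sum as a vector-matrix product, on the assumption that a BLAS-backed NumPy build may then hand the work to its BLAS library. The function names are hypothetical, and whether this is actually faster depends on the build and the hardware.

```python
import numpy as np


def sum_axis0_reference(a):
    """Plain NumPy reduction over the first axis."""
    return a.sum(axis=0)


def sum_axis0_via_dot(a):
    """Sum over axis 0 expressed as ones(n) @ a.reshape(n, -1).

    Assumption: `a` has a floating-point dtype, so the product
    keeps the same dtype as the plain sum.
    """
    n = a.shape[0]
    flat = a.reshape(n, -1)            # shape (n, 200*300) for the example
    ones = np.ones(n, dtype=a.dtype)   # row vector of ones
    return (ones @ flat).reshape(a.shape[1:])


rng = np.random.default_rng(0)
A = rng.random((100, 200, 300))
B = sum_axis0_via_dot(A)
print(B.shape)  # (200, 300)
```

Both functions return an array of shape [200,300] for the example input, and the results agree up to floating-point rounding.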
It may make sense to try IDP (Intel Distribution for Python), which will probably let you see performance benefits without changing the original Python code.
Gennady F. (Intel) wrote: It may make sense to try IDP (Intel Distribution for Python), which will probably let you see performance benefits without changing the original Python code.
I have tested IDP and found that NumPy sum runs at almost the same speed as in the original Python. In my tests both are single-threaded. By contrast, for the NumPy function multiply, which performs element-wise multiplication, the IDP version uses 4 threads on my PC (i7-6700HQ), while the original Python uses only 1 thread.
Since NumPy sum is single-threaded, my original goal is to optimise it fully with multithreading. Do you have any other recommendations? Thanks very much!
MKL doesn't include plain sum functions because, in the usual cases, there is no way to improve on the performance of optimized compiled C or Fortran code. Multi-threading would improve performance only where you have multiple memory controllers (a multi-CPU platform) and have taken care to avoid remote memory access, by summing only along the stride-1 extent of the matrix and keeping the largest-stride extents consistently local to a single memory controller (CPU). This is probably not a sufficiently practical use case to justify supporting in MKL, and it would be no more difficult to support in your own C or Fortran code than with an MKL function.
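
To try the threading idea without leaving Python, a minimal sketch could look like the one below. It is an illustration under stated assumptions, not a recommendation from the reply above: `parallel_sum_axis0` and `n_threads` are hypothetical names, the input is assumed to be a floating-point array, and the approach relies on NumPy releasing the GIL during the reduction. Each thread sums its own block of the output, so no two threads write to the same memory. As the reply notes, the reduction is typically limited by memory bandwidth, so a speedup is not guaranteed.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor


def parallel_sum_axis0(a, n_threads=4):
    """Sum `a` over axis 0, splitting axis 1 across worker threads.

    Assumption: `a` is a floating-point array, so the output can
    share its dtype.
    """
    out = np.empty(a.shape[1:], dtype=a.dtype)
    # Block boundaries along axis 1; each worker owns out[lo:hi].
    bounds = np.linspace(0, a.shape[1], n_threads + 1, dtype=int)

    def work(i):
        lo, hi = bounds[i], bounds[i + 1]
        # Reduce this block over axis 0 and write into its output slice.
        np.sum(a[:, lo:hi], axis=0, out=out[lo:hi])

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # list() forces completion and re-raises any worker exception.
        list(pool.map(work, range(n_threads)))
    return out


rng = np.random.default_rng(0)
A = rng.random((100, 200, 300))
B = parallel_sum_axis0(A)
print(B.shape)  # (200, 300)
```

Timing this against `A.sum(axis=0)` on the target machine is the only reliable way to tell whether the extra threads help.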
