Which algorithm and parallelization strategy does mkl_?csrmultcsr use? Does this function scale well on the Intel Xeon Phi architecture?
mkl_?csrmultcsr uses a simple parallelization strategy (row-wise over the first matrix).
Scaling depends heavily on the matrix structures and sizes, and scalability is normally bound by the bandwidth of the memory subsystem.
We would expect to see better performance on Xeon Phi vs. Xeon, but scalability would normally be quite modest as memory bandwidth is exhausted pretty quickly.
--Vipin
I suppose that a chunk of rows of the first matrix is assigned to a thread. Is it possible to set chunk size and OpenMP's scheduling policy? Are there any other user-supplied parameters to reduce run time of mkl_?csrmultcsr on MIC? (Parameters other than OMP_NUM_THREADS and KMP_AFFINITY)
mkl_?csrmultcsr uses the following parallelization strategy for non-transposed multiplication of two CSR matrices: the first matrix is divided into chunks with a roughly equal number of rows, and each chunk is assigned to a thread. Since the number of chunks equals the number of threads, the chunk size cannot be set by the user from outside MKL.
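To make the row-wise split concrete, here is a minimal sketch in C of one way to divide rows into as many contiguous chunks as there are threads. It is an illustration only, not MKL's actual code: `row_range`, `chunk_for_thread`, and `chunk_nnz` are hypothetical names, and the exact rounding MKL applies to "more or less equal" chunks is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Half-open row range [begin, end) owned by one thread (hypothetical type). */
typedef struct {
    size_t begin;
    size_t end;
} row_range;

/* Split nrows rows into nthreads contiguous chunks whose sizes differ
 * by at most one row. The first (nrows % nthreads) chunks get one extra
 * row. This rounding rule is an assumption, not MKL's documented rule. */
static row_range chunk_for_thread(size_t nrows, size_t nthreads, size_t tid)
{
    assert(nthreads > 0 && tid < nthreads);
    size_t base = nrows / nthreads;   /* rows every chunk receives */
    size_t extra = nrows % nthreads;  /* leftover rows to spread out */
    row_range r;
    r.begin = tid * base + (tid < extra ? tid : extra);
    r.end = r.begin + base + (tid < extra ? 1 : 0);
    return r;
}

/* Number of stored entries of the first matrix that fall into a chunk,
 * read from a zero-based CSR row-pointer array ia of length nrows + 1. */
static size_t chunk_nnz(const size_t *ia, row_range r)
{
    return ia[r.end] - ia[r.begin];
}
```

Calling `chunk_for_thread(nrows, nthreads, tid)` for each thread index gives that thread's rows. Comparing `chunk_nnz` across chunks is a rough proxy for how evenly the work is spread: chunks with equal row counts can hold very different numbers of nonzeros, which may be one reason scaling depends on the sparsity pattern.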
Could you please describe your use case for this function: matrix sizes, sparsity pattern, input parameters, etc.?
We can take a look at the test case and then suggest possible steps for further tuning.
--Vipin
