Regarding "The Intel® Math Kernel Library Sparse Matrix Vector Multiply Format Prototype Package", I have two questions:

1) The use of a temporary array for each thread may not pay off as the number of threads increases, especially on Xeon Phi. Does the package solve this problem efficiently? In which call is memory allocated for the temporary arrays: the sparseCreateESBMatrix(), sparseDcsr2esb(), or sparseDesbmv() function?

2) How is the reduction of the temporary arrays performed on Xeon Phi? Is there an architecture-specific approach? As far as I can tell, it is not mentioned in the paper describing the package.

Kadir,

One of our MKL team members is looking at your question and will get back to you soon.

Regards

--

Taylor

Hi Kadir,

The package performs sparse matrix-vector multiplication for the general case only, i.e. y = alpha * A * x + beta * y, so this case requires almost no synchronization between threads. For the ESB functionality we need to allocate the following arrays in the sparseDcsr2esb() function: the arrays for the matrix representation in ESB format, plus some arrays for workload balancing and synchronization. The sizes of these arrays depend on the sparse matrix structure and do not depend on the number of threads.

Because synchronization may be required only between consecutive threads, the SpMV functionality should demonstrate good scalability on Intel Xeon Phi.

Could you please clarify your questions about temporary arrays?

Regards, Sergey

The second paragraph of Subsection 4.3, entitled "SpMV Kernel with ESB Format", of the paper describing the package reads:

"In practice, we also parallelize the SpMV operation across column blocks. Here, we use a temporary copy of y for each block and use a reduce operation across these temporary copies at the end of the computation."

The caption of Algorithm 2 reads:

"Algorithm 2: Multiply the i-th column block of matrix in ESB format"

As far as I understand, there is a synchronization point after the multiplication of the column blocks; after that point, the per-block temporary copies of y undergo a reduction operation.

Hi Kadir,

A matrix in ESB format is stored in slices: each slice consists of 8 rows and is stored in ELLPACK format. Each thread computes its update to the output vector y in the form of a vector register variable of 512-bit length. So when several threads compute parts of the same slice (parallelizing across column blocks), they only need to synchronize updates to these small variables, and we do not need to allocate a lot of additional memory for every thread.

Regards, Sergey

How are the synchronization and atomic updates implemented? The performance of #pragma omp atomic is much too low. Are there more efficient approaches to synchronization and atomic updates when there are more than 200 threads on an Intel Xeon Phi? I could not find any information about the implementation of the synchronization and atomic updates in the paper.
