Turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

- Intel Community
- Software
- Software Development SDKs and Libraries
- Intel® oneAPI Math Kernel Library
- omp threading of sparse matrix operation does not yield expected seed up. reason?

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page

may_ka

Beginner

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

09-18-2018
07:22 PM

39 Views

omp threading of sparse matrix operation does not yield expected seed up. reason?

Hi all,

I have implemented a kind of a gauss seidel algorithm for a sparse upper triangular coefficient matrix, where multiplication of a vector with a the sparse matrix (forward propagation) is a core feature. For some application I have to do this process for several independent vectors (memory location), where the spares coefficient matrix remains unchanged (read only access). Therefore I thought this process might be executable in parallel depending on the number of vectors. I expected an almost linear increase in speed as long as the number of vectors does not exceed the number of cores. However, I observed only a marginal increase in speed. Here is the code:

Module mod_tmp Type :: csrmatrixuppertri integer :: nrows, ncols integer, allocatable :: colpos(:), rowpos(:) real, allocatable :: value(:) end type csr contains Subroutine SubGlobal(a,c) implicit none Type(csr), intent(in) :: c Real, Intent(inout) :: a(:,:) Integer :: isnt, c1 !$ isnt=minval((/omp_get_max_threads(),size(a,2)/)) !$OMP PARALLEL DO PRIVATE(c1) num_threads(isnt) Do c1=1,size(a,2) call sublocal(a=a(:,c1),c=c) End Do !$OMP END PARALLEL DO End Subroutine SubGlobal Subroutine SubLocal(av,c) Implicit none Type(csr), intent(in) :: c Real, Intent(inout) :: av(:) Integer :: c2, offdiagstart, offdiagend, diag Do c2=1,c%nrows !!get locations in csr diag=c%rowpos(c2) offdiagstart=c%rowpos(c2)+1 offdiagend=c%rowpos(c2+1)-1 !!calculate a new value av(c2)=av(c2)/c%value(diag) !!forward propagate av(c%colpos(offdiagstart:offdiagend))=& &av(c%colpos(offdiagstart:offdiagend))+& &c%value(offdiagstart:offdiagend)*(av(c2)) End do End Subroutine SubLocal End Module mod_tmp

Threading sets in in "SubGlobal" by calling "SubLocal" as many times as "a" as columns, operations are carried out in "SubLocal". All writeable memory in "SubLocal" is contiguous and should accessible only be one thread. The csr matrix should be public to all threads, but only readable.

Using a matrix "a" of dimension "a(c%nrows,10), I am wondering why I don't see a decrease in processing time to roughly 1/10 when going from single core to 10 cores (with 36 cores being available). I suspect that the shared read is hampering the speed, but I might be wrong. Giving every thread a local copy of "c" is not feasible due to memory limitations (c is actually >100GB).

Any suggestions are highly appreciated.

Cheers

Link Copied

0 Replies

Topic Options

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page

For more complete information about compiler optimizations, see our Optimization Notice.