Why is the parallel runtime of my matrix multiply slower than its serial runtime? Please tell me. Thank you.
void matrixp(int a[N][N], int b[N][N], int c[N][N])   /* N: matrix dimension */
{
    int i, j, k, sum = 0;
    #pragma omp parallel private(i,j,k) // reduction(+:sum)
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
        {
            sum = 0;
            //c[i][j] = 0;
            #pragma omp for reduction(+:sum)
            for (k = 0; k < N; k++)
            {
                sum += a[i][k] * b[k][j];
                c[i][j] = sum;
            }
        }
}
The usual slogans apply here (from 20 years ago: "concurrent outer, vector inner"). You would organize the loops so as to use SIMD instructions in the inner loops, as an optimizing compiler would attempt to do in the absence of OpenMP directives, while applying threading to the outermost loop, with the threads working on data that are several cache lines apart. An omp reduction is definitely not an optimization for this purpose, even if you have chosen data types for which there are no suitable SIMD instructions.
You would want at least to attempt improvement on the code you would get from icc -O3 -parallel operating on source such as the libgfortran matmul, perhaps looking at both float and int data types. Note that the libgfortran code expects you to substitute a netlib threaded BLAS equivalent once the problem size exceeds the threshold at which threading becomes useful.
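For illustration, here is a minimal sketch of "concurrent outer, vector inner" applied to a plain C matrix multiply; the names (matmul_outer, N, a, b, c) are assumptions for the example, not code from any post in this thread. Threading goes on the outermost i loop, and the innermost loop is a unit-stride update that a compiler can vectorize without any OpenMP reduction.
[cpp]#include <omp.h>

#define N 1024   /* illustrative matrix dimension */

/* Sketch only: thread the outer loop, leave the inner loop to the vectorizer. */
void matmul_outer(const int a[N][N], const int b[N][N], int c[N][N])
{
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++)
            c[i][j] = 0;
        /* i-k-j order: the inner j loop touches b and c with unit stride */
        for (int k = 0; k < N; k++) {
            int aik = a[i][k];            /* reused across the whole j loop */
            for (int j = 0; j < N; j++)
                c[i][j] += aik * b[k][j];
        }
    }
}[/cpp]
Each thread owns whole rows of c, so there is no sharing between threads and no per-element synchronization, which is where the reduction-in-the-inner-loop version loses its time.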
Quoting - tim18
      tmpmat= transpose(vy(:25,:25))
c$omp parallel do private(j,i,tmp,tmp1)
      do j= 1,n
        i= 1
        if(iand(25,1) == 1)then
          px(1,j)= px(1,j)+dot_product(tmpmat(:,1),cx(:,j))
          i= 2
        endif
        do i= i,24,2
          tmp= 0
          tmp1= 0
          do k= 1,25
            tmp= tmp+tmpmat(k,i)*cx(k,j)
            tmp1= tmp1+tmpmat(k,i+1)*cx(k,j)
          enddo
          px(i,j)= px(i,j)+tmp
          px(i+1,j)= px(i+1,j)+tmp1
        enddo
      enddo
Tim,
I think you have a cut-and-paste problem:
if(iand(25,1) == 1)then
is a constant expression that is always true.
Jim Dempsey
Tim,
I think transposition of larger matrices will work well too, provided you do not transpose the entire matrix at once.
I.e., transposing the number of columns that fit within the SSE registers may have merit (2 for REAL(8) and 4 for REAL(4), and double that when AVX comes out). This would be true if you can pipeline the transposition with the now-vectorized multiplication. The transposition can be done using the integer instruction set while the multiplication uses the SSE FP path. On HT systems this might show additional improvement.
Jim Dempsey
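A rough C sketch of this idea, under assumed names (N, PANEL, a, b, c are illustrative, N is assumed to be a multiple of PANEL, and the integer-unit/FP-unit pipelining and HT effects mentioned above are not modeled): transpose only a few columns of b at a time into a small buffer, then consume that panel immediately so both operands of the dot products are unit stride.
[cpp]#define N     1024   /* illustrative matrix dimension */
#define PANEL 4      /* e.g. 4 floats per SSE register */

/* Sketch only: transpose PANEL columns of b, multiply, move on. */
void matmul_panel(const float a[N][N], const float b[N][N], float c[N][N])
{
    float bt[PANEL][N];                        /* transposed panel of b */

    for (int j0 = 0; j0 < N; j0 += PANEL) {
        /* small transpose: PANEL columns of b become PANEL rows of bt */
        for (int k = 0; k < N; k++)
            for (int j = 0; j < PANEL; j++)
                bt[j][k] = b[k][j0 + j];

        /* both a[i][*] and bt[j][*] are now unit stride */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < PANEL; j++) {
                float sum = 0.0f;
                for (int k = 0; k < N; k++)
                    sum += a[i][k] * bt[j][k];
                c[i][j0 + j] = sum;
            }
    }
}[/cpp]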
Jim,
"if(iand(25,1) == 1)then
is a constant expression that is always true"
Yes, I put this in just so I don't forget that this line is conditional on the size of the matrix.
Thanks,
Tim
"if(iand(25,1) == 1)then
is a constant expression that is always true"
Yes, I put this in just so I don't forget that this line is conditional on the size of the matrix.
Thanks,
Tim
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
This code works well:
Just place your matrices as parameters.
[cpp]void OpenMPMatrixMultiply()
{
    int i, j, k;
    #pragma omp parallel for private(j, k)
    for (i = 0; i < size1; i++)
    {
        for (j = 0; j < size3; j++)
        {
            int partial = 0;
            for (k = 0; k < size2; k++)
            {
                partial += matrix1[i][k] * matrix2[k][j];
            }
            result1[i][j] += partial;
        }
    }
}[/cpp]
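One way to "place your matrices as parameters", sketched with assumed names and a flat row-major layout (the pointer parameters and sizes below are not Tudor's exact code):
[cpp]void OpenMPMatrixMultiply(const int *matrix1, const int *matrix2, int *result1,
                          int size1, int size2, int size3)
{
    /* matrix1 is size1 x size2, matrix2 is size2 x size3, result1 is size1 x size3,
       all stored row-major; result1 is fully overwritten. */
    #pragma omp parallel for
    for (int i = 0; i < size1; i++)
        for (int j = 0; j < size3; j++) {
            int partial = 0;
            for (int k = 0; k < size2; k++)
                partial += matrix1[i * size2 + k] * matrix2[k * size3 + j];
            result1[i * size3 + j] = partial;
        }
}[/cpp]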
Quoting - tim18
do k= 1,25
tmp= tmp+tmpmat(k,i)*cx(k,j)
tmp1= tmp1+tmpmat(k,i+1)*cx(k,j)
enddo
Compilers don't like to vectorize this, at least not with the odd leading dimension, apparently due to the inconsistent alignments of the array sections. This produces the curious result that the code I show is faster than the vectorized code,
tmp=dot_product(tmpmat(:,i),cx(:,j))
on Core 2. Register re-use apparently outweighs lack of vectorization, with multi-threading.
Even on the Core i7, the vector code doesn't come out ahead when running 4 threads. On that platform, the compiler shouldn't be worrying about the alignments, yet it still reserves vectorization for the single dot product.
I wrote an SSE2 intrinsics version, but it's not worth the trouble.
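The register-reuse effect can be sketched in C as well (the sizes and names below only loosely mirror the Fortran snippet, and an even column count is assumed, whereas the Fortran version peels the odd column first): two accumulators share every loaded cx value, so each load feeds two multiply-adds.
[cpp]enum { K = 25, COLS = 24 };   /* illustrative sizes */

/* Sketch only: update two output elements per pass over cx. */
void two_column_update(const float tmpmat[K][COLS], const float cx[K], float px[COLS])
{
    for (int i = 0; i < COLS; i += 2) {
        float tmp = 0.0f, tmp1 = 0.0f;
        for (int k = 0; k < K; k++) {
            float x = cx[k];               /* loaded once, used twice */
            tmp  += tmpmat[k][i]     * x;
            tmp1 += tmpmat[k][i + 1] * x;
        }
        px[i]     += tmp;
        px[i + 1] += tmp1;
    }
}[/cpp]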
Quoting - tim18
Tim,
When the first dimension is even (real*8) or a multiple of 4 (real*4), and tmpmat is aligned to a cache line, is the SSE version of the loop you posted faster or slower?
Jim Dempsey
Quoting - Tudor
Thank you, it works well...
