zhangzhe65
Beginner

matrix multiply parallel programming

Why is the runtime of my parallel matrix multiply slower than its serial runtime? Please tell me. Thank you.

/* Loop bounds and array subscripts below were mangled by the forum
   software; reconstructed assuming square N x N matrices. */
void matrixp(int a[N][N], int b[N][N], int c[N][N])
{
    int i, j, k, sum = 0;
#pragma omp parallel private(i, j, k) // reduction(+:sum)
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
        {
            sum = 0;
            // c[i][j] = 0;
#pragma omp for reduction(+:sum)
            for (k = 0; k < N; k++)
            {
                sum += a[i][k] * b[k][j];
                c[i][j] = sum;
            }
        }
}
9 Replies
TimP
Black Belt

The usual slogans apply here (from 20 years ago: "concurrent outer, vector inner"). Organize the loops so that the inner loops use SIMD instructions, as an optimizing compiler would attempt to do in the absence of OpenMP directives, while applying threading to the outermost loop so that the threads work on data several cache lines apart. An omp reduction on the inner loop is definitely not an optimization for this purpose, even if you have chosen data types for which there are no suitable SIMD instructions.
At a minimum, you would want to try to improve on the code you get from icc -O3 -parallel applied to source such as the libgfortran matmul, perhaps looking at both float and int data types. Note that that code expects you to substitute a netlib threaded BLAS equivalent once the problem size exceeds the threshold at which threading becomes useful.
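A minimal C sketch of that organization (hypothetical names; assuming square N x N int matrices): threading goes on the outermost i loop, the innermost j loop is unit-stride so a compiler can vectorize it, and no omp reduction is needed:

```c
#define N 256

static int a[N][N], b[N][N], c[N][N];

/* "Concurrent outer, vector inner": each thread owns whole rows of c,
 * so there is no shared accumulator; the inner j loop walks c[i] and
 * b[k] contiguously, which is the access pattern SIMD wants. */
void matmul_outer_parallel(void)
{
#pragma omp parallel for
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++)
            c[i][j] = 0;
        for (int k = 0; k < N; k++) {
            int aik = a[i][k];               /* invariant across the j loop */
            for (int j = 0; j < N; j++)      /* unit stride: vectorizable  */
                c[i][j] += aik * b[k][j];
        }
    }
}
```

Built without -fopenmp the pragma is ignored and the code runs serially, which makes the threading speedup easy to measure.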
TimP
Black Belt

Here is an example from Livermore Fortran Kernels using transposition of one of the operands, enabling unrolled and jammed vectorized dot products. It may exceed performance of MKL DGEMM, when the transposed matrix is small enough that f90 transpose is efficient:

      tmpmat = transpose(vy(:25,:25))
c$omp parallel do private(j,i,tmp,tmp1)
      do j = 1, n
        i = 1
        if (iand(25,1) == 1) then
          px(1,j) = px(1,j) + dot_product(tmpmat(:,1), cx(:,j))
          i = 2
        endif
        do i = i, 24, 2
          tmp = 0
          tmp1 = 0
          do k = 1, 25
            tmp = tmp + tmpmat(k,i)*cx(k,j)
            tmp1 = tmp1 + tmpmat(k,i+1)*cx(k,j)
          enddo
          px(i,j) = px(i,j) + tmp
          px(i+1,j) = px(i+1,j) + tmp1
        enddo
      enddo
jimdempseyatthecove
Black Belt


Tim,

I think you have a cut-and-paste problem:

if(iand(25,1) == 1)then

is a constant expression that is always true.

Jim Dempsey


jimdempseyatthecove
Black Belt


Tim,

I think transposition of larger matrices will work well too, provided you do not transpose the entire matrix at once.
I.e., transposing only the number of columns that fit within the SSE registers may have merit (2 for REAL(8) and 4 for REAL(4), and double that when AVX comes out). This would be true if you can pipeline the transposition with the now-vectorized multiplication. The transposition can be done using the integer instruction set while the multiplication uses the SSE FP path. On HT systems this might show additional improvement.
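A rough C sketch of this strip-wise scheme (illustrative names and sizes; STRIP stands in for the number of columns that fit in the SIMD registers): transpose a narrow strip of B into a scratch buffer that stays in cache, then multiply against it with unit-stride inner loops:

```c
#define N 64
#define STRIP 4   /* columns transposed per pass; Jim suggests matching SIMD width */

void matmul_strip(double A[N][N], double B[N][N], double C[N][N])
{
    double bt[STRIP][N];                         /* transposed strip of B */
    for (int j0 = 0; j0 < N; j0 += STRIP) {
        /* Transpose columns j0 .. j0+STRIP-1 of B; the strided reads are
         * confined to this small step rather than the whole matrix. */
        for (int j = 0; j < STRIP; j++)
            for (int k = 0; k < N; k++)
                bt[j][k] = B[k][j0 + j];
        /* Both operands of the dot product are now unit stride. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < STRIP; j++) {
                double sum = 0.0;
                for (int k = 0; k < N; k++)
                    sum += A[i][k] * bt[j][k];
                C[i][j0 + j] = sum;
            }
    }
}
```

The pipelining Jim describes (integer unit doing the next transpose while the FP unit multiplies) is not shown here; this only illustrates the strip-at-a-time data layout.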

Jim Dempsey
TimP
Black Belt

Jim,
"if(iand(25,1) == 1)then

is a constant expression that is always true"
Yes, I put this in just so I don't forget that this line is conditional on the size of the matrix.
Thanks,
Tim
Tudor
New Contributor I

This code works well:

[cpp]void OpenMPMatrixMultiply()
{
    int i, j, k;

#pragma omp parallel for private(j, k)
    for (i = 0; i < size1; i++)
    {
        for (j = 0; j < size3; j++)
        {
            int partial = 0;
            for (k = 0; k < size2; k++)
            {
                // subscripts restored; they were lost in the forum formatting
                partial += matrix1[i][k] * matrix2[k][j];
            }
            result1[i][j] += partial;
        }
    }
}[/cpp]

Just place your matrices as parameters.
TimP
Black Belt

Quoting - tim18

do k= 1,25
tmp= tmp+tmpmat(k,i)*cx(k,j)
tmp1= tmp1+tmpmat(k,i+1)*cx(k,j)
enddo

Compilers don't like to vectorize this, at least not with the odd leading dimension, apparently due to the inconsistent alignments of the array sections. This produces the curious result that the code I show is faster than the vectorized code,
tmp=dot_product(tmpmat(:,i),cx(:,j))
on Core 2. Register re-use apparently outweighs the lack of vectorization when multi-threading.
Even on the Core i7, the vector code doesn't come out ahead when running 4 threads. On that platform, the compiler shouldn't be worrying about the alignments, yet it still reserves vectorization for the single dot product.
I wrote an SSE2 intrinsics version, but it's not worth the trouble.
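For C readers, the two-accumulator (unroll-and-jam) pattern in the Fortran above looks roughly like this (a sketch, not the measured code; names follow the Fortran snippet, indices are 0-based):

```c
#define M 25   /* dot-product length, as in the Fortran kernel */

/* Two independent accumulators over adjacent rows: each cx_col[k] is
 * loaded once and used twice, which is the register re-use that made
 * this beat the single vectorized dot_product on Core 2. */
void two_col_dots(double tmpmat[][M], double cx_col[M],
                  int i, double *out0, double *out1)
{
    double tmp = 0.0, tmp1 = 0.0;
    for (int k = 0; k < M; k++) {
        tmp  += tmpmat[i][k]     * cx_col[k];
        tmp1 += tmpmat[i + 1][k] * cx_col[k];
    }
    *out0 = tmp;
    *out1 = tmp1;
}
```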
jimdempseyatthecove
Black Belt


Tim,

When the first dimension is even (REAL(8)) or a multiple of 4 (REAL(4)), and tmpmat is aligned to a cache line, is the SSE version of the loop you posted faster or slower?

Jim Dempsey
mahmoudgalal1985
Beginner

Quoting - Tudor
Thank you, it works well.