Optimal Matrix Operations

mohanmuthu · ‎07-19-2012

I work with huge data, where matrix multiplication too involved. Input matrices are sized5 million x 10 & 10 x 100. I tried to do the MATMUL operation as a whole, but faced the stack size error.So, nowperforming insteps of 20,000. Can any one suggest the faster way of doing it?

TimP · ‎07-19-2012

If you tried the opt-matmul implicit invocation of MKL with threaded parallelism, you might need to adjust global stack (/link /stack) and thread stack (KMP_STACKSIZE). ifort MATMUL usually allocates another matrix for the result, even in cases where that might be avoided by calling ?gemm directly.
If you don't want to use 64-bit (Intel64) mode, splitting the problem may be required.

John_Campbell · ‎07-19-2012

I wrote a sample program to test a number of alternative approaches to MATMUL. They should produce different run time results if you have limited physical memory. On my PC they all worked and showed no problems, including the use of MATMUL.
If memory is limited, check the size of your paging file.

[bash]! program to test problem, in smaller memory ! real*8, allocatable, dimension(:,:) :: A, B, C integer*4 l,m,n, stat, i,j, to_mb real*8 e, dt external dt ! write (*,99) 'Start test', dt() l = 5000000 m = 10 n = 100 ! to_mb = 1024*1024/kind(C) allocate ( C(l,n), stat=stat ) ; write (*,98) 'C allocated',stat, dt(), ' size C', size(C) / to_mb,' mb' allocate ( A(l,m), stat=stat ) ; write (*,98) 'A allocated',stat, dt(), ' size A', size(A) / to_mb,' mb' allocate ( B(m,n), stat=stat ) ; write (*,98) 'B allocated',stat, dt(), ' size B', size(B) / to_mb,' mb' 98 format (a,i5,f10.3,a,i6,a) ! A = 1 B = 2 ! call matrix_multiply (A, B, C, l,m,n) ! C = MatMul (A, B) write (*,99) 'Use intrinsic MATMUL function ', dt() 99 format (a,f10.3) ! e = 0 do j = 1,n do i = 1,l e = e + abs(c(i,j)-20.0d0) end do end do write (*,97) 'error = ',e, dt() 97 format (13x,a,es10.3,f10.3) end subroutine matrix_multiply (A, B, C, l,m,n) ! ! subroutine to multiply = x ! ! uses different loop order to control memory access ! integer*4 l, m, n real*8 A(l,m), B(m,n), C(l,n) real*8, allocatable, dimension(:,:) :: at real*8, allocatable, dimension(:) :: row ! integer*4 i,j,k, stat, to_mb real*8 s, dt, e external dt ! ! initialise C to zero so order of do... ) e = 0 do j = 1,n do i = 1,l s = 0 do k = 1,m s = s + A(i,k) * B(k,j) end do C(i,j) = s e = e + abs(c(i,j)-20.0d0) end do end do write (*,99) 'modified order to sequentially cycle through ', dt() write (*,97) 'error = ',e ! ! Use transpose to sequentially cycle through ' and ! modified order to sequentially cycle through and ' ! allocate (AT(m,l), stat=stat) do k = 1,m do i = 1,l AT(k,i) = A(i,k) end do end do to_mb = 1024*1024/kind(C) write (*,98) 'AT allocated',stat, dt(), ' size AT', size(AT) / to_mb,' mb' 98 format (a,i5,f10.3,a,i6,a) ! e = 0 do j = 1,n do i = 1,l C(i,j) = dot_pr...'", dt() write (*,97) 'error = ',e deallocate (AT) ! ! Conventional order, but use only row of A for sequential access of ' ! has problem that it does not sequentially access allocate (row(m)) ! row(10) e = 0 do i = 1,l row = A(i,:) do j = 1,n C(i,j) = dot_product ( row, B(:,j)) e = e + abs(c(i,j)-20.0d0) end do end do write (*,99) 'Conventional order, but use only row of A '...