Intel® Fortran Compiler

3-D arrays and matrix multiplications

OP1
New Contributor III

This probably seems a bit simple...

I am wondering what the most efficient way is to store a list of matrices (say, 50 matrices of size 1000x1000) and multiply them by another list of matrices. The multiplication is supposed to be performed by a subroutine.

Here is a pseudo code to illustrate what I want to achieve:

PROGRAM TEST
IMPLICIT NONE
INTEGER(4) N,N_MATRICES
REAL(8),ALLOCATABLE :: LIST_A(:,:,:),LIST_B(:,:,:),LIST_C(:,:,:)
! Allocate and initialize LIST_A(N,N,N_MATRICES), LIST_B(N,N,N_MATRICES)
! Allocate LIST_C(N,N,N_MATRICES)
CALL MULTIPLY_MATRICES(N,N_MATRICES,LIST_A,LIST_B,LIST_C)

END PROGRAM

SUBROUTINE MULTIPLY_MATRICES(N,N_MATRICES,LIST_A,LIST_B,LIST_C)
IMPLICIT NONE
INTEGER(4) N,N_MATRICES
REAL(8) LIST_A(N,N,N_MATRICES),LIST_B(N,N,N_MATRICES),LIST_C(N,N,N_MATRICES)
INTEGER(4) I
DO I=1,N_MATRICES
LIST_C(:,:,I) = MATMUL(LIST_A(:,:,I),LIST_B(:,:,I))
ENDDO

END SUBROUTINE

Is this the best way to do this?
Should I declare the arrays in the subroutine as being ALLOCATABLE, and include an interface in the main program?
Would DGEMM perform better than MATMUL?
Should I use an array of pointers pointing towards each of the matrices instead of manipulating a big list of matrices?
Can this be easily parallelized (using OpenMP directives)? If each thread needs to access values in these lists simultaneously, this should create a huge overhead as they *fight* against each other to get access to variables stored in different locations of the same array... If I make a private copy of LIST_A and LIST_B for each thread, isn't there a risk of blowing the stack?

I am concerned about poor memory management and overhead with this type of approach.

Thanks a lot for your help/comments/ideas...

Olivier

jimdempseyatthecove
Honored Contributor III

Olivier,

Try

PROGRAM TEST
USE OMP_LIB
...
!$OMP PARALLEL DO
DO I=1,N_MATRICES
LIST_C(:,:,I) = MATMUL(LIST_A(:,:,I),LIST_B(:,:,I))
ENDDO
!$OMP END PARALLEL DO

Make sure you compile with OpenMP enabled.

The above will not require a private copy of LIST_A and LIST_B.
Note, I is implicitly private.

If you are allocating and deallocating your lists many times during the run of the application, you might want to consider using an array of pointers to 2-dimensional arrays, as sketched below. This will reduce the allocation size (at the expense of performing N_MATRICES allocations and deallocations).
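
For illustration, a minimal sketch of that layout (the type and variable names below are made up for the example):

PROGRAM TEST_POINTERS
IMPLICIT NONE
TYPE MATRIX_PTR
   REAL(8), POINTER :: M(:,:) => NULL()
END TYPE MATRIX_PTR
TYPE(MATRIX_PTR), ALLOCATABLE :: LIST_A(:)
INTEGER(4) I,N,N_MATRICES
N = 1000
N_MATRICES = 50
ALLOCATE(LIST_A(N_MATRICES))
DO I=1,N_MATRICES
   ALLOCATE(LIST_A(I)%M(N,N))   ! each matrix gets its own allocation
   LIST_A(I)%M = 0.0D0
ENDDO
! ... use LIST_A(I)%M wherever an ordinary N x N matrix is needed ...
DO I=1,N_MATRICES
   DEALLOCATE(LIST_A(I)%M)
ENDDO
DEALLOCATE(LIST_A)
END PROGRAM TEST_POINTERS

With this layout each matrix can be allocated or freed on its own, at the cost of N_MATRICES separate allocations.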

Jim Dempsey

abhimodak
New Contributor I

Hi Olivier and Jim

My 2 cents... I have done some performance tests and found that, for large matrices, kernel libraries such as Intel MKL perform better than the Fortran intrinsic MATMUL. (That made me sad...)

Abhi

TimP
Honored Contributor III
DGEMM, from a parallelized library such as Intel MKL, is the only easy way to parallelize this efficiently, if your matrices are as large as you say, letting MKL work on them one at a time. If they weren't big enough to keep all your cores busy that way, you would parallelize the loop outside MATMUL and restrict each MATMUL to a single thread, possibly using MKL sequential for matrices between 20x20 and 50x50.
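
For reference, a rough sketch of what the DGEMM route could look like for the loop in the original post (assuming the standard BLAS DGEMM interface, with alpha = 1.0D0 and beta = 0.0D0 so that each product is simply A*B, and N as the leading dimension):

DO I=1,N_MATRICES
   ! LIST_C(:,:,I) := 1.0 * LIST_A(:,:,I) * LIST_B(:,:,I) + 0.0 * LIST_C(:,:,I)
   CALL DGEMM('N','N',N,N,N,1.0D0,LIST_A(:,:,I),N, &
              LIST_B(:,:,I),N,0.0D0,LIST_C(:,:,I),N)
ENDDO

Linked against a threaded MKL, each DGEMM call can then use all available cores, one matrix pair at a time.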
OP1
New Contributor III

Thank you for your answer tim18!

I will indeed use DGEMM; it seems more efficient, especially since it is parallelized.

One of my other questions was related to how efficient the storage of these matrices (in a 3-D array) is, for any type of calculation. I made sure that the array is declared so that the matrix index is the rightmost index. Would I be better off declaring 50 independent matrix pointers rather than one large 3-D array?

Similarly, your comment about letting DGEMM operate in parallel (as opposed to parallelizing at the matrix index level) brings another question: how can we know where and when to parallelize a loop? Obviously it depends on the size of the data. But size in relation to what? The size of the stack?

I am still not sure I understand thoroughly how the data flows on multi-core CPUs and multi-CPU systems. Can somebody point me to a good reference on the topic?

I am developing a code which does a series of iterations of totally independent loops. Each loop contains a rather complicated series of function calls, etc. Ideally the code and data should be entirely duplicated on each CPU/core and then the results gathered at the end of execution. Now I am wondering if, in light of your comment, this will really work, due to the size of the data each thread needs to access/manipulate.

Ultimately this code will run on a cluster of computers. Are there any fundamental differences I need to be aware of in terms of parallelizing philosophy?

I understand these are trivial questions for many here - I would appreciate it if somebody could point me to a good reference for this.

Thanks a lot!

Olivier

jimdempseyatthecove
Honored Contributor III

Olivier,

Because your data are 50 (+/-) 2D arrays and not one 3D array, you might see some efficiency in using a pointer to a 2D array in lieu of passing a reference to a 3D array. Note, this may not make a difference in the MATMUL section of your code but it may make a significant difference elsewhere.

Also, if you use a parallelizing library, the parallelization in these libraries generally uses its own threads instead of your application's OpenMP threads. The result of this is the introduction of more context switching as well as a loss of control in your application over data locality (your management techniques to influence the system to keep data in the L2 cache of the core(s) working on particular sections of your data).

As to which method is best... well, you may have to try both and get the run times. Note, the times you get using the library replacement for MATMUL in a small test loop may be entirely different from the times you experience when it is called within your application. I suggest you keep both code paths in your code and then run the timing tests on the complete application.

What might work best is a compromise where you use pointers to 2D arrays and call the MKL or other library in a single threaded mode (assuming that is possible).
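
If MKL is the library, one possible sketch of that compromise is below (mkl_set_num_threads is an MKL service routine; linking against the sequential MKL library is another way to get the same effect):

CALL MKL_SET_NUM_THREADS(1)   ! keep each DGEMM call single threaded
!$OMP PARALLEL DO
DO I=1,N_MATRICES
   CALL DGEMM('N','N',N,N,N,1.0D0,LIST_A(:,:,I),N, &
              LIST_B(:,:,I),N,0.0D0,LIST_C(:,:,I),N)
ENDDO
!$OMP END PARALLEL DO

Here the parallelism comes from the OpenMP loop over matrices rather than from inside the library.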

Then your OpenMP parallel code can be enhanced in an attempt to keep the same 2D matrices on the same cores. This will improve system cache usage. If you plan ahead and make your code changes correctly, the changes can go in quite easily. Think of something along the lines of:

call PrepareList(YourList)
!$omp parallel private(I)
I = GetListIndex(YourList)
do while (I .gt. 0)
   call DoWork(I)
   I = GetListIndex(YourList)
end do
!$omp end parallel

The above is an outline and can be enhanced for performance or to fit your needs.

PrepareList initializes a table which indicates whether a list item has been chosen or not (it sets each item to not chosen).

GetListIndex tends to get the same items from the list for each thread making the call. However, after processing all of that thread's preferred list items, a check can be made for list items that are preferred to run on other threads. In this manner, if a thread stalls (e.g. it is context switched out to run something else on your system, or if the work load is not the same for each list item), then some other thread can process the item. Note, you would use this technique where DoWork has a significant amount of work. When DoWork is relatively short, then use a PARALLEL DO to distribute the work. As to where the cutoff is to select the method, well, you will have to run performance tests on this.
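
For illustration, a minimal sketch of the bookkeeping described above. PrepareList and GetListIndex are the names from the outline; the WorkListModule module and ItemTaken array are made up, the list argument is replaced by module state, and the per-thread preference is omitted to keep the example short:

module WorkListModule
implicit none
logical, allocatable :: ItemTaken(:)
contains
subroutine PrepareList(nItems)
   integer, intent(in) :: nItems
   if (allocated(ItemTaken)) deallocate(ItemTaken)
   allocate(ItemTaken(nItems))
   ItemTaken = .false.           ! mark every list item as not yet chosen
end subroutine PrepareList
function GetListIndex() result(Idx)
   integer :: Idx, J
   Idx = 0                       ! 0 signals "no work left"
   !$omp critical (WorkListLock)
   do J = 1, size(ItemTaken)
      if (.not. ItemTaken(J)) then
         ItemTaken(J) = .true.   ! claim this item for the calling thread
         Idx = J
         exit
      end if
   end do
   !$omp end critical (WorkListLock)
end function GetListIndex
end module WorkListModule

A fuller implementation along the lines Jim describes would also record which thread each item is preferred on and only fall back to other threads' items when a thread's own are exhausted.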

Jim Dempsey



TimP
Honored Contributor III

There is no easy answer to how large a loop should be for parallelization. With the parameters you mentioned, your matrices are large enough to minimize the overhead due to parallelization inside MKL DGEMM, and to require the cache blocking facility of MKL to spread sufficiently large pieces of the single matrix out over the available cache.

A more common problem is the question of the minimum size of a loop required for effective threaded parallelism. It depends a great deal on platform and application issues. In most of my experience, threaded parallelism is effective only for outer loops, in accordance with the 20-year-old slogan "concurrent outer, vector inner," as vectorization gains inner loop performance by grouping into chunks, and the threads need to operate on data as many cache lines apart as possible. Even with nested loops, if the loop is fairly simple, it usually requires a loop length of 150 to get an advantage from threading. Current multi-core CPUs are being designed to improve parallel efficiency for shared data.

OP1
New Contributor III

Thanks Tim and Jim for your answers.

I think Jim's suggestion is well suited to what I plan on doing. Parallelizing at a very coarse granularity is the way to go in my case. Ideally, I would have 50 CPUs executing the same complicated program (all starting with the same data except for one parameter), and then assemble the results obtained (a very minor part of the calculations). That's how parallel my case is, so to speak.

Now, I have another question: Assume I declare a PARALLEL region with the DEFAULT(PRIVATE) clause, and assume that in this region there is one subroutine which refers to a module containing many variables. Will the variables in the module be considered private, or will they be shared?

In other words, do the SHARED/PRIVATE clauses only impact the variables declared in the subroutine in which these clauses occur, or is there a cascading effect for any module variables referenced in the subroutines called within the parallel region?

Thanks for your insights on this,

Olivier

jimdempseyatthecove
Honored Contributor III

Olivier,

The SHARED/PRIVATE clauses only relate to the stack-local variables: scalars and non-SAVEd arrays.

subroutine foo
real :: A
real, save :: B
real :: C(3)
real, save :: D(3)
real, automatic :: E(3)

In the above, A and E are known to be stack-local, so shared/private applies to them.
B and D are known to be static and are always shared.
C may or may not be stack-local depending on the compiler implementation and the options chosen, and therefore its default behaviour is uncertain. For array declarations in OpenMP applications, be sure to include the SAVE or AUTOMATIC attribute, as this will remove any ambiguity.
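
For example, a sketch (with made-up names) of a worker routine meant to be called from inside a parallel region, with the storage class of each array spelled out:

subroutine worker(i)
implicit none
integer :: i
real, automatic :: scratch(1000)   ! explicitly stack-local: each thread gets its own copy
real, save      :: lookup(1000)    ! explicitly static: one copy shared by all threads
scratch = real(i)                  ! safe: each thread writes only its own scratch
lookup(1) = real(i)                ! shared: concurrent writes would need synchronization
end subroutine worker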

If you want module variables to be private, then I find it easiest to define a user defined type encompassing what you would normally have as module data and then create a thread private pointer to the data.

module ThreadContextModule
use ObjectModule
use OtherThingModule
type TypeThreadContext
   SEQUENCE
   type(TypeObject), pointer :: pObject => NULL()
   type(TypeOtherThing), pointer :: pOtherThing => NULL()
end type TypeThreadContext
type(TypeThreadContext) :: ThreadContext
COMMON /CONTEXT/ ThreadContext
!$OMP THREADPRIVATE(/CONTEXT/)
end module ThreadContextModule

Then in your initialization code run a parallel initialization piece of code that allocates the instance for each OpenMP thread.
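
A sketch of that initialization step, assuming the ThreadContextModule above (the routine name is made up):

subroutine InitThreadContexts()
use ThreadContextModule
implicit none
!$omp parallel
! ThreadContext is THREADPRIVATE, so each OpenMP thread allocates its own instance
allocate(ThreadContext%pObject)
allocate(ThreadContext%pOtherThing)
!$omp end parallel
end subroutine InitThreadContexts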

Jim Dempsey

OP1
New Contributor III

Thanks Jim,

So, just to make sure that I get this right. Assume this is the parallelized part of my program:

!$OMP PARALLEL PRIVATE(ARG)
...
CALL MY_SUBROUTINE(ARG)
...
!$OMP END PARALLEL

and the following is the declaration of MY_SUBROUTINE:

SUBROUTINE MY_SUBROUTINE(A)
USE MY_MODULE
REAL(8) A
...
END SUBROUTINE

All the variables contained in MY_MODULE will be shared by the parallel threads.

Is this correct?

Olivier

jimdempseyatthecove
Honored Contributor III

Olivier,

Consider:


SUBROUTINE MY_SUBROUTINE(A)
USE MY_MODULE
REAL(8) A
REAL(8) ARG
...
!$OMP PARALLEL PRIVATE(ARG)
...
CALL MY_SUBROUTINE(ARG)
...
!$OMP END PARALLEL
...
END SUBROUTINE

In the above, ARG is local to your MY_SUBROUTINE .and. each of the parallel threads gets its own private copy.

Now consider:


SUBROUTINE MY_SUBROUTINE(A)
USE MY_MODULE ! ARG declared inside MY_MODULE
REAL(8) A
...
!$OMP PARALLEL PRIVATE(ARG)
...
CALL MY_SUBROUTINE(ARG)
...
!$OMP END PARALLEL
...
END SUBROUTINE

Each thread instance has MY_MODULE in scope. In my opinion the above should error out. I haven't tried it to confirm that the compiler (when OpenMP directives are processed) will catch this error. The above should be (when OpenMP directives are processed) the same error as:

SUBROUTINE MY_SUBROUTINE(A)
USE MY_MODULE ! ARG declared inside MY_MODULE
REAL(8) A
REAL(8) ARG
...
!$OMP PARALLEL PRIVATE(ARG)
...
CALL MY_SUBROUTINE(ARG)
...
!$OMP END PARALLEL
...
END SUBROUTINE

ARG is doubly declared.

When a USE of a module brings in data (static data, initialized or not), it is the same as linking in MY_MODULE.OBJ (the data portion), just as if it came from a library or an explicit dependency in your project file. MY_MODULE.MOD is like a C/C++ precompiled header file without the static data.

Jim Dempsey

OP1
New Contributor III

Thanks a lot Jim! I have a much better understanding of this now.

Olivier
