Parallel FGMRES

Amar_K_1 · ‎04-26-2013

Hello!

What paradigm of parallel processing is FGMRES designed for: SPMD or MPMD?

Is it possible to run FGMRES in parallel in the Single Program Multiple Data paradigm?

I intend to partition my matrix into n parts and then invoke FGMRES sequentially on each of the n processors (using the parallel flag took more time than the sequential run for the example routine). Does it make sense to do this? Are there smarter ways of dealing with this?

Many Thanks,

SergeyKostrov · ‎04-26-2013

>>...as using the parallel flag took more time than the sequential run for the example routine...

Could you provide more technical details? What do you mean by "parallel flag"? Do you mean the Intel C++ compiler command line option "-mkl:parallel"?

Amar_K_1 · ‎04-27-2013

Sergey Kostrov wrote:

>>...as using the parallel flag took more time than the sequential run for the example routine...

Could you provide more technical details? What do you mean by "parallel flag"? Do you mean the Intel C++ compiler command line option "-mkl:parallel"?

Thanks for getting back! Yes, I mean the option which can be set to sequential or parallel in the makefile:

ifndef threading

threading=parallel

endif

I'm writing my code in Fortran 90 (with some Fortran 77 subroutines).

SergeyKostrov · ‎04-29-2013

Thanks.

>>...then invoke FGMRES sequentially in each of the n processors (as using the parallel flag took more time than the sequential run for the example routine). Does it make sense to do this?..

It is still not clear what your application does, what hardware it runs on, the size of the data set to be processed, etc.

Amar_K_1 · ‎05-02-2013

Sergey Kostrov wrote:

It is still not clear what your application does, what hardware it runs on, the size of the data set to be processed, etc.

Thanks for getting back and sorry for the delay!

I'm writing a finite element program, so my problem reduces to solving a huge system of linear equations. Typical meshes (input data) involve millions of elements and close to a million nodes, so the matrix I'm looking to invert will be about 1 million by 1 million in size (and unsymmetric). I will be running on clusters with between 32 and 512 processors.

My current strategy is to divide the mesh into a number of partitions and give every processor one portion of the mesh. My currently serial source code will be parallelized using MPI in the single program multiple data sense. As far as I understand the parallel flag in the makefile for FGMRES, no domain decomposition or matrix decomposition happens. So I'm thinking that the parallelization that occurs with the parallel flag is MPMD? I need some clarification in this regard. Please advise!
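
To make the partitioning idea concrete, here is a minimal sketch of a contiguous block-row split of the unknowns across ranks. The function name `block_row_ranges` is hypothetical, and a real finite element code would more likely use a graph partitioner on the mesh; this only illustrates how roughly a million rows could be divided among 32 processes.

```python
def block_row_ranges(n_rows, n_ranks):
    """Return one half-open (start, stop) row range per rank.

    Rows are split as evenly as possible; the first `n_rows % n_ranks`
    ranks receive one extra row each.
    """
    base, extra = divmod(n_rows, n_ranks)
    ranges = []
    start = 0
    for rank in range(n_ranks):
        count = base + (1 if rank < extra else 0)
        ranges.append((start, start + count))
        start += count
    return ranges


# Sizes taken from the post: about 1 million unknowns, 32 processes.
ranges = block_row_ranges(1_000_000, 32)
print("rank 0 owns rows", ranges[0])
print("rank 31 owns rows", ranges[-1])
```

In an SPMD program every rank would run the same code and pick its own entry, for example `ranges[rank]`.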

Many Thanks

SergeyKostrov · ‎05-03-2013

>>...So, I'm thinking that the parallelization that occurs with the parallel flag is MPMD?..

I would rather call it Single-Program-Multiple-Threads-Single-DataSet. I think you need to look at another command line option:

...
/Qmkl[: arg ]
    link to the Intel(R) Math Kernel Library (Intel(R) MKL) and bring in the associated headers
    parallel   - link using the threaded Intel(R) MKL libraries. This is the default when /Qmkl is specified
    sequential - link using the non-threaded Intel(R) MKL libraries
    cluster    - link using the Intel(R) MKL Cluster libraries plus the sequential Intel(R) MKL libraries
...

Amar_K_1 · ‎05-04-2013

Sergey Kostrov wrote:

>> I think you need to look at another command line option: ...
/Qmkl[: arg ]

Kindly bear with me for the long post; to make things clear, I need to present my problem in detail.

Thanks for your advice! I'm sorry, I don't understand exactly how to use the piece of information you gave. I would like to mention that I'm using a Linux cluster.

Currently, I generate my executable by entering the following on the command line:

1. Sequential

ifort -xHost -check -g -traceback -I/System/CentOS5.4/INTEL/mkl/include -fpp source/light.f90 source/sourcecode1.f90 source/sourcecode2.f -L"/System/CentOS5.4/INTEL/mkl/lib/em64t" "/System/CentOS5.4/INTEL/mkl/lib/em64t"/libmkl_lapack95_lp64.a "/System/CentOS5.4/INTEL/mkl/lib/em64t"/libmkl_solver_lp64_sequential.a "/System/CentOS5.4/INTEL/mkl/lib/em64t"/libmkl_intel_lp64.a -Wl,--start-group "/System/CentOS5.4/INTEL/mkl/lib/em64t"/libmkl_sequential.a "/System/CentOS5.4/INTEL/mkl/lib/em64t"/libmkl_core.a -Wl,--end-group -lpthread -lm -o _results/intel_lp64_sequential_em64t_lib/executable.out

2. Parallel:

ifort -w -I/System/CentOS5.4/INTEL/mkl/include -fpp source/sourcecode1.f90 source/sourcecode2.f -L"/System/CentOS5.4/INTEL/mkl/lib/em64t" "/System/CentOS5.4/INTEL/mkl/lib/em64t"/libmkl_solver_lp64.a "/System/CentOS5.4/INTEL/mkl/lib/em64t"/libmkl_intel_lp64.a -Wl,--start-group "/System/CentOS5.4/INTEL/mkl/lib/em64t"/libmkl_intel_thread.a "/System/CentOS5.4/INTEL/mkl/lib/em64t"/libmkl_core.a -Wl,--end-group -L"/System/CentOS5.4/INTEL/mkl/lib/em64t" -liomp5 -lpthread -lm -o _results/intel_lp64_parallel_em64t_lib/executable_Parallel.out

Since programming is not my primary specialization, I just used the example makefile that was provided by Intel and observed what was printed on the command line on running make libem64t. I then copied that line and included the names of my source codes in the appropriate locations and generated my executable!

!1. Kindly advise if there is a risk involved with this approach.

!2. Now that you know how I generate the executable, could you give more information about how, where and why to use the /Qmkl[: arg ] option.

!3. Most importantly, I'm still not clear whether I can invert my million by million matrix efficiently using FGMRES in a program which is parallelized in the Single Program Multiple Data (SPMD) sense.

Just to make things clear, this is how I would make use of Intel MKL FGMRES in my code:

Let's consider an example with 32 processors:

a. My input, i.e. mesh will be partitioned into 32 parts, therefore every processor gets a part of the mesh.

b. A time loop then starts to run on every processor. Towards the end of every time loop, a matrix is generated in every processor. (This matrix is representative of the mesh partition that every processor holds.)

c. Once all the processors have their matrices ready at the end of the first time loop... For simplicity, at this stage, let's assume we were running on just 1 processor (sequentially, both my code and FGMRES from Intel MKL), so we already have the matrix ready at the end of the first time step. This matrix would be inverted using FGMRES from Intel MKL. The results obtained will be fed as input to the 2nd time step. This process repeats until the required number of time steps has been run.

To present my problem, let's go back to our parallel setup, at the point where we left off, i.e. towards the end of the first time step, when all the processors have their matrices ready to be inverted. Inverting the matrices separately on every processor will give a wrong result, because we didn't solve the complete mesh as a whole!
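
A small numerical illustration of that last point, as a sketch only: the 4x4 unsymmetric matrix, the two-way split and the helper `solve_dense` are all made up for the example and have nothing to do with MKL. Each "processor" solves only its own diagonal block and ignores the coupling to the other partition, and the result is compared with a solve of the complete system.

```python
def solve_dense(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))  # pivot row
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= f * M[k][j]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


# Hypothetical global system; entries outside the diagonal blocks
# couple partition 0 (rows 0-1) to partition 1 (rows 2-3).
A = [[4.0, 1.0, 2.0, 0.0],
     [1.0, 5.0, 0.0, 1.0],
     [1.0, 0.0, 6.0, 2.0],
     [0.0, 3.0, 1.0, 7.0]]
x_true = [1.0, 2.0, 3.0, 4.0]
b = [sum(a * x for a, x in zip(row, x_true)) for row in A]

# Solve of the complete system.
x_global = solve_dense(A, b)

# Each partition solves only its own diagonal block, ignoring coupling.
x_blocks = []
for start, stop in [(0, 2), (2, 4)]:
    A_local = [row[start:stop] for row in A[start:stop]]
    x_blocks += solve_dense(A_local, b[start:stop])

print("global solve :", x_global)
print("block solves :", x_blocks)
```

The global solve recovers `x_true` up to rounding, while the independent block solves do not, because the off-diagonal coupling between partitions was dropped.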

My question is: Is there a way in which Intel MKL FGMRES can run so as to use matrices sitting on different processors to calculate the matrix inverse? The matrix whose inverse I'm after can be obtained by putting together the 32 matrices sitting on the 32 processors we considered. I hope I have presented my problem clearly.
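
For context, a Krylov method such as FGMRES touches the matrix only through matrix-vector products and combines vectors through dot products and norms, so a distributed solve needs those two operations to work across partitions. The sketch below simulates two ranks in a single Python process. The helpers `allgather` and `allreduce_sum` are hypothetical stand-ins for what `MPI_Allgatherv` and `MPI_Allreduce` would do in a real MPI code, and nothing here is MKL's API. Whether MKL's FGMRES can be driven this way is not settled in this thread.

```python
def allgather(pieces):
    """Stand-in for MPI_Allgatherv: concatenate every rank's slice."""
    return [v for piece in pieces for v in piece]


def allreduce_sum(partials):
    """Stand-in for MPI_Allreduce with a sum: add every rank's partial."""
    return sum(partials)


def local_matvec(local_rows, x_full):
    """Multiply the rows owned by one rank with the full vector."""
    return [sum(a * xj for a, xj in zip(row, x_full)) for row in local_rows]


# Hypothetical 4x4 system split into two block rows (two "ranks").
A = [[4.0, 1.0, 2.0, 0.0],
     [1.0, 5.0, 0.0, 1.0],
     [1.0, 0.0, 6.0, 2.0],
     [0.0, 3.0, 1.0, 7.0]]
x = [1.0, 2.0, 3.0, 4.0]
row_ranges = [(0, 2), (2, 4)]

A_local = [A[s:e] for s, e in row_ranges]  # rows owned by each rank
x_local = [x[s:e] for s, e in row_ranges]  # vector slice owned by each rank

# Distributed matvec: gather x, then each rank multiplies its own rows.
x_full = allgather(x_local)
y_local = [local_matvec(rows, x_full) for rows in A_local]
y = allgather(y_local)

# Distributed dot product x . y: local partial sums, then a global sum.
dot = allreduce_sum(
    sum(a * b for a, b in zip(xl, yl)) for xl, yl in zip(x_local, y_local)
)

print("y =", y)
print("x . y =", dot)
```

With these two building blocks each rank keeps only its own rows and vector slice, yet the iteration operates on the complete system rather than on 32 independent ones.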

Please advise!

Kind Regards

Amar_K_1 · ‎05-04-2013

NOTE: I should have used the phrase - "solving the linear system of equations" rather than "inverting the matrix", in my post above!