Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

Checking number of threads

ljbetche
Beginner
Hello,

I have recently converted a CFD code, which solves large linear systems, to use MKL FGMRES, and while it appears to work properly, the performance is much slower than I'd hoped. Is there a way to check the number of threads/CPUs FGMRES is actually using? I'm running on an 8-core Xeon system and am using OMP_SET_NUM_THREADS to set the number of threads to 8.
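For reference, both runtimes can report what they are configured to use. Below is a minimal sketch of the query calls, shown in C (the Fortran interface provides the same routines). Note that these report the configured maximum, not what a particular routine chooses to use internally:

#include <stdio.h>
#include <omp.h>
#include "mkl.h"    /* mkl_get_max_threads, mkl_set_num_threads */

int main(void)
{
    omp_set_num_threads(8);   /* same effect as exporting OMP_NUM_THREADS=8 */

    printf("OpenMP max threads: %d\n", omp_get_max_threads());
    printf("MKL max threads:    %d\n", mkl_get_max_threads());

    /* MKL-specific override; for MKL calls this takes precedence
       over the generic OpenMP setting. */
    mkl_set_num_threads(8);
    printf("MKL max threads now: %d\n", mkl_get_max_threads());
    return 0;
}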

Lee
ljbetche
Beginner

Actually, just checking the documentation, is the FGMRES solver threaded at all? Or do I need to parallelize it by breaking my matrix up into a number of sections?

If it is necessary to break up the problem, one further question: when calling DFGMRES_GET with IPAR(13) > 0, any vector, say V, can be used in place of the B vector to retrieve the intermediate solution, correct? So there is no need to copy the right-hand side into V before calling DFGMRES_GET for the first time? The example in the reference manual is somewhat confusing: it uses a vector RHS as the right-hand side everywhere except in the call to DFGMRES_GET, where it passes a vector B to avoid overwriting RHS. That makes perfect sense, but the example also copies RHS into B at the start of the routine, which seems pointless. Finally, will DFGMRES work properly if the right-hand side vector changes during the solve (because of updating from one chunk of the original problem to another)?
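For concreteness, here is a hedged sketch of the RCI FGMRES call sequence, written in C (IPAR(13) in the Fortran manual corresponds to ipar[12] here). The operator apply_A() and the sizes are placeholders, and the ipar settings mirror the reference example so that the library performs the stopping tests itself. Whether the vector passed to dfgmres_get must already hold the right-hand side on entry is exactly the question above; the sketch follows my reading of the manual (with ipar[12] > 0 the routine simply writes the current solution into that argument), so verify against the current documentation.

#include <stdio.h>
#include "mkl_rci.h"

#define N        1000   /* problem size (placeholder)                  */
#define RESTART  150    /* ipar[14]: iterations before a restart       */
#define TMP_SIZE ((2*RESTART + 1)*N + RESTART*(RESTART + 9)/2 + 1)

/* Placeholder operator: a diagonally dominant tridiagonal matrix. */
static void apply_A(const double *in, double *out)
{
    for (MKL_INT i = 0; i < N; i++)
        out[i] = 4.0*in[i] - (i > 0     ? in[i-1] : 0.0)
                           - (i < N - 1 ? in[i+1] : 0.0);
}

int main(void)
{
    static double tmp[TMP_SIZE];
    double  rhs[N], x[N], v[N];   /* v will receive intermediate solutions */
    double  dpar[128];
    MKL_INT ipar[128], RCI_request, itercount, n = N;

    for (MKL_INT i = 0; i < N; i++) { rhs[i] = 1.0; x[i] = 0.0; }

    dfgmres_init(&n, x, rhs, &RCI_request, ipar, dpar, tmp);
    if (RCI_request != 0) return 1;

    ipar[8]  = 1;       /* library performs the residual stopping test     */
    ipar[9]  = 0;       /* no user-defined stopping test (no request == 2) */
    ipar[11] = 1;       /* automatic check of the generated vector's norm  */
    ipar[14] = RESTART;
    ipar[12] = 1;       /* > 0: dfgmres_get writes the current solution
                           into whatever array occupies its "b" slot       */

    dfgmres_check(&n, x, rhs, &RCI_request, ipar, dpar, tmp);
    if (RCI_request != 0) return 1;

    for (;;) {
        dfgmres(&n, x, rhs, &RCI_request, ipar, dpar, tmp);
        if (RCI_request == 1) {
            /* Reverse communication: apply A to tmp[ipar[21]-1 ...] and
               put the result at tmp[ipar[22]-1 ...].                      */
            apply_A(&tmp[ipar[21] - 1], &tmp[ipar[22] - 1]);
        } else if (RCI_request == 0) {
            break;                                  /* converged           */
        } else {
            return 1;                               /* error / other code  */
        }
    }

    /* With ipar[12] > 0 the approximate solution is written to v;
       rhs is not touched by this call.                                    */
    dfgmres_get(&n, x, v, &RCI_request, ipar, dpar, tmp, &itercount);
    printf("FGMRES finished after %d iterations\n", (int)itercount);
    return 0;
}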

TimP
Honored Contributor III
Quoting - ljbetche


As you imply, certain MKL functions aren't threaded, because it is difficult to gain performance by starting new threads at that level. Others may be programmed to use fewer than the available number of threads when the problem size indicates it. It's often beneficial to partition your problem and run multiple copies of MKL functions linked with mkl_sequential. If you call MKL inside your own OpenMP parallel region, it won't start new threads, even when linked against the threaded MKL layer, unless you set OMP_NESTED.
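To make the partitioning idea concrete, here is a minimal C sketch of the pattern; the block decomposition and the cblas_dgemv call are only illustrative, not anything specific to the FGMRES code. The caller's own OpenMP loop distributes independent blocks and each thread calls MKL on its block; linked with mkl_sequential, or with the threaded MKL and nesting left off, the MKL call itself stays single-threaded.

#include <stdio.h>
#include <stddef.h>
#include <omp.h>
#include "mkl.h"

/* y = A*x with A stored row-major and split into nblocks contiguous
   row blocks; each block is handled by one thread with its own
   (single-threaded) MKL call.  Assumes nblocks divides n evenly.     */
void blocked_matvec(const double *A, const double *x, double *y,
                    MKL_INT n, MKL_INT nblocks)
{
    MKL_INT rows = n / nblocks;

    #pragma omp parallel for
    for (MKL_INT b = 0; b < nblocks; b++) {
        const double *Ab = A + (size_t)b * rows * n;   /* this thread's rows */
        double       *yb = y + (size_t)b * rows;

        cblas_dgemv(CblasRowMajor, CblasNoTrans, rows, n,
                    1.0, Ab, n, x, 1, 0.0, yb, 1);
    }
}

int main(void)
{
    enum { NN = 8 };
    double A[NN*NN], x[NN], y[NN];

    for (int i = 0; i < NN*NN; i++) A[i] = (i % (NN + 1) == 0) ? 2.0 : 0.1;
    for (int i = 0; i < NN; i++)    x[i] = 1.0;

    blocked_matvec(A, x, y, NN, 4);            /* 4 blocks of 2 rows each */
    printf("y[0] = %g\n", y[0]);
    return 0;
}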
I would check the number of threads used by libiomp in each parallel region by turning on OpenMP profiling (linking against libiompprof). With a default Linux dynamic link, libiompprof can also be activated via LD_PRELOAD, in case you don't wish to re-link. It accumulates data on threaded regions and writes it to the file guide.gvs. This feature is only semi-supported; we have been warned it may go away next year.
Windows VTune can read guide.gvs and prepare plots, but there is more information in the text file than can be plotted. I have tried every which way to use libiompprof with Thread Profiler; that's a frustrating task.
ljbetche
Beginner
Quoting - tim18


Tim,

My matrices contain more than 10^6 equations, so I can't imagine MKL choosing to use fewer processors than are available, yet my code runs marginally slower with MKL than with my old serial solver (hence the original question). If I were to partition my problem and explicitly parallelize with OpenMP (which I do not do now), would I need to switch to the sequential libraries, or could I keep linking as I do currently and call the regular MKL functions within my parallel region (I am compiling with ifort)? They should then automatically run unthreaded, correct? I ask only because I am running my code on one node of a large cluster that I do not manage, and I had to fight like heck to get the code to compile properly in the first place, so I don't want to change my makefile unless I have to.

Also, I use ifort's OpenMP library (-openmp -openmp-lib=compat) instead of MKL's libiomp; this should be OK, correct?

Lee
TimP
Honored Contributor III
Yes, if you call the threaded MKL from a parallel region, and don't set OMP_NESTED, the MKL threading should not be activated. Linking with mkl_sequential would save a little overhead, but that should be negligible for a large problem.
If you are using the MKL which is integrated into an installation of ifort "Professional," the libiomp you get should be identical. I agree that it's preferable to use the ifort -openmp options to select the libiomp at link time.
I just checked out libiompprof; in my case, it did identify the parallel regions that are set up inside MKL. Hence my suggestion to use it as an easy way to find out how many threads are running, whether they are balanced, and how much time is spent in serial and parallel regions.
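A quick way to see that nesting behaviour, independent of MKL: with nested parallelism left at its default (disabled), any parallel region opened inside an existing one, which is what the threaded MKL would do internally, gets a team of one thread. A minimal C sketch:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    omp_set_num_threads(8);

    #pragma omp parallel num_threads(2)
    {
        #pragma omp parallel    /* stands in for MKL's internal region */
        {
            #pragma omp single
            printf("outer thread %d: inner team size = %d\n",
                   omp_get_ancestor_thread_num(1), omp_get_num_threads());
        }
    }
    return 0;
}

Without OMP_NESTED each line reports an inner team size of 1, i.e. the inner region is serialized; enabling nesting would change that.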
TimP
Honored Contributor III
Quoting - tim18
It should go without saying that you set an appropriate KMP_AFFINITY for your system, but it occurred to me that you didn't mention it, so now I've mentioned it.
ljbetche
Beginner
Quoting - tim18


Tim,

Actually, I'm new to parallel programming, so I hadn't known about that setting. After reading the ifort documentation, I think KMP_AFFINITY=granularity=fine,compact would make the most sense for my application. I'll try that and see how the performance is affected. Thanks for the idea; it seems that even if I have to partition the problem manually, this setting is quite important.
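KMP_AFFINITY is normally just exported in the shell or job script (export KMP_AFFINITY=granularity=fine,compact). For completeness, a hedged C sketch of setting it from inside the program; the assumption here is that the variable only needs to be in the environment before the OpenMP runtime initializes, i.e. before the first OpenMP call or parallel region, and the verbose modifier makes the runtime print the resulting binding so you can confirm it took effect:

#include <stdlib.h>
#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* Must run before anything touches the OpenMP runtime. */
    setenv("KMP_AFFINITY", "verbose,granularity=fine,compact", 1);

    #pragma omp parallel
    {
        #pragma omp single
        printf("running with %d threads\n", omp_get_num_threads());
    }
    return 0;
}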

Lee