Underutilization of processor resources by FGMRES

Feijoo__Gonzalo · ‎07-05-2018

Hi Everyone,

We are developing an application that uses the FGMRES function in the MKL library to solve systems of linear equations as part of Newton iterations. Recently we did some benchmarking and found that processor utilization goes down as the number of equations increases.

We instrumented the code and found that calls to dfgmres take a progressively larger share of the total solution time as the number of equations increases. Specifically, we modified the "fgmres_full_fnct_c.c" file provided in the MKL examples directory and measured the elapsed time of different operations, such as the calls to dfgmres and the servicing of reverse communication requests, among them RCI_request=1 (matrix-vector product) and RCI_request=3 (application of the preconditioner). Here are a few numbers:

number of equations = 480k

total solution time = 8.6 s

(rci_request = 1) = 0.7 s

(rci_request = 3) = 2.2 s

calls to dfgmres = 4.9 s

number of equations = 950k

total solution time = 27 s

(rci_request = 1) = 1.8 s

(rci_request = 3) = 5.7 s

calls to dfgmres = 18 s

number of equations = 7,150k

total solution time = 820 s

(rci_request = 1) = 15 s

(rci_request = 3) = 83 s

calls to dfgmres = 700 s
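A breakdown like the one above can be gathered by wrapping each phase of the reverse-communication loop in its own timer. The following is only a minimal sketch of that idea, not the code used for the measurements: `stub_solver_step` and `stub_work` are hypothetical stand-ins (they are not MKL functions) for the dfgmres call and for the user-side work, and only the two request codes named in the post are handled.

```c
#include <assert.h>
#include <time.h>

/* Accumulated wall-clock time and call counts per phase of the RCI loop. */
typedef struct {
    double solver;    /* seconds spent inside the solver call itself */
    double matvec;    /* seconds spent serving RCI_request == 1      */
    double precond;   /* seconds spent serving RCI_request == 3      */
    int    n_matvec;  /* number of matrix-vector requests served     */
    int    n_precond; /* number of preconditioner requests served    */
} phase_times;

/* Wall-clock seconds using the C11 timespec_get function. */
static double now_sec(void) {
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
}

/* HYPOTHETICAL stand-in for the solver call: hands out requests
 * 1, 3, 1, 3, ... and finally 0 (done). In the real program this
 * would be the call to dfgmres. */
static int stub_solver_step(int *calls_left) {
    if (*calls_left <= 0) return 0;
    int request = (*calls_left % 2 == 0) ? 1 : 3;
    --*calls_left;
    return request;
}

/* HYPOTHETICAL stand-in for the matrix-vector product or preconditioner. */
static void stub_work(void) {
    volatile double s = 0.0;
    for (int i = 0; i < 100000; ++i) s += (double)i;
}

/* Drive the loop and charge the elapsed time to the phase that used it. */
static phase_times run_instrumented(int steps) {
    phase_times t = {0.0, 0.0, 0.0, 0, 0};
    int calls_left = steps;
    int request;
    assert(steps >= 0);
    do {
        double t0 = now_sec();
        request = stub_solver_step(&calls_left);  /* real code: dfgmres call */
        t.solver += now_sec() - t0;

        t0 = now_sec();
        if (request == 1) {            /* matrix-vector product */
            stub_work();
            t.matvec += now_sec() - t0;
            t.n_matvec++;
        } else if (request == 3) {     /* apply preconditioner */
            stub_work();
            t.precond += now_sec() - t0;
            t.n_precond++;
        }
    } while (request != 0);
    return t;
}
```

Calling `run_instrumented(4)` serves two requests of each kind and returns the three accumulated times. Keeping the solver timer separate from the callback timers is what makes it possible to tell time spent inside the library from time spent in user code.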

We also took screenshots of the resource manager and noted that processor utilization is very low for long periods of time, as low as 4%, even though MKL correctly sets the maximum number of threads to the number of cores in the system (16).
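The figures above can be summarized with a little arithmetic. The sketch below computes the share of the total solution time spent in calls to dfgmres for the three problem sizes, and the utilization a monitor would report if only one logical processor were busy. The count of 32 logical processors is taken from the follow-up post in this thread; the function names are illustrative only.

```c
#include <assert.h>
#include <stdio.h>

/* Percentage of `total` accounted for by `part`. */
static double percent_of(double part, double total) {
    assert(total > 0.0);
    return 100.0 * part / total;
}

/* Utilization (in percent) reported when exactly one of
 * `logical_cpus` logical processors is fully busy. */
static double single_thread_utilization(int logical_cpus) {
    assert(logical_cpus > 0);
    return 100.0 / logical_cpus;
}

/* Print the dfgmres share for the three runs quoted in the post. */
static void print_summary(void) {
    /* {total solution time, time in calls to dfgmres}, in seconds */
    const double runs[3][2] = { {8.6, 4.9}, {27.0, 18.0}, {820.0, 700.0} };
    const char *sizes[3] = { "480k", "950k", "7,150k" };
    for (int i = 0; i < 3; ++i) {
        printf("%s equations: dfgmres share = %.0f%%\n",
               sizes[i], percent_of(runs[i][1], runs[i][0]));
    }
    printf("one busy thread on 32 logical CPUs = %.1f%%\n",
           single_thread_utilization(32));
}
```

The shares come out at roughly 57%, 67% and 85%, so the calls to dfgmres dominate more and more as the problem grows. One busy thread out of 32 logical processors corresponds to about 3.1%, which is close to the 4% floor observed and would be consistent with long single-threaded stretches, although the measurements alone do not prove that.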

Does anybody have an idea of what is happening?

Sincerely,

Gonzalo

PS: We have several current licenses of Intel Parallel Studio, but Intel's support site is not letting me submit this question to priority support because I am not associated with the account that was used to register the product in our office.

Feijoo__Gonzalo · ‎07-05-2018

By the way, when we solve the same systems of equations with the direct solver PARDISO, processor utilization is a constant 50% (the system has 32 virtual cores, 16 physical cores). Gonzalo

Gennady_F_Intel · ‎07-05-2018

The cause may be that dfgmres is not threaded, or that it is not implemented efficiently. Which version of MKL do you use? Could you please set the environment variable MKL_VERBOSE=1 and check the version number?

Feijoo__Gonzalo · ‎07-10-2018

Hi Gennady,

Thank you for your reply! We are using version 2017.1.143. I was under the impression that dfgmres is parallelized, and would be surprised if it is not. PARDISO and the sparse matrix-vector products are parallelized, so I thought this would extend to the functions implementing the iterative solvers.

Please let me know!

Best, Gonzalo

Gennady_F_Intel · ‎07-12-2018

Hello Gonzalo, actually fgmres is not threaded, but we did not expect that to be a problem, because the performance bottleneck in this sort of computation is normally the matrix-vector multiplication and the application of the preconditioner. Based on your results, however, the bottleneck is fgmres itself. How could we reproduce the problem on our side? Thanks
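The thread does not establish what dfgmres spends its time on between requests. One candidate worth checking, since textbook GMRES orthogonalizes every new Krylov vector against all previous basis vectors, is that orthogonalization sweep. The sketch below is a generic modified Gram-Schmidt sweep written for illustration; it is an assumption about the kind of work involved, not MKL's implementation, and `mgs_sweep` and `mgs_flops` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Orthogonalize w (length n) against k orthonormal basis vectors stored
 * contiguously in v (vector j starts at v + j*n), using modified
 * Gram-Schmidt. The projection coefficients are written to h[0..k-1].
 * Returns the squared 2-norm of the remaining w; normalizing w is left
 * to the caller. */
static double mgs_sweep(size_t n, size_t k, const double *v,
                        double *w, double *h) {
    assert(v != NULL && w != NULL && h != NULL);
    for (size_t j = 0; j < k; ++j) {
        const double *vj = v + j * n;
        double dot = 0.0;
        for (size_t i = 0; i < n; ++i) dot += vj[i] * w[i];  /* h_j = <v_j, w> */
        h[j] = dot;
        for (size_t i = 0; i < n; ++i) w[i] -= dot * vj[i];  /* w -= h_j v_j   */
    }
    double norm2 = 0.0;
    for (size_t i = 0; i < n; ++i) norm2 += w[i] * w[i];
    return norm2;
}

/* Approximate flop count of one sweep: k dot products and k vector
 * updates at about 2n flops each (the final norm is not counted). */
static double mgs_flops(size_t n, size_t k) {
    return 4.0 * (double)n * (double)k;
}
```

In this sketch the cost of one sweep grows with both the number of equations n and the number k of basis vectors already stored, and every sweep streams k vectors of length n through memory. If the library does something similar on a single thread, that would fit the pattern reported above, but confirming it would need a profile of the actual dfgmres calls.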