Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

Question about "Working with the Intel Math Kernel Library Cluster Software"

Zhanghong_T_
Novice
632 Views
Dear administrator,

I am trying to use the Intel MKL cluster version. The working environment is Windows 7 x64 + VS2008 + Intel Visual Fortran 11.1 (which includes MKL) + Intel MPI 3.2.1.009. I linked the following static libraries into my VS2008 project:

mkl_intel_c.lib mkl_intel_thread.lib mkl_core.lib libiomp5mt.lib impi.lib impicxx.lib mkl_scalapack_core.lib mkl_blacs_intelmpi.lib

The program works well without any additional setting. However, after I set the environment variable OMP_NUM_THREADS=4 (my machine has 8 CPU cores), the calculation time increased from 28 seconds to 29 seconds.

Another result I noticed is that the residual error of the solution (I used the BLAS inside the direct sparse solver) increased from about 1.e-11 to 1.e-8 when the environment variable OMP_NUM_THREADS was set to 4.
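The residual shift with threading is plausibly a rounding-order effect rather than a bug: a threaded BLAS reduction adds the same numbers in a different order than a sequential run, and floating-point addition is not associative. A minimal illustration in plain Python (no MKL involved; the values are chosen to exaggerate the effect):

```python
# Floating-point addition is not associative, so summing the same
# numbers in a different order (as a threaded reduction implicitly
# does) can change the result.
vals = [1e16, 1.0, -1e16, 1.0]

sequential = 0.0
for v in vals:
    sequential += v               # left to right: 1e16 + 1.0 rounds back to 1e16

# Pairwise order, as a two-way parallel split might produce:
reordered = (vals[0] + vals[2]) + (vals[1] + vals[3])

print(sequential)                 # 1.0
print(reordered)                  # 2.0 (the exact answer)
```

Both results are "correct" to within rounding; a residual moving by three orders of magnitude suggests the system is quite sensitive to such perturbations.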


Could anyone help me find out what causes this?

Thanks,
Zhanghong Tang
0 Kudos
12 Replies
Gennady_F_Intel
Moderator
Hi Tang,
Just to clarify your execution environment: you are working on an x64 OS and want to run an application linked with 32-bit libraries, right?
- We strongly recommend using the dynamic OpenMP runtime (libiomp5md) instead of the static one (libiomp5mt) you used.
Quote: "the calculation time increased from 28 seconds to 29 seconds"
- What did you expect?
- By default, MKL will use all available cores of the system where the application is running (in your case, MKL used all 8 cores). When you set OMP_NUM_THREADS=4, MKL will use only 4 threads.
Please read the User Guide (Chapter 6 - Using MKL's Parallelism) for more details.
- But it should be noted that this behavior depends on the input task size. Sometimes decreasing the number of active threads produces better performance, because of threading overhead with small tasks, etc.
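For example, the 32-bit link line from the first post would change in only one library (a sketch based on the names already quoted; verify against the link-line guidance for your MKL version):

```text
mkl_intel_c.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib impi.lib impicxx.lib mkl_scalapack_core.lib mkl_blacs_intelmpi.lib
```

with libiomp5md.lib (dynamic OpenMP runtime) replacing libiomp5mt.lib.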
--Gennady
Zhanghong_T_
Novice
Dear Dr. Gennady,

Thank you very much for your kind reply. In fact, I will build both 32-bit and 64-bit applications on the x64 OS. The libraries I use to build the 64-bit application are:

mkl_intel_lp64.lib mkl_intel_thread.lib mkl_core.lib libiomp5mt.lib impi.lib impicxx.lib mkl_blacs_intelmpi_lp64.lib mkl_scalapack_lp64.lib

I should correct what I said before: the result of 28 seconds was obtained with OMP_NUM_THREADS=1, not without any setting.

I think the input task size is not small. However, I will test your suggestions:
1) use dynamic link;
2) increase the input size to the application and compare the time again.

Thanks,
Zhanghong Tang

Zhanghong_T_
Novice
Dear Dr. Gennady,

Just now I tried using MD instead of MT to build the application again and ran a new test. Unfortunately, it doesn't improve much: the time only went from 29 seconds to 27 seconds. I use MKL's BLAS to solve a sparse system with more than 160,000 unknowns.

I expected the CPU time to decrease more compared to OMP_NUM_THREADS=1, since the BLAS subroutines account for most of the CPU time during the solution.

BTW: I checked the User Guide and didn't find the chapter "Using MKL's Parallelism":
Chapter 1 Overview
Chapter 2 Getting Started
Chapter 3 Intel® Math Kernel Library Structure
Chapter 4 Configuring Your Development Environment
Chapter 5 Linking Your Application with the Intel® Math Kernel Library
Chapter 6 Managing Performance and Memory
Chapter 7 Language-specific Usage Options
Chapter 8 Coding Tips
Chapter 9 Working with the Intel® Math Kernel Library Cluster Software
Chapter 10 Getting Assistance for Programming in the Microsoft Visual Studio* IDE
Chapter 11 LINPACK and MP LINPACK Benchmarks
Appendix A Intel® Math Kernel Library Language Interfaces Support
Appendix B Support for Third-Party Interfaces

I got the related information from Chapter 9.

Thanks,
Zhanghong Tang
Gennady_F_Intel
Moderator
Hello Zhanghong Tang,
One more question: if you are solving a sparse matrix, why are you using BLAS functionality for that?
Did you try PARDISO (the Parallel Direct Sparse Solver)?
One note: PARDISO works on SMP systems only, but your task size (160K unknowns) is not very big and can be solved efficiently by PARDISO.
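For reference, calling PARDISO is short; a minimal Fortran sketch (not the poster's code), assuming a real symmetric positive definite matrix already assembled in CSR arrays ia, ja, a, with right-hand side b and solution x:

```fortran
! Minimal PARDISO phase sequence; declarations of n, a, ia, ja, b, x
! and all error checking omitted for brevity.
integer(8) :: pt(64)                 ! opaque solver handle, must start zeroed
integer    :: iparm(64)              ! iparm(1) = 0 selects default settings
integer    :: perm(1), error
integer    :: maxfct = 1, mnum = 1
integer    :: mtype = 2              ! 2 = real symmetric positive definite
integer    :: nrhs = 1, msglvl = 0, phase

pt = 0
iparm = 0

phase = 13                           ! analysis + factorization + solve
call pardiso(pt, maxfct, mnum, mtype, phase, n, a, ia, ja, &
             perm, nrhs, iparm, msglvl, b, x, error)

phase = -1                           ! release internal memory
call pardiso(pt, maxfct, mnum, mtype, phase, n, a, ia, ja, &
             perm, nrhs, iparm, msglvl, b, x, error)
```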
Regarding the MKL User Guide: I meant Chapter 6, "Managing Performance and Memory".
--Gennady
Zhanghong_T_
Novice
Dear Dr. Gennady,

Thank you very much for your kind reply. We are now trying to develop our own sparse solver, an AMG solver, and we use the BLAS inside it.

PARDISO is an excellent solver and very fast; we take it as the default. However, for some large problems that must be solved on 32-bit machines, its memory use is too large, so it is necessary to develop an iterative solver.

On the other hand, I have also checked Chapter 6 as you suggested. There are many environment variables that can be set; I tried setting them and didn't find any significant improvement.


Thanks,
Zhanghong Tang
Gennady_F_Intel
Moderator
Does that mean you are talking about iterative sparse solvers, not direct ones? Am I right?
What type of ISS are you using? I guess the input task size is ~160K, as you mentioned earlier?
--Gennady
Zhanghong_T_
Novice
Dear Gennady,

Thank you very much for your kind reply. I use algebraic multigrid (a kind of ISS) to replace PARDISO when the problem size is large. The 160K case is only used to test the algebraic multigrid solver; the real problem size could be tens of millions.

With this test case, I expected the solver, which calls MKL's BLAS functions, to show an improvement when the number of cores is larger than 1.

Thanks,
Zhanghong Tang
Gennady_F_Intel
Moderator
Well, thanks for the explanation. Nevertheless, about the original issue you mentioned:
The program works well without any additional setting. However, after I set the environment variable OMP_NUM_THREADS=4 (my machine has 8 CPU cores), the calculation time increased from 28 seconds to 29 seconds.
1) If in this case you are talking about PARDISO and the input matrix size is ~160K, then this is unexpected performance behavior for us.
What type of matrices are you solving?
Is it in-core mode?
2) For really big tasks like yours (tens of millions of unknowns), it may make sense to check the OOC (out-of-core) mode of PARDISO. At the moment (MKL 10.2 Update 4), PARDISO OOC is threaded only for symmetric positive definite types. The next Update 5 (to be released soon) will have all matrix types threaded.
--Gennady
Zhanghong_T_
Novice
Hi Gennady,

Thank you very much for your kind reply. I think I still didn't explain it clearly.

I am testing an AMG solver that uses MKL's BLAS. The BLAS functions can run in parallel when more than one core is available, so they should be faster than on a single core; as a result, the AMG solver should be faster when multiple cores are used.

However, my results show that the solution time of the AMG solver didn't decrease; it stayed equal or even increased. I don't know whether I am missing something needed to make MKL's BLAS functions run in parallel on multiple cores.

Thanks,
Zhanghong Tang
Gennady_F_Intel
Moderator
Hi Zhanghong Tang,
Which specific BLAS functions from MKL are you using?
--Gennady
Zhanghong_T_
Novice
Dear Gennady,

The following functions are used:
NRM2
SWAP
TRSM
GEMM
COPY
TRSV
GEMV
GERU
AXPY
SCAL
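Besides OMP_NUM_THREADS, the MKL thread count can also be forced from code with MKL's service functions, which override the environment for MKL internals (a Fortran sketch; mkl_set_num_threads has been available since MKL 10.0):

```fortran
! Force MKL to use 4 threads regardless of OMP_NUM_THREADS; takes
! effect for subsequent MKL calls in this program.
external mkl_set_num_threads
integer  mkl_get_max_threads
external mkl_get_max_threads

call mkl_set_num_threads(4)
write (*,*) 'MKL will use up to', mkl_get_max_threads(), 'threads'
```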

Thanks,
Zhanghong Tang
Gennady_F_Intel
Moderator
Hello Zhanghong Tang,

Among these functions there are two (GERU and SCAL) that are not threaded at the moment. All the others are threaded, and for 160K sizes we do see different results for different numbers of threads.
Can you check how it works in the SMP case (not the cluster version you are using)? Just try to run your test program (160K) on one node only: rebuild it without the cluster components and check how it works with MKL_NUM_THREADS set to 1 and to 4.
That will help you understand which of MKL's routines affect scalability the most.
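That single-node check can also be prototyped quickly outside the Fortran build; a sketch in Python, assuming NumPy is linked against a threaded BLAS (often MKL), with the thread count pinned through the environment before the backend loads:

```python
import os
import time

# Pin the BLAS thread count before NumPy loads its backend; run the
# script twice (1 vs. 4) to compare. MKL_NUM_THREADS is MKL-specific;
# OMP_NUM_THREADS works for most backends.
os.environ.setdefault("OMP_NUM_THREADS", "1")

import numpy as np

n = 1000
rng = np.random.default_rng(0)
a = rng.random((n, n))
b = rng.random((n, n))

t0 = time.perf_counter()
c = a @ b                         # maps to DGEMM in the underlying BLAS
dt = time.perf_counter() - t0
print(f"{n}x{n} GEMM, OMP_NUM_THREADS={os.environ['OMP_NUM_THREADS']}: {dt:.3f} s")
```

If GEMM scales here but the full solver does not, the time is likely going into the memory-bound Level-1/Level-2 routines (AXPY, GEMV), which gain much less from extra cores.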
--Gennady