Solved: Pardiso does not want to utilize dual core

slavaua · ‎04-18-2010

Hello dear community members,

I am trying to get pardiso_unsym_c.c running on my dual core machine under Ubuntu 9.10 and use both cores.

I have tried export MKL_NUM_THREADS=2

Changed iparm[2] = 2;

And then ran make, which does:

icc -xK -w -I/opt/intel/Compiler/11.1/069/mkl/include source/pardiso_unsym_c.c -L"/opt/intel/Compiler/11.1/069/mkl/lib/32" "/opt/intel/Compiler/11.1/069/mkl/lib/32"/libmkl_solver.a "/opt/intel/Compiler/11.1/069/mkl/lib/32"/libmkl_intel.a -Wl,--start-group "/opt/intel/Compiler/11.1/069/mkl/lib/32"/libmkl_intel_thread.a "/opt/intel/Compiler/11.1/069/mkl/lib/32"/libmkl_core.a -Wl,--end-group -L"/opt/intel/Compiler/11.1/069/mkl/lib/32" -liomp5 -lpthread -lm -o _results/intel_parallel_32_lib/pardiso_unsym_c.out

export LD_LIBRARY_PATH="/opt/intel/Compiler/11.1/069/mkl/lib/32":/opt/intel/Compiler/11.1/069/lib/ia32:/opt/intel/Compiler/11.1/069/ipp/ia32/sharedlib:/opt/intel/Compiler/11.1/069/mkl/lib/32:/opt/intel/Compiler/11.1/069/tbb/ia32/cc4.1.0_libc2.4_kernel2.6.16.21/lib:/opt/intel/Compiler/11.1/069/mkl/lib/32; _results/intel_parallel_32_lib/pardiso_unsym_c.out >_results/intel_parallel_32_lib/pardiso_unsym_c.res

But I am getting:

Statistics:

===========

< Parallel Direct Factorization with #processors: > 1

Does anybody know why that may be happening and what I can do to fix it?

Here is cat /proc/cpuinfo

processor : 0

vendor_id : GenuineIntel

cpu family : 6

model : 15

model name : Intel Core2 Duo CPU T5550 @ 1.83GHz

stepping : 13

cpu MHz : 1000.000

cache size : 2048 KB

physical id : 0

siblings : 2

core id : 0

cpu cores : 2

apicid : 0

initial apicid : 0

fdiv_bug : no

hlt_bug : no

f00f_bug : no

coma_bug : no

fpu : yes

fpu_exception : yes

cpuid level : 10

wp : yes

flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc arch_perfmon pebs bts pni dtes64 monitor ds_cpl est tm2 ssse3 cx16 xtpr pdcm lahf_lm

bogomips : 3657.02

clflush size : 64

power management:

processor : 1

vendor_id : GenuineIntel

cpu family : 6

model : 15

model name : Intel Core2 Duo CPU T5550 @ 1.83GHz

stepping : 13

cpu MHz : 1000.000

cache size : 2048 KB

physical id : 0

siblings : 2

core id : 1

cpu cores : 2

apicid : 1

initial apicid : 1

fdiv_bug : no

hlt_bug : no

f00f_bug : no

coma_bug : no

fpu : yes

fpu_exception : yes

cpuid level : 10

wp : yes

flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc arch_perfmon pebs bts pni dtes64 monitor ds_cpl est tm2 ssse3 cx16 xtpr pdcm lahf_lm

bogomips : 3657.51

clflush size : 64

power management:

Thanks in advance for any help.

Gennady_F_Intel · ‎04-18-2010

Hello Slava,

Do you mean pardiso_unsym_c.c from solver's examples?

\examples\solver\source\

--Gennady

slavaua · ‎04-19-2010

Quoting Gennady Fedorov (Intel)

Hello Slava,
Do you mean pardiso_unsym_c.c from solver's examples?
\examples\solver\source\
--Gennady

Hi Gennady,

yes, I run it from there.

Gennady_F_Intel · ‎04-19-2010

Slava,

the input data of this example (5x5, nnz == 13) is far too small to show the multithreaded advantages of PARDISO. For such small inputs, PARDISO will always run in serial mode.

--Gennady

slavaua · ‎04-19-2010

Thanks for your reply.

I have modified the example code to try at least 50x50 or 400x400, but now I get a segmentation fault.

This is how I fill in the input matrix:

MKL_INT n = 50; /* 5 */
MKL_INT i, j, z;
double b[n], x[n], a[n*n], o;

z = 0;
for(i=0; i<n; i++)
  for(j=0; j<n; j++){
    o = (i+j)/0.89;
    if(i==j-1) o = 10;
    if(i==j+1) o = 2;
    a[z] = o;
    z++;
  }

for(i=0; i<n; i++) b[i] = (i*21 + 101) / 1.3;

MKL_INT ia[n+1];
for(i=0; i<n; i++) ia[i] = 1+(n*i);
ia[n] = n*n+1;

MKL_INT ja[n*n];
for(z=0;z<(n*n);z++) for(i=0;i=j+1;

Maybe because it does not have zeroes?
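
For reference, here is a minimal sketch of how the index arrays for a fully dense n x n matrix could be filled in one-based CSR. It is not the code from the attached file. The function name fill_dense_csr_indices is hypothetical, and plain int stands in for MKL_INT so the sketch compiles without MKL. As far as I understand the format, a matrix without zeroes is still legal CSR; every row simply lists all n columns.

```c
/* Fill one-based CSR index arrays for a fully dense n x n matrix.
   ia must hold n + 1 entries and ja must hold n * n entries.
   Assumption: plain int is used here; with MKL the type would be MKL_INT. */
void fill_dense_csr_indices(int n, int *ia, int *ja)
{
    for (int i = 0; i < n; i++) {
        ia[i] = 1 + n * i;            /* one-based position of the first entry of row i */
        for (int j = 0; j < n; j++)
            ja[i * n + j] = j + 1;    /* columns 1..n, ascending within the row */
    }
    ia[n] = n * n + 1;                /* one past the last stored entry */
}
```

The ia values match the formula used in the post above; only the ja fill differs. Separately, automatic arrays such as a[n*n] live on the stack, so for large n allocating them with malloc may be safer. Whether that played a part in the segmentation fault here is not stated in the thread.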

Gennady_F_Intel · ‎04-19-2010

First of all, please check that the sparse matrix is stored in valid CSR format.

iparm(27) switches on the matrix checker (that is iparm[26] with zero-based C indexing). Please refer to the MKL manual for details.

--Gennady
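
A rough idea of what such a check involves, as a standalone sketch rather than MKL's own checker: the function below tests a one-based CSR structure for the usual problems. The name check_csr_one_based and the return codes are made up for illustration, and int is used instead of MKL_INT. It assumes that column indices must be strictly ascending within each row, which is what PARDISO expects as far as I know.

```c
/* Returns 0 if the one-based CSR structure looks valid, otherwise a
   negative code for the first problem found (codes are illustrative).
   ia has n + 1 entries; ja has ia[n] - 1 entries. */
int check_csr_one_based(int n, const int *ia, const int *ja)
{
    if (ia[0] != 1)
        return -1;                          /* first row must start at entry 1 */
    for (int i = 0; i < n; i++) {
        if (ia[i + 1] < ia[i])
            return -2;                      /* row pointers must not decrease */
        for (int k = ia[i]; k < ia[i + 1]; k++) {
            int col = ja[k - 1];            /* k is one-based, C arrays are zero-based */
            if (col < 1 || col > n)
                return -3;                  /* column index out of range */
            if (k > ia[i] && ja[k - 2] >= col)
                return -4;                  /* columns must strictly increase within a row */
        }
    }
    return 0;
}
```

Calling it on ia and ja before the first pardiso call, and printing the return value, should point to the row-pointer or column-index array that is malformed.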

slavaua · ‎04-19-2010

Thanks Gennady, it was an issue with input data.

Now I am running performance tests and comparing runs with one core and with two cores.

I must admit I am a little puzzled. I am generating a 4000 x 4000 system and this is how long it takes to solve it:

Times:

======

Time fulladj: 0.650926 s

Time reorder: 1.434606 s

Time symbfct: 0.380290 s

Time malloc : 202.740895 s

Time total : 206.280643 s total - sum: 1.073926 s

As you can see, Time malloc takes almost all of the computing time. I wonder why?

And the only difference between 1 and 2 cores is in this bit:

Summary PARDISO: ( factorize to factorize )

================

Times:

======

Time A to LU: 0.000000 s

Time numfct : 9.179849 s

Time malloc : 0.000039 s

Time total : 9.179889 s total - sum: 0.000001 s

gflop/s for the numerical factorization: 4.647834

And with 2 cores involved it is:

Summary PARDISO: ( factorize to factorize )

================

Times:

======

Time A to LU: 0.000000 s

Time numfct : 5.419863 s

Time malloc : 0.000037 s

Time total : 5.419902 s total - sum: 0.000001 s

gflop/s for the numerical factorization: 7.872242

But the total time barely changes:

Times:

======

Time fulladj: 0.649804 s

Time reorder: 1.430529 s

Time symbfct: 0.333696 s

Time parlist: 0.000537 s

Time malloc : 202.318808 s

Time total : 205.811701 s total - sum: 1.078326 s

Does anybody know why?
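
To put the two factorization timings side by side, here is a small sketch that computes speedup and parallel efficiency from two wall-clock times. The function names are hypothetical and are not part of MKL. With the numfct times above (9.179849 s on one core, 5.419863 s on two), the speedup comes out at about 1.69 and the efficiency at roughly 0.85.

```c
/* Speedup of a parallel run relative to a serial run of the same phase. */
double speedup(double t_serial, double t_parallel)
{
    return t_serial / t_parallel;
}

/* Parallel efficiency: speedup divided by the number of threads used. */
double efficiency(double t_serial, double t_parallel, int threads)
{
    return speedup(t_serial, t_parallel) / threads;
}
```

This only covers the factorization phase. The reorder phase, where Time malloc is about 202 s in both runs, is unaffected by the second core, which is why the overall run time hardly moves.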

Gennady_F_Intel · ‎04-19-2010

Hi Slava,

I am a little puzzled as well by such results (Time malloc : 202.740895 s) :-).

Your input is pretty small for sparse solvers. As an example, in this thread Time malloc : 0.825073 s for allocating ~1.5*10^9 nnz.

I have no guesses yet as to why it happens; I need to do some experiments. Can you give more details about your system?

CPU type, RAM, 32 or 64 bit...

--Gennady

slavaua · ‎04-20-2010

Hi Gennady,

thanks for trying to help :)

CPU is Intel Core2 Duo CPU T5550 @ 1.83GHz - you can see this info earlier in the thread.

The installed system is Ubuntu 9.10, 32-bit version, compiling with lib32.

This machine has 3 GBs of RAM.

slavaua · ‎04-20-2010

I think it will help if I post the source code here. This is just the modified file from the examples folder.

Just change the variable n to set the matrix dimension.

http://software.intel.com/file/26550

Gennady_F_Intel · ‎04-20-2010

I don't understand this statement:

for(z=0;z<(n*n);z++) for(i=0;i=j+1;

slavaua · ‎04-21-2010

Sorry Gennady, I probably attached the wrong file. This line should read:

for(z=0;z

I have re-uploaded the file.

Gennady_F_Intel · ‎04-22-2010

Slava,

you wrote:

Time malloc : 202.740895 s

Time total : 206.280643 s total - sum: 1.073926 s

Could you please switch off the matching mechanism (set iparm[12] = 0) and see the results?

--Gennady

slavaua · ‎04-22-2010

Gennady, this has made a huge difference. Thank you!

Now 5000 x 5000 runs a lot faster.

I have a question, however. These are the times:

Summary PARDISO: ( reorder to reorder )

================

Times:

======

Time fulladj: 0.999903 s

Time reorder: 2.324973 s

Time symbfct: 0.587419 s

Time parlist: 0.000738 s

Time malloc : 0.553460 s

Time total : 5.853783 s total - sum: 1.387291 s

Summary PARDISO: ( factorize to factorize )

================

Times:

======

Time A to LU: 0.000000 s

Time numfct : 9.328830 s

Time malloc : 0.000039 s

Time total : 9.328870 s total - sum: 0.000001 s

Summary PARDISO: ( solve to solve )

================

Times:

======

Time solve : 0.076216 s

Time total : 0.367927 s total - sum: 0.291711 s

So the total execution time is the sum of the total times above?

Gennady_F_Intel · ‎04-22-2010

Yes, that sum should be the total execution time.
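
As a quick worked check, summing the three "Time total" values from the run above (5.853783 s, 9.328870 s and 0.367927 s) gives 15.550580 s. The helper below is only an illustrative sketch with a hypothetical name; it adds up per-phase totals copied by hand from the PARDISO statistics output.

```c
/* Sum the "Time total" values reported for each PARDISO phase.
   phase_totals holds count values in seconds, copied from the statistics output. */
double total_time(const double *phase_totals, int count)
{
    double sum = 0.0;
    for (int i = 0; i < count; i++)
        sum += phase_totals[i];
    return sum;
}
```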

slavaua · ‎04-27-2010

Thank you Gennady, once again, your support is priceless.

I am back with another problem, however :) http://software.intel.com/en-us/forums/showthread.php?t=73238&p=2#117309