Software Archive
Read-only legacy content

Issues with executing PARDISO, dcsrilu0_example0 MKL routines on Xeon Phi co-processor

manredd
Beginner


Dear MIC people,

I am facing problems executing the Intel MKL routines PARDISO and dcsrilu0 on the Xeon Phi coprocessor.
Please go through the report below. I would be very grateful if you could point out any problems with my setup.

The configuration of the Xeon Phi machine follows:

==============================================================
Compiler:
$ which icc
/opt/intel/composer_xe_2013_sp1.2.144/bin/intel64/icc
$ icc --version
icc (ICC) 14.0.2 20140120
Copyright (C) 1985-2014 Intel Corporation.  All rights reserved.

MKL toolkit
/opt/intel/composer_xe_2013_sp1.2.144/mkl

Host CPU Info
...
processor       : 7
vendor_id       : GenuineIntel
cpu family      : 6
model           : 62
model name      : Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz
stepping        : 4
cpu MHz         : 1200.000
...

Xeon Phi coprocessor
...
processor       : 239
vendor_id       : GenuineIntel
cpu family      : 11
model           : 1
model name      : 0b/01
stepping        : 3
cpu MHz         : 1052.630
cache size      : 512 KB
...

$ cat /proc/meminfo
MemTotal:        7882352 kB
MemFree:         6109436 kB
...

==============================================================

I am facing problems running the following examples supplied in the MKL toolkit.

examples_core/solverc/pardiso_unsym_c.c
examples_core/solverc/dcsrilu0_exampl1.c

I am using the following make options.
make sointel64 interface=ilp64
make sointel64 interface=lp64

I am creating large sparse matrices in CSR format in Python using the scipy package.
Characteristics of some of the matrices are as follows:

CSR matrix is 10240x10240. Number of nonzeros 1058707.
CSR matrix is 15554x15554. Number of nonzeros 2434660.
CSR matrix is 16384x16384. Number of nonzeros 2700567.
...

I am able to successfully solve systems with these matrices in Python using numpy.linalg.solve and scipy.

I am also able to successfully execute the examples on the host processor (varying the number of MKL threads from 1 to 8) for all these matrices. However, I am facing problems running these examples on the Xeon Phi coprocessor (varying the number of MKL threads from 1 to 240): the segmentation faults happen consistently.

I tried automatic offload, compiler-assisted offload, and manually copying the executable to the Xeon Phi and running it natively. The failures occur in all three scenarios.

Output from one of the executions is shown below:

===================================================================
MAX MKL threads 240.
CHANGING: Reading COO matrix from MM file CSRMM.mtx.
Time to read MM file 13.443625 seconds.
Converting COO matrix to CSR matrix.
Time to convert coo to csr format 14.641642 seconds.
CSR matrix is 16384x16384. Number of nonzeros 2700567.
PARDISO...

=== PARDISO: solving a real nonsymmetric system ===
The local (internal) PARDISO version is                          : 103911000
1-based array indexing is turned ON
PARDISO double precision computation is turned ON
METIS algorithm at reorder step is turned ON
Scaling is turned ON
Matching is turned ON


Summary: ( reordering phase )
================

Times:
======
Time spent in calculations of symmetric matrix portrait (fulladj): 1.078529 s
Time spent in reordering of the initial matrix (reorder)         : 11.579866 s
Time spent in symbolic factorization (symbfct)                   : 1.920166 s
Time spent in data preparations for factorization (parlist)      : 0.076818 s
Time spent in allocation of internal data structures (malloc)    : 20.075173 s
Time spent in additional calculations                            : 5.905185 s
Total time spent                                                 : 40.635737 s

Statistics:
===========
< Parallel Direct Factorization with number of processors: > 240
< Numerical Factorization with BLAS3 and O(n) synchronization >

< Linear system Ax = b >
             number of equations:           16384
             number of non-zeros in A:      2700567
             number of non-zeros in A (%): 1.006040

             number of right-hand sides:    1

< Factors L and U >
             number of columns for each panel: 72
             number of independent subgraphs:  0
< Preprocessing with state of the art partitioning metis>
             number of supernodes:                    1120
             size of largest supernode:               15394
             number of non-zeros in L:                122143939
             number of non-zeros in U:                121034313
             number of non-zeros in L+U:              243178252
Time for PARDISO 40.637273 seconds.

Reordering completed ...
Number of nonzeros in factors = 243178252
Number of factorization MFLOPS = 2518180
=== PARDISO is running in In-Core mode, because iparam(60)=0 ===
Percentage of computed non-zeros for LL^T factorization
Segmentation fault
===================================================================

The segmentation faults always happen in the numerical factorization phase.

I have attached the files (pardiso and dcsrilu0) for your perusal.

Could you please let me know how I can debug these failures in MKL routines?

Thank you for all the help.

Best Regards
Manredd
 

Zhang_Z_Intel
Employee

The makefiles you used from the MKL examples only build executables for host execution. To build the examples for MIC, you need to edit the makefiles to add MIC-specific compile and link options. See the "MKL link line advisor" for the proper options: https://software.intel.com/en-us/articles/intel-mkl-link-line-advisor

Notes:

  • These examples can only be built to run "natively" on MIC. The functions used in the samples do not support automatic offload.
  • It's possible to do "compiler-assisted offload" with these functions, but you need to change the code by adding offload pragmas; see the sketch after this list.
  • The MKL functions used in these examples, namely PARDISO and the iterative solver, have not been optimized for MIC.

Giang_Bui
Beginner

Hello Zhang,

I also faced the same problem. The instructions from the "MKL link line advisor" did not work for me; I had to set the compile options "-mmic -mkl=parallel". Note that PARDISO runs well on MIC with 1 thread but fails with n>1 threads. Moreover, my matrix is small enough not to overload one thread. Based on that, I would assume there is a bug in the numerical factorization of MKL PARDISO on MIC. Any other ideas?

Can you also comment on when the optimized version of PARDISO for MIC will be released?

Zhang_Z_Intel
Employee

When building your application for native execution on MIC, the most reliable compile/link options are:

 -openmp -I$(MKLROOT)/include -mmic -L$(MKLROOT)/lib/mic -lmkl_intel_lp64 -lmkl_core -lmkl_intel_thread -lpthread -lm

Since you were able to run it with 1 thread on MIC, you must have built it correctly. I'm not aware of a bug that causes multi-threaded execution of PARDISO on MIC to fail, so I'm interested in seeing more details about your case. Can you please provide your test matrix, the type of the matrix, and your 'iparm' settings? Also, how did you set the number of threads?

We have an ongoing effort to improve PARDISO on MIC, but I cannot say exactly when an optimized version of PARDISO for MIC will be available.
