GRAPE__ELIAS · ‎12-22-2013

Hi,

I have a control volume code that calls PARDISO several times to solve a large sparse system. The matrix is different on each call because its entries change with time. The options I am currently using to compile my code are

<code> -openmp -traceback -g -check all -fpe0 -fp-stack-check -warn all -gen-interfaces -r8 -debug -fp-model strict -auto -module $(OBJDIR) -I$(MKL)/include </code>

and for the link line I use
<code> -L$(MKL)/lib/intel64 -lmkl_intel_lp64 -lmkl_core -lmkl_intel_thread -lpthread -lm </code>

In the past I have also tried

<code> -openmp -traceback -g -check all -fpe0 -fp-stack-check -warn all -gen-interfaces -r8 -debug -fp-model strict -auto -module $(OBJDIR) -mkl=parallel </code>

<code> -lpthread -lm </code>

which also compiles successfully. The Fortran compiler I am using is intel-fc/14.1.106. The version of MKL I am using is whatever is bundled with that version of ifort (I THINK it's Composer XE 2013 SP1). The specs of the machine I am running my code on can be found at the following link (http://nci.org.au/nci-systems/national-facility/peak-system/raijin/detailed-system-configuration/).

I have followed the instructions found on http://software.intel.com/en-us/articles/determining-root-cause-of-sigsegv-or-sigbus-errors. That is, I have tried compiling both with and without -heap-arrays, and I have tried "ulimit -s unlimited" and "ulimit -s 999999999999". I have set KMP_STACKSIZE to values ranging from 16M to 2GB. I have also set MKL_PARDISO_OOC_MAX_CORE_SIZE=16384 and MKL_PARDISO_OOC_MAX_SWAP_SIZE=16384. Within my code, pardisoinit is called at the beginning of my solve subroutine using

<code> Call pardisoinit(pt,mtype,iparm) </code>

where mtype = 11. I then set the values

iparm(1)=1

iparm(2)=3

iparm(24)=1

iparm(25)=0

iparm(27)=1

iparm(60)=0

and then call pardiso four more times using phase = 11, 22, 33 and then -1. This will work for several time steps, but at some point I will get an error message, usually in the form of a segfault

<code>

forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
main 0000000000A807F9 Unknown Unknown Unknown
main 0000000000A7F170 Unknown Unknown Unknown
main 0000000000A357E2 Unknown Unknown Unknown
main 00000000009E6F68 Unknown Unknown Unknown
main 00000000009EB0AB Unknown Unknown Unknown
libpthread.so.0 0000146C1FBFA500 Unknown Unknown Unknown
libmkl_core.so 0000146C21ACF025 Unknown Unknown Unknown
libmkl_core.so 0000146C21AC80AD Unknown Unknown Unknown
libmkl_core.so 0000146C21AC7D7A Unknown Unknown Unknown
libmkl_intel_thre 0000146C20775F0F Unknown Unknown Unknown
libmkl_intel_thre 0000146C20776107 Unknown Unknown Unknown
libmkl_intel_thre 0000146C207761C3 Unknown Unknown Unknown
libmkl_intel_thre 0000146C20776594 Unknown Unknown Unknown
libiomp5.so 0000146C1F6D4623 Unknown Unknown Unknown

</code>

This error NEVER appears at the same iteration, i.e. the program can run through 10 time steps before producing this error, and then I can run the program again and it'll run for 1000 time steps. In addition, I have sometimes got two different errors. One is some kind of glibc error that produces a really long output (unfortunately, I haven't kept an output file for this one; if I get it again I'll post it if someone thinks it may be useful). The other error I have got from my code states "*** glibc detected *** ./main: free(): corrupted unsorted chunks: 0x00002b91bb025040 ***".

I know the program crashes during the phase=11 stage because I put print statements before and after each call of PARDISO. I have tried to find out more information by setting msglvl=1; however, this problem does not occur when msglvl=1. I will post a stripped down version of the code once I get the OK from my manager. Lastly, I would also like to add that if the program doesn't crash, it produces results that seem reasonable (or even correct against known test cases).

Also, I apologise in advance: I'm new to this forum and I'm not sure what tags I'm meant to use for code... My background is not in computer science/engineering or programming in general. I have been using Fortran 90 for approximately 5 years now, but that has all been self-taught, reading bits and pieces from the internet, and I'm fairly sure my knowledge is very spotty. Also, I have never used Valgrind (I noticed that seems to be the first response to segfaults around here), but I have used a tiny bit of TotalView (they're basically the same, right?), though I have never been successful in finding the source of any of my problems using it... Printing random strings of text before and after where I suspect problems are occurring is how I usually fix my problems...

Thanks in advance to anybody who spends some time trying to help me here

Eugene

GRAPE__ELIAS · ‎12-22-2013

I forgot to add that on the cluster I request 1 node with 8 cpus, I have set OMP_NUM_THREADS=8 and KMP_NUM_THREADS=8, I have 32GB of ram and 160GB of HD space. After the program crashes the cluster reports

Memory Used: 5gb

Vmem Used: 6gb

Walltime Used: 02:40:59

CPU Time Used: 04:19:10

NCPUs Used: 8

The low speed up is partly due to the fact that I've removed all of the OpenMP directives, it's running in serial except for when PARDISO is called. Also, if I compile with -O3 -xHost -ip -ipo -axCORE-AVX2, usually the speedup (CPU time/Wall time) is of the order of 5.5 using 8 processors.

GRAPE__ELIAS · ‎12-22-2013

Please find attached a 7z of a stripped down version of my code with the Makefile and input file.

To run the program after building it, type

./main test.in

I've also attached the PBS queue script so you know what environment I'm running the code in.

Thanks in advance

Eugene

Gennady_F_Intel · ‎12-22-2013

Eugene,

Can you give us a comprehensive test (plus input data) for reproducing the problem on our side? This is the fastest way to identify the reason for the failure.

--Gennady

GRAPE__ELIAS · ‎12-22-2013

Hi Gennady

I assume you mean the matrices that I want to solve? The files readin.f90, amf.f90 and solve.f90 generate the matrices that are solved by PARDISO, so you should have everything you need (other than comments in the code... sorry about that, I was told to get rid of them before posting it here). I also just realised I attached the wrong input file... see the new attached file.

readin.f90 defines the geometry at some point, amf.f90 calculates some coefficients and fill.f90 fills in the non-zero components of each row of the matrix. The matrix is stored in array 'ap', and the RHS of the equation is stored in the array 'af'. These are then converted to the compressed row format in the subroutine 'coup_dsolve' in solve.f90 which also calls PARDISO.
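Since intermittent heap corruption in a solver can also be triggered by a malformed input structure, one independent check that could be run on the converted arrays just before phase 11 is sketched below. This is only a sketch under assumptions: the names check_csr, n, ia and ja are hypothetical and are not taken from the attached solve.f90, and the check assumes 1-based indexing with column indices in strictly increasing order within each row, which is the layout PARDISO expects. It does not replace the built-in matrix checker enabled by iparm(27)=1.

```fortran
! Sketch only: check_csr, n, ia and ja are illustrative names and are
! not taken from the attached solve.f90.
module csr_checks
  implicit none
contains
  ! Returns 0 if the 1-based CSR structure is consistent, otherwise a
  ! positive code identifying the first problem found.
  integer function check_csr(n, ia, ja) result(info)
    integer, intent(in) :: n       ! number of equations
    integer, intent(in) :: ia(:)   ! row pointers, expected size n+1
    integer, intent(in) :: ja(:)   ! column indices, expected size ia(n+1)-1
    integer :: i, k

    info = 0
    if (size(ia) /= n + 1) then    ! wrong number of row pointers
      info = 1
      return
    end if
    if (ia(1) /= 1) then           ! first row must start at entry 1
      info = 2
      return
    end if
    do i = 1, n
      if (ia(i+1) < ia(i)) then    ! row pointers must not decrease
        info = 3
        return
      end if
    end do
    if (size(ja) /= ia(n+1) - 1) then   ! ja length must match ia(n+1)
      info = 4
      return
    end if
    do i = 1, n
      do k = ia(i), ia(i+1) - 1
        if (ja(k) < 1 .or. ja(k) > n) then   ! column outside 1..n
          info = 5
          return
        end if
        if (k > ia(i)) then
          if (ja(k) <= ja(k-1)) then         ! columns not strictly increasing
            info = 6
            return
          end if
        end if
      end do
    end do
  end function check_csr
end module csr_checks

program demo_check_csr
  use csr_checks
  implicit none
  ! 3x3 pattern with 5 non-zeros: row 1 = {1,2}, row 2 = {2}, row 3 = {1,3}
  integer :: ia(4) = (/ 1, 3, 4, 6 /)
  integer :: ja(5) = (/ 1, 2, 2, 1, 3 /)

  print '(a,i0)', 'valid structure:   ', check_csr(3, ia, ja)
  ja(5) = 4   ! deliberately put a column index outside 1..n
  print '(a,i0)', 'corrupt structure: ', check_csr(3, ia, ja)
end program demo_check_csr
```

In the demo, the first call returns 0 and the second returns 5 (column index out of range). Calling such a check on every time step and stopping on a non-zero code would at least rule the conversion step in or out as the source of the corruption.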

Changing lines 2-19 of test.in will directly influence the coefficients of the matrices.

Cheers

Eugene

GRAPE__ELIAS · ‎12-22-2013

In case you haven't already done so, if you change line 121 in solve.f90 to msglvl=1 and then compile and run you'll see the following with the input file I've attached to this forum.

=== PARDISO: solving a real nonsymmetric system ===
The local (internal) PARDISO version is : 103911000
1-based array indexing is turned ON
PARDISO double precision computation is turned ON
Parallel METIS algorithm at reorder step is turned ON
Scaling is turned ON
Matching is turned ON

Summary: ( reordering phase )
================

Times:
======
Time spent in calculations of symmetric matrix portrait (fulladj): 0.078204 s
Time spent in reordering of the initial matrix (reorder) : 1.156664 s
Time spent in symbolic factorization (symbfct) : 0.295954 s
Time spent in data preparations for factorization (parlist) : 0.018211 s
Time spent in allocation of internal data structures (malloc) : 0.193398 s
Time spent in additional calculations : 0.327489 s
Total time spent : 2.069920 s

Statistics:
===========
< Parallel Direct Factorization with number of processors: > 8
< Numerical Factorization with BLAS3 and O(n) synchronization >

< Linear system Ax = b >
number of equations: 543120
number of non-zeros in A: 4810712
number of non-zeros in A (%): 0.001631

number of right-hand sides: 1

< Factors L and U >
number of columns for each panel: 72
number of independent subgraphs: 0
number of supernodes: 308543
size of largest supernode: 748
number of non-zeros in L: 46473251
number of non-zeros in U: 41022475
number of non-zeros in L+U: 87495726

I cannot reproduce any of the errors when msglvl=1, but some sort of error will always happen within 24hrs of wall time when msglvl=0. This is true even for the test case which I've attached before. Ideally, I'd like to keep msglvl=0, since this information is not particularly useful for me and makes my output file that much larger...

Gennady_F_Intel · ‎12-23-2013

I used your example as is, and here is what I see on my side: ./main test.in

main: Start
readin: Opening input file: test.in
readin: Debug!
ny,ny2,nyedge 186 163 25
readin: Debug!
nz,nzbot,nzl,nztop 730 186 360 186
zero: nn, nnn = 135780 543120
main:j, dy,y
0 4.457833222196250E-003 0.000000000000000E+000

...................................................

readin: sum ian 41426
readin: initial Max rne, rnp 0.312075467163009 312075467.163009
readin: initial Min rne, rnp 0.000000000000000E+000 0.000000000000000E+000
connectivity_table: nxy, nxyz = 186 135780
main: Starting loop 1
amft: intializing
amft: begin fill
amft: tstep 1.000000000000000E-010
main: Starting inner 1

The program crashed before the first MKL (PARDISO) routine, somewhere inside the poiss subroutine, because I don't see the output of this statement: Print *, "poiss: res and rel: ", res, rel

GRAPE__ELIAS · ‎12-23-2013

Hi Gennady,

Thanks for your prompt responses. The subroutine poiss calls coup_dsolve (line 27), which in turn calls PARDISO. The call to coup_dsolve is well before you get to that print statement. That said, I don't think I've ever had this program crash on the first iteration... Does the program exit with a stack trace telling where it failed, or is it similar to mine stating it's somewhere in libpthread.so.0?

Eugene

GRAPE__ELIAS · ‎12-23-2013

Also, feel free to edit the Makefile as you see fit (i.e. some of the optimizations, such as -axCORE-AVX2, may not be suitable for the computer you are running it on). At the moment, there are two lines that can define FFLAGS. The one that is uncommented is what I would normally use to run the code, whilst the commented line holds the compiler options I use to debug my code.

Gennady_F_Intel · ‎12-23-2013

Ok, I will check again how it works on my side. Regarding AVX2: yes, sure. The first thing I did was remove the AVX2 options altogether, because my systems support AVX instructions only.

GRAPE__ELIAS · ‎01-05-2014

Happy New Year everyone!

I realize many will still be on holidays, but if anyone looking into my problem has come back to work, it would be nice to know if any progress has been made about my inquiry (e.g. has my error been reproduced on your end?).

Cheers

Gennady_F_Intel · ‎01-05-2014

I don't see the problem on my side. The example works well. Here is the log of the latest run:

readin: sum ian 41426
readin: initial Max rne, rnp 0.312075467163009 312075467.163009
readin: initial Min rne, rnp 0.000000000000000E+000 0.000000000000000E+000
connectivity_table: nxy, nxyz = 186 135780
main: Starting loop 1
amft: intializing
amft: begin fill
amft: tstep 1.000000000000000E-010
main: Starting inner 1
==================== Pardiso, Phase == 11 ==================
poiss: res and rel: 17053661.0891833 1.601550078587388E-014
main: Starting inner 2
==================== Pardiso, Phase == 11 ==================
poiss: res and rel: 17053661.3839509 1.601550106269188E-014
main: Starting inner 3
==================== Pardiso, Phase == 11 ==================
poiss: res and rel: 17053661.3536814 1.601550103426507E-014
main: Starting inner 4
==================== Pardiso, Phase == 11 ==================
poiss: res and rel: 17053661.2850739 1.601550096983415E-014
Main: time is now 2.000000000000000E-010

.............................

............................

main: Starting loop 2000

amft: intializing
amft: begin fill
amft: tstep 1.000000000000000E-010
main: Starting inner 1
==================== Pardiso, Phase == 11 ==================
poiss: res and rel: 393711423.555126 3.720627873926340E-013
main: Starting inner 2
==================== Pardiso, Phase == 11 ==================
poiss: res and rel: 392391086.247422 3.708150525443943E-013
main: Starting inner 3
==================== Pardiso, Phase == 11 ==================
poiss: res and rel: 392391063.392136 3.708150309404636E-013
main: Starting inner 4
==================== Pardiso, Phase == 11 ==================
poiss: res and rel: 392391063.070478 3.708150306343708E-013
Main: time is now 2.001000000000030E-007

I will check once more, in case the problem happens sporadically, and let you know the results...

GRAPE__ELIAS · ‎01-05-2014

Thanks Gennady!

Assuming you are unable to reproduce any of the messages I get with that code, and knowing I've followed all the steps provided on the page http://software.intel.com/en-us/articles/determining-root-cause-of-sigsegv-or-sigbus-errors, what would you suggest my next steps should be to find out what the problem is?

Gennady_F_Intel · ‎01-10-2014

Eugene,

After several additional attempts, I reproduced the issue twice: the problem happened after 185 and after 627 iterations. The cause of the issue is not clear and requires additional investigation. The problem has been escalated. We will keep you updated if there is any news. Thanks for reporting the issue.

/gf

GRAPE__ELIAS · ‎01-27-2014

Hi Gennady,

I just found out that if I set iparm(2)/=3 (i.e. iparm(2)=0 or iparm(2)=2), my program doesn't crash.
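For anyone hitting the same crash, a minimal sketch of this workaround is given below. It keeps the iparm values listed in my first post and only switches the reordering choice. The module name pardiso_options, the subroutine set_iparm and the REORDER_* constants are hypothetical names made up for this illustration and do not appear in the attached code. The intent is that set_iparm would be called right after pardisoinit and before phase 11.

```fortran
! Sketch only: pardiso_options, set_iparm and the REORDER_* constants are
! illustrative names, not taken from the attached solve.f90.
module pardiso_options
  implicit none
  integer, parameter :: REORDER_MIN_DEGREE = 0  ! iparm(2)=0: minimum degree
  integer, parameter :: REORDER_METIS      = 2  ! iparm(2)=2: nested dissection (METIS)
  integer, parameter :: REORDER_METIS_OMP  = 3  ! iparm(2)=3: parallel (OpenMP) nested dissection
contains
  ! Overrides the iparm entries used in this thread; all other entries
  ! are left as they were on entry (e.g. as set by pardisoinit).
  subroutine set_iparm(iparm, reorder)
    integer, intent(inout) :: iparm(64)
    integer, intent(in)    :: reorder
    iparm(1)  = 1        ! do not fall back to the solver defaults
    iparm(2)  = reorder  ! fill-in reducing ordering
    iparm(24) = 1        ! parallel factorization control
    iparm(25) = 0        ! parallel forward/backward solve
    iparm(27) = 1        ! matrix checker on
    iparm(60) = 0        ! in-core mode
  end subroutine set_iparm
end module pardiso_options

program demo_set_iparm
  use pardiso_options
  implicit none
  integer :: iparm(64)

  iparm = 0
  ! Workaround: avoid REORDER_METIS_OMP, which is the setting that crashed.
  call set_iparm(iparm, REORDER_METIS)
  print '(a,i0)', 'iparm(2) = ', iparm(2)
end program demo_set_iparm
```

The demo prints iparm(2) = 2. Keeping the reordering choice in one named argument makes it easy to switch back to the parallel ordering once the cause of the crash is understood.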

I'm not sure if this is related, but I recently tried to replace PARDISO with MUMPS and had a very similar problem, which was resolved by replacing mumps_par%ICNTL(28) = 2 with mumps_par%ICNTL(28) = 1. I THINK using MUMPS and setting mumps_par%ICNTL(28) = 2 and mumps_par%ICNTL(29) = 2 is equivalent to using PARDISO and setting iparm(2)=3 since the documentation in both state "the ParMetis parallel ordering tool will be used to reorder the input matrix" and "The parallel (OpenMP) version of the nested dissection algorithm" for MUMPS and PARDISO respectively.

Hope this info helps in finding out why my program was crashing before.
Eugene

Gennady_F_Intel · ‎01-27-2014

Thanks for letting us know about that. We will check how the parallel version of the nested dissection algorithm affects this.

PARDISO crashes randomly with different messages