yang__xiaodong
New Contributor I
667 Views

Strange stack error causes program to terminate without an error


 

I'm an experienced Fortran programmer, yet a really strange problem has me confused.

As the title says, the program occasionally terminates after printing

"============================ start performing inversion ..."

It looks like the stack is being smashed.

The Visual Studio 2019 project (with oneAPI 2021.3 installed) and the files it needs are attached. You can run the program with

DCFI3D -a mod1

MKL libraries are needed. Could anyone help find the error? Thanks very much.

mecej4
Black Belt
581 Views

In order to reproduce the error, you require that we extract and run the EXE that you included in your zip file (GMSH.EXE), which is not only a large file (79 megabytes), but exposes the user to the possibility of viruses in such files. There could be questions regarding whether that file is permitted for open distribution in a forum such as this, as well.

Please run that EXE yourself to generate any data files, and provide those data files that DCFI3D needs in order to run. 

Along those lines, it would be far better if you can condense the program and data to a much smaller size.

yang__xiaodong
New Contributor I
558 Views

That was a lack of consideration on my part; the attached files have now been updated. Thanks for your attention.

yang__xiaodong
New Contributor I
552 Views

I'll try the old IVF compiler in PSXE to see whether the program works there.

 

IVF 19.0 behaves the same way; the problem is still there.

mecej4
Black Belt
488 Views

Have you tried using the /heap-arrays option, which will place local arrays on the heap instead of the stack, as a means of reducing the stack size needed? I ran your program after building with that option, and it stopped with an access violation in subroutine INV_SCRIPT.

yang__xiaodong
New Contributor I
464 Views
Sure, I've tried that option. However, the access violation occurs randomly in subroutine inv_script. Sometimes the program just stops without an error, so I cannot locate it. Do you have any advice for this kind of problem?
mecej4
Black Belt
450 Views

Here are a couple of suggestions to consider.

Run inside the Visual Studio debugger. When an access violation occurs, you may see more information about the line number, etc. In one instance, your program stopped with an attempt to access address 0000000000000024.

Try to isolate the problem. Capture the arguments passed to INV_SCRIPT into an unformatted file. Create a test program that just reads that file and calls INV_SCRIPT.

Try using a different compiler, such as Gfortran. However, your program uses features from the latest version of MKL, so this may not be feasible.
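
The capture-and-replay suggestion can be sketched roughly as below. This is only a minimal illustration: the real argument list of INV_SCRIPT is not shown in this thread, so the scalar n, the array x, and the file name inv_script_args.bin are hypothetical stand-ins. Both halves are in one program here so the round trip can be checked; in practice the second half would live in a separate test program that ends by calling INV_SCRIPT.

```fortran
program capture_replay
  implicit none
  integer :: n, u, i
  real(kind(1.0d0)), allocatable :: x(:), y(:)

  ! Hypothetical stand-ins for the arguments passed to INV_SCRIPT
  n = 5
  allocate(x(n))
  x = [(real(i, kind(1.0d0)), i = 1, n)]

  ! Capture: dump the sizes first, then the data, to an unformatted file
  open(newunit=u, file='inv_script_args.bin', form='unformatted', &
       access='sequential', status='replace', action='write')
  write(u) n
  write(u) x
  close(u)

  ! Replay: read the size, allocate, then read the data back
  open(newunit=u, file='inv_script_args.bin', form='unformatted', &
       access='sequential', status='old', action='read')
  read(u) n
  allocate(y(n))
  read(u) y
  close(u)

  ! In a real test driver, the call to INV_SCRIPT would go here
  print *, 'round trip ok: ', all(x == y)
end program capture_replay
```

Writing the sizes before the arrays is what lets the replay program allocate everything before reading it.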

yang__xiaodong
New Contributor I
407 Views

Under Debug mode, the program crashes at line 319 of FWD_INV.f90, when INV_SCRIPT calls the subroutine system_solver

"if (allocated(dcmod%solve)) deallocate(dcmod%solve)"

which, to my point, is a pretty standard clause.

I'm afraid that it's because the program continues to run with some internal ill-posed RAM and finally crashes.

It is quite hard to tell where the problem really is, so I'll try commenting out some parts.

jimdempseyatthecove
Black Belt
302 Views

>>"if (allocated(dcmod%solve)) deallocate(dcmod%solve)"

>>which, to my point, is a pretty standard clause.

Quite true....

.... provided that dcmod is defined

... provided that dcmod%solve is defined

 

Note, an allocatable variable/array/udt can have three states: allocated, deallocated, and undefined

And an undefined variable used as argument to allocate/deallocate/allocated/reference will result in undefined behavior.

 

I suggest in debug mode that you break at that statement and verify that dcmod is defined, then verify if dcmod%solve is defined.

By this I mean that the variables appear to have valid addresses.

 

Also, if your code is using POINTERs, then it may be using uninitialized pointers, or dereferencing a pointer that was once valid but no longer is. In other words, the addresses would look valid but point at something else, including space returned to the heap/stack.

Jim Dempsey

mecej4
Black Belt
278 Views

The only variables in the user's program with POINTER attribute have "sys_" in their names; dcmod%solve, etc., all have the ALLOCATABLE attribute and, therefore, their status is either allocated or unallocated -- they cannot have their status as undefined. That still leaves the possibilities of array bounds being exceeded, variables with values undefined, etc.
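
A small sketch of the point about ALLOCATABLE status, for anyone following along. The type model_t is a hypothetical stand-in for whatever type dcmod really has; only the component name solve comes from this thread.

```fortran
program alloc_status
  implicit none

  type :: model_t                      ! hypothetical stand-in for the type of dcmod
     real, allocatable :: solve(:)
  end type model_t

  type(model_t) :: dcmod

  ! An allocatable component starts out unallocated, never "undefined"
  print *, 'on entry:         ', allocated(dcmod%solve)

  allocate(dcmod%solve(10))
  print *, 'after allocate:   ', allocated(dcmod%solve)

  ! The same guarded deallocate as in the OP's code
  if (allocated(dcmod%solve)) deallocate(dcmod%solve)
  print *, 'after deallocate: ', allocated(dcmod%solve)
end program alloc_status
```

In a clean program this prints F, T, F. If the guarded deallocate crashes in a single-threaded run, the likelier explanation is that something else has overwritten the array descriptor or the heap bookkeeping, not that the statement itself is wrong.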

jimdempseyatthecove
Black Belt
186 Views

mecej4,

Unless Intel has fixed it, there was a long-standing issue in Intel Fortran with passing an unallocated array into an OMP parallel region using PRIVATE as opposed to FIRSTPRIVATE. In those cases the allocatable variables were undefined. (FIRSTPRIVATE copied in the array descriptor's unallocated state.)

Jim Dempsey

mecej4
Black Belt
144 Views

Jim, I ran OP's code as a single-thread program (i.e., without /Qopenmp), and I did observe the access violation even then. The OMP issues that you just mentioned, if encountered when the same program is compiled with /Qopenmp and run, would be additional complications and the OP's fix (tagged as the answer) may not fix those issues.

yang__xiaodong
New Contributor I
247 Views

Thanks very much for your comment. I commented out lots of lines and tried hard to isolate the problems.

Finally, I found that it is an MKL-related problem. I declared more memory than needed, which directly causes heap corruption without any warning. The program can then continue running, yet it may crash at any related memory access.

jimdempseyatthecove
Black Belt
447 Views

Have you enabled the compile-time diagnostics for interface checking, and the run-time diagnostics for reads of uninitialized variables and out-of-bounds array accesses? (Make the first test run without optimizations.)

Jim Dempsey

yang__xiaodong
New Contributor I
405 Views

Yeah, I've tried.

Under Debug mode with all options enabled, the program crashes without any error information at line 319 of FWD_INV.f90, when INV_SCRIPT calls the subroutine system_solver

"if (allocated(dcmod%solve)) deallocate(dcmod%solve)"

which, to my point, is a pretty standard clause.

I'm afraid that it's because the program continues to run with some internal ill-posed RAM and finally crashes.

It is quite hard to tell where the problem really is, so I'll try commenting out some parts.

mecej4
Black Belt
376 Views

Given the difficulty of debugging with a rather large data set, it may be worth the effort to see if the access violation can be exhibited with a much smaller test problem. Do you have such smaller input data files?

It is not clear what you mean by "internal ill-posed RAM". If you mean what is often called "memory corruption", that is certainly a possibility, and you could check by compiling with one of the /check options.

JohnNichols
Valued Contributor II
326 Views

It would not hurt to try pulling the statement apart, such as

 

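logical :: yesno
integer :: error
error = 0                                  ! stays 0 if no deallocation is attempted
yesno = allocated(dcmod%solve)             ! ALLOCATED takes no STAT= argument
write(*,*) yesno
if (yesno) then
   deallocate(dcmod%solve, stat = error)   ! a nonzero value means the deallocation failed
endif
write(*,*) error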

 

There are excellent reasons why Fortran compilers provide excellent error messages and it does not hurt to use them. 

Case in point: you send the program to someone and they tell you it does not work. It is a long road to solving the problem if there are no error messages.

Let the compiler worry about optimizing the code.  

 

 

JohnNichols
Valued Contributor II
321 Views

You have a very large generated mesh. Given that humans regularly make mistakes in data entry, the only way to see if the mesh is approximately correct is to view it in AutoCAD, Rhino, etc.

How do you view it?  

How do you assure people the code is correct? 

yang__xiaodong
New Contributor I
317 Views

Okay, I have now found where the problem is.

It is in my use of the MKL routine mkl_sparse_d_mm, which has the prototype

stat = mkl_sparse_d_mm (operation, alpha, A, descr, layout, B, columns, ldb, beta, C, ldc)

I passed a value for columns that was larger than what I need. As a result, C takes more memory than expected. This in turn causes heap corruption, and the Fortran compiler generates no error message for it.
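
For anyone hitting the same thing, a cheap guard before the call can catch this kind of mismatch. The sketch below does not call MKL at all; it only compares the columns and leading-dimension values you are about to pass against what was actually allocated. It assumes column-major layout, where a dense operand needs at least ld*columns elements. The helper fits_dense and the sizes m, k, ncols are hypothetical names made up for the illustration.

```fortran
program check_mm_dims
  use, intrinsic :: iso_fortran_env, only: int64, real64
  implicit none
  integer, parameter :: m = 4       ! rows of sparse A (illustrative)
  integer, parameter :: k = 3       ! columns of A = rows of B (illustrative)
  integer, parameter :: ncols = 2   ! columns of B and C actually allocated
  real(real64), allocatable :: B(:,:), C(:,:)
  integer :: columns, ldb, ldc
  logical :: ok

  allocate(B(k, ncols), C(m, ncols))
  ldb = size(B, 1)
  ldc = size(C, 1)

  ! Correct value: matches the allocated number of columns
  columns = ncols
  ok = fits_dense(B, ldb, columns) .and. fits_dense(C, ldc, columns)
  print *, 'columns =', columns, ' fits: ', ok

  ! Too large: the library would be told to touch memory beyond B and C
  columns = ncols + 3
  ok = fits_dense(B, ldb, columns) .and. fits_dense(C, ldc, columns)
  print *, 'columns =', columns, ' fits: ', ok

contains

  ! Hypothetical helper: true if an ld-by-ncol column-major block fits in X
  pure logical function fits_dense(X, ld, ncol)
    real(real64), intent(in) :: X(:,:)
    integer, intent(in) :: ld, ncol
    fits_dense = int(ld, int64) * int(ncol, int64) <= size(X, kind=int64)
  end function fits_dense

end program check_mm_dims
```

The first line reports T and the second F. In real code you would stop with an error message instead of printing, right before the call to mkl_sparse_d_mm.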


yang__xiaodong
New Contributor I
254 Views

People should take special care when using MKL functions, especially with input arguments such as the leading dimension, the number of columns, and so on.
