I decided to try to use OpenMP to parallelize some code. It was not working, and through trial and error I have found that the problem lies with a dynamic character string (its size depends on another variable passed to the subroutine).
Here is the simplified version of my code:
[bash]
character(len=Npart*3) :: key

!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i,j,key)
!$OMP DO
do r=1,basis_dim
   do i=1,Npart
      do j=i+1,Npart
         ! (some stuff)
         write(key,trim(formatting)) mlist, char(0)
         ! (some other stuff)
      enddo
   enddo
enddo
!$OMP END DO
!$OMP END PARALLEL
[/bash]
When I run this code, I get various segmentation faults. However, if I specify a fixed string length like
character(len=5) :: key
everything works perfectly.
Any ideas what the problem is?
Thank you for your help!
Alexis
CHARACTER(LEN=NPART*3) :: KEY
What is NPART? First, this must be declared as a PARAMETER! It cannot be a variable.
Max.
Npart is an integer that is initialized in the main calling program. It represents the number of particles that my code will simulate, and since this is chosen by the user, I can't set it as a parameter.
I thought that this would work fine with OpenMP 3, since allocatable arrays with the private attribute are allowed inside OpenMP parallel regions. I tried rewriting my code using an allocatable character array, but this also fails with a segmentation fault.
I also noticed strange behaviour with private allocatable integer arrays. Could my problems be related to this thread?
If you are having problems with this, then see if you can convert the entire code between !$OMP PARALLEL and !$OMP END PARALLEL into a subroutine and call that from within the parallel region. This may require you to pass args that you wouldn't ordinarily pass. I know this works because I have used it myself with some of the earlier compiler versions that had similar problems.
I have not tried this with a CONTAINS subroutine called from within a parallel region (to verify whether the problem exists with the host routine's local variables).
Jim Dempsey
If I simply hard-code the correct length of the "key" string for a test case:
[bash]
CHARACTER(LEN=15) :: KEY
[/bash]
I don't get any runtime errors, and the code produces correct results. This leads me to think that there is a bug in the compiler.
Might Npart be corrupt or uninitialized?
Prior to the !$OMP region, insert a break point and examine Npart. Make sure it is what you think it is.
In particular, if you do not have IMPLICIT NONE and Npart is not declared (in a COMMON or a module) and referenced in your code, then the compiler will generate code using a default integer reserved on the stack _with_ uninitialized data. This could be any kind of junk (e.g., a value too large for stack allocation); additionally, the junk value could be too small, resulting in your code destroying the return address for the call.
If that looks OK, then place the break point on the CALL in the (new) !$OMP PARALLEL region, run with 1 thread if you can, open a disassembly window after the break, then single-step into the routine. Don't be intimidated by assembly code; it is not all that hard to understand. You should be able to see the value the code obtains for Npart and the stack reservation (a subtraction from esp or rsp, or a call to alloca). Knowing what is going on will help you decide whether the problem is in your programming or in the compiler. I doubt this is a compiler error - but I do not discount the possibility either.
Jim Dempsey
Thank you for the suggestions. I have done some more testing, ensuring that Npart is initialized properly, and I still get the segmentation faults. Npart is very small (10-20), so I don't think this is the problem. However, I did narrow down the issue somewhat. It seems that the source of the segmentation faults is the Intel MKL library. When I compile a simple code and link the MKL library, I get the faults. Without MKL, I don't get any problems.
Here's a simple example.
sim.f90:
[bash]
program sim
   use test_module
   implicit none
   integer :: Npart
   Npart=5
   call test(Npart)
end program sim
[/bash]
test_module.f90:
[bash]
MODULE test_module
contains
subroutine test(Npart)
   implicit none
   integer, intent(in) :: Npart
   integer :: i
   character(len=Npart) :: key
   !$OMP PARALLEL PRIVATE(key)
   !$OMP DO
   do i=1,10
      write(key,*)'hello'
   enddo
   !$OMP END DO
   !$OMP END PARALLEL
end subroutine test
END MODULE test_module
[/bash]
makefile:
[bash]
F95ROOT = /opt/intel
FC = ifort
FFLAGS = -openmp -I$(F95ROOT)/include/ia32 -mkl=parallel

sim: sim.f90 test_module.o
	${FC} ${FFLAGS} -o sim sim.f90 test_module.o

test_module.o: test_module.f90
	${FC} ${FFLAGS} -c test_module.f90

clean:
	@${RM} *.o *.mod *~
[/bash]
When I compile this without MKL, I don't have any problems. When I compile with the makefile above, running the code sometimes works and sometimes gives me the following (which I find strange in itself):
[bash]
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image         PC        Routine   Line      Source
sim           080AA62C  Unknown   Unknown   Unknown
sim           0804A7E8  Unknown   Unknown   Unknown
libiomp5.so   00AC43ED  Unknown   Unknown   Unknown
libiomp5.so   00AA4B7E  Unknown   Unknown   Unknown
libiomp5.so   00AA03F1  Unknown   Unknown   Unknown
libiomp5.so   00A878BA  Unknown   Unknown   Unknown
sim           0804A6C3  Unknown   Unknown   Unknown
sim           0804A644  Unknown   Unknown   Unknown
sim           0804A5C4  Unknown   Unknown   Unknown
libc.so.6     0045AE37  Unknown   Unknown   Unknown
sim           0804A4D1  Unknown   Unknown   Unknown
[/bash]
Sometimes I have to run the code 5 or 6 times before I get the segmentation fault. My versions of ifort and MKL are:
ifort (IFORT) 12.0.0 20101006
mkl: whatever came with composerxe-2011.0.084
Thanks again for your help!
Try replacing
write(key,*) 'hello'
with
WRITE(*,*) omp_get_thread_num(), LEN_TRIM('x'//key//'x')
and see what you get for the length; since the trailing 'x' is non-blank, LEN_TRIM should report LEN(key)+2.
Then try adding COPYIN(KEY) to the !$OMP PARALLEL
And run the LEN_TRIM test again.
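A sketch of that first test (assuming USE OMP_LIB for omp_get_thread_num):
[bash]
!$OMP PARALLEL PRIVATE(key)
!$OMP DO
do i=1,10
   ! the trailing 'x' is non-blank, so LEN_TRIM returns LEN(key)+2,
   ! exposing the length of each thread's private copy of key
   write(*,*) omp_get_thread_num(), len_trim('x'//key//'x')
enddo
!$OMP END DO
!$OMP END PARALLEL
[/bash]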
Jim Dempsey
I finally got the code to work properly. Using the COPYIN clause didn't fix the segmentation faults, but changing PRIVATE(key) to FIRSTPRIVATE(key) did, strangely enough. After some more testing I discovered that while the segmentation fault was gone, my code did not produce correct or reproducible results when compiled with the -openmp flag. Even though "key" should have been private to each thread, I found that this was not the case, and all threads would be writing to the same key. It's as if the compiler ignored the PRIVATE/FIRSTPRIVATE option.
Eventually, I solved all my problems by declaring "key" as allocatable, and allocating/deallocating it within the scope of the OpenMP parallel region:
[bash]
character(len=:), allocatable :: key
...
!$OMP PARALLEL PRIVATE(key)
allocate(character(len=Npart) :: key)
...
deallocate(key)
!$OMP END PARALLEL
[/bash]
This new code works with MKL too (so it probably was never the culprit). I still think there is a problem with the compiler, as this new code seems to me to be completely equivalent to my old one. Thanks for your help!
[bash]
MODULE test_module
contains

! Variant 1: the worksharing DO is an orphaned construct inside the
! routine called from the parallel region
subroutine test_two(Npart)
   implicit none
   integer, intent(in) :: Npart
   integer :: i
   character(len=Npart) :: key
   !$OMP DO
   do i=1,10
      write(key,*)'hello'
   enddo
   !$OMP END DO
end subroutine test_two

subroutine test(Npart)
   implicit none
   integer, intent(in) :: Npart
   !$OMP PARALLEL
   call test_two(Npart)
   !$OMP END PARALLEL
end subroutine test

! ------------- .OR. ---------------

! Variant 2: the parallel DO stays in the caller; each call to test_two
! gets its own automatic-length key on that thread's stack
subroutine test_two(Npart)
   implicit none
   integer, intent(in) :: Npart
   character(len=Npart) :: key
   write(key,*)'hello'
end subroutine test_two

subroutine test(Npart)
   implicit none
   integer, intent(in) :: Npart
   integer :: i
   !$OMP PARALLEL
   !$OMP DO
   do i=1,10
      call test_two(Npart)
   enddo
   !$OMP END DO
   !$OMP END PARALLEL
end subroutine test

END MODULE test_module
[/bash]
Jim Dempsey
I then tried your suggestions with the simple test cases. I installed a trial version of Inspector XE 2011 to see if it could shed some light on the problem. When I have it analyze the simple test code above, it gives me the following:
warning: Cross-thread stack access (it highlights the following code: !$OMP PARALLEL)
error: Data race (it highlights the write statement: write(key,*)'hello')
I don't understand the data race error. If "key" is defined as private, how could a race condition occur? Also, when using your alternate versions, doesn't each thread get its own variable "key", since each thread calls a separate instance of the test_two subroutine?
I also don't understand the cross-thread stack access warning. Should I be concerned?
The way things are going, I'm probably going to completely abandon using OpenMP. While the speed of my code would greatly benefit, it seems like at each step of the way I am getting completely unexplained errors and problems.
The only cross-thread stack access would be for the length variable passed through the subroutines. This is read-only and should hardly warrant a warning.
Can you post your "broken" example code (a complete example)? I'll run it here (on Windows 7 x64).
Jim Dempsey
sim.f90
[bash]
program sim
   use test_module
   implicit none
   integer :: Npart
   Npart=5
   call test(Npart)
end program sim
[/bash]
test_module.f90
[bash]
MODULE test_module
contains
subroutine test_two(Npart)
   implicit none
   integer, intent(in) :: Npart
   character(len=Npart) :: key
   write(key,'(a5)')'hello'
end subroutine test_two

subroutine test(Npart)
   implicit none
   integer, intent(in) :: Npart
   integer :: i
   !$OMP PARALLEL
   !$OMP DO
   do i=1,10
      call test_two(Npart)
   enddo
   !$OMP END DO
   !$OMP END PARALLEL
end subroutine test
END MODULE test_module
[/bash]
Looking back, it seems I had inadvertently changed the cross-thread stack access detection setting to "Show problems/Hide warnings", which is probably why it reported the cross-thread stack access. However, I still get numerous data race errors using the last two prebuilt threading error analysis options.
Thank you again for your time in helping me figure this out.
Alexis
I doubt the data race reports have anything to do with your code. Most likely the OpenMP thread team build-up/tear-down is what triggers the race-condition reports.
Change your "do i=1,10" to "do i=1,100000" (at least 10 seconds of run time) and run a profile.
Jim Dempsey
I think you were right regarding the race condition and the OpenMP thread build-up/tear-down. Further investigation showed me that my new segmentation fault occurred when the OpenMP threads had to create large temporary matrices. Even though I have set up my environment to use an unlimited stack size (via "ulimit -s unlimited" in the terminal on my Linux box), I still got the segmentation fault with large matrices. Looking through an Intel document on debugging segmentation faults (I can't find the page now), I got the idea of compiling with the "-heap-arrays" flag, and it worked. The computation is now marginally slower, but I can live with that. Interestingly, the document did not recommend the -heap-arrays flag in combination with OpenMP. Should I be concerned?
Regardless, I can now move forward. My next problem is optimizing the code a little. It turns out that the biggest bottleneck is re-initializing a temporary array "rowtmp" of dimension "dim", which can get quite large (on the order of a million). This surprised me, since lots of other work happens inside the loop (including other loops), yet this initialization takes about 30-40% of the computational time.
[bash]
do r=1, dim
   rowtmp=0_r8
   ! ... do lots of stuff
enddo
[/bash]
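One common way to cut this cost (shown only as an illustration, not something from this thread; the bookkeeping names are made up) is to record which entries get written and clear just those instead of the whole array:
[bash]
integer :: ntouched            ! number of entries written last iteration
integer :: touched(maxnz)      ! their indices (maxnz is a hypothetical bound)

ntouched = 0
do r=1, dim
   ! clear only what the previous iteration dirtied, not all of rowtmp
   rowtmp(touched(1:ntouched)) = 0_r8
   ntouched = 0
   ! ... do lots of stuff, appending each written index to touched ...
enddo
[/bash]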
Thanks again for all your help!
"ulimit -s unlimited" will set the main (PROGRAM) thread's stack size to unlimited, but not the stacks of the OpenMP worker threads. Use the environment variable KMP_STACKSIZE (or OMP_STACKSIZE), or CALL KMP_SET_STACKSIZE_S(size) prior to your first parallel region.
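A sketch of the run-time call (assuming Intel's omp_lib module, which supplies kmp_set_stacksize_s and the kmp_size_t_kind kind constant):
[bash]
program big_stacks
   use omp_lib
   implicit none
   ! request 16 MB of stack per OpenMP thread,
   ! before the runtime creates the thread team
   call kmp_set_stacksize_s(int(16*1024*1024, kind=kmp_size_t_kind))
   !$omp parallel
   ! ... work that needs large automatic arrays ...
   !$omp end parallel
end program big_stacks
[/bash]
Equivalently, set KMP_STACKSIZE=16m (or OMP_STACKSIZE=16m) in the environment before launching the program.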
Heap arrays may be a better option on an x32 platform... run some tests to verify.
Note, "unlimited" stack size on x32 isn't really unlimited. The virtual address space (4GB) must get partitioned amongst static data, code, the main thread, the heap, and the stacks for each additional thread.
Your application may be constructed to use fewer than all threads to run a loop where each thread consumes large arrays on the stack. When all threads (on x32) are given large stacks, some threads will never use all of the (reserved) virtual address space partitioned out for them as stack.
Using heap arrays and controlling the number of threads taking massive amounts of "scratch" memory may be the only viable solution for running data-hog programs on x32 (MPI or coarrays is an alternative).
On x64 you could partition out tens of thousands of threads' worth of virtual address space for use as stack without much worry. However, using heap arrays (for temporaries used by fewer than all threads) may reduce page file consumption, since returned memory becomes available for use by any subsequent thread.
Note, in a long-running program where you allocate/deallocate these large blocks frequently, consider adding a pool for these blocks, as sketched below. If your application can initially allocate the working set, then you will have no subsequent issue with memory fragmentation (plus allocation/deallocation from your pool can be very low overhead).
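A minimal sketch of the pooling idea (hypothetical module and names; one reusable scratch block per thread, allocated once up front):
[bash]
module scratch_pool
   implicit none
   real(8), allocatable :: scratch(:)   ! one private block per thread
   !$omp threadprivate(scratch)
contains
   subroutine pool_init(n)
      integer, intent(in) :: n
      ! each thread allocates its block once; later parallel regions reuse it
      !$omp parallel
      if (.not. allocated(scratch)) allocate(scratch(n))
      !$omp end parallel
   end subroutine pool_init
end module scratch_pool
[/bash]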
Jim Dempsey