I decided to try to use OpenMP to parallelize some code. It was not working, and through trial and error I have found that the problem lies with a dynamic character string (its size depends on another variable passed to the subroutine).
Here is the simplified version of my code:
[bash]
character(len=Npart*3) :: key

!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i,j,key)
!$OMP DO
do r=1,basis_dim
   do i=1,Npart
      do j=i+1,Npart
         ! (some stuff)
         write(key,trim(formatting)) mlist, char(0)
         ! (some other stuff)
      enddo
   enddo
enddo
!$OMP END DO
!$OMP END PARALLEL
[/bash]
When I run this code, I get various segmentation faults. However, if I specify a fixed string length like
character(len=5) :: key
everything works perfectly.
Any ideas what the problem is?
Thank you for your help!
Alexis
CHARACTER(LEN=NPART*3) :: KEY
What is NPART? First, this must be declared as a PARAMETER! It cannot be a variable.
Max.
Npart is an integer that is initialized in the main calling program. It represents the number of particles that my code will simulate, and since this is chosen by the user, I can't set it as a parameter.
I thought that this would work fine with OpenMP 3, since allocatable arrays with the private attribute are allowed inside OpenMP parallel regions. I tried rewriting my code using an allocatable character array, but this also fails with a segmentation fault.
I also noticed strange behaviour with private allocatable integer arrays. Could my problems be related to this thread?
If you are having problems with this, then see if you can convert the entire code between !$OMP PARALLEL and !$OMP END PARALLEL into a subroutine and call that from within the parallel region. This may require you to pass args that you wouldn't ordinarily pass. I know this works because I have used it myself with some of the earlier compiler versions that had similar problems.
I have not tried this with a CONTAINS subroutine called from within a parallel region (to verify whether the problem exists with the host routine's local variables).
Jim Dempsey
If I simply hard-code the correct length of the "key" string for a test case:
[bash]
CHARACTER(LEN=15) :: KEY
[/bash]
I don't get any runtime errors, and the code produces correct results. This leads me to think that there is a bug in the compiler.
Might Npart be corrupt or uninitialized?
Prior to the !$OMP region, insert a break point and examine Npart. Make sure it is what you think it is.
In particular, if you do not have IMPLICIT NONE and Npart is not declared (in a COMMON or a module) and referenced in your code, then the compiler will generate code using a default integer reserved on the stack _with_ uninitialized data. This could be any kind of junk (e.g., a value too large for stack allocation); additionally, the junk value could be too small, resulting in your code destroying the return address for the call.
If that looks OK, then place the break point on the CALL in the (new) !$OMP PARALLEL region, run with 1 thread if you can, open a disassembly window after the break, then single-step into the routine. Don't be intimidated by assembly code; it is not all that hard to understand. You should be able to see the value the code obtains for Npart and the stack reservation (a subtraction from esp or rsp, or a call to alloca). Knowing what is going on will help you decide whether the problem is in your programming or in the compiler. I doubt this is a compiler error - but I do not discount the possibility either.
Jim Dempsey
Thank you for the suggestions. I have done some more testing, ensuring that Npart is initialized properly, and I still get the segmentation faults. Npart is very small (10-20), so I don't think this is the problem. However, I did narrow down the issue somewhat. It seems that the source of the segmentation faults is the Intel MKL library. When I compile a simple code and link the MKL library, I get the faults. Without MKL, I don't get any problems.
Here's a simple example.
sim.f90:
[bash]
program sim
   use test_module
   implicit none
   integer :: Npart
   Npart=5
   call test(Npart)
end program sim
[/bash]
test_module.f90:
[bash]
MODULE test_module
contains
subroutine test(Npart)
   implicit none
   integer, intent(in) :: Npart
   integer :: i
   character(len=Npart) :: key
   !$OMP PARALLEL PRIVATE(key)
   !$OMP DO
   do i=1,10
      write(key,*)'hello'
   enddo
   !$OMP END DO
   !$OMP END PARALLEL
end subroutine test
END MODULE test_module
[/bash]
makefile:
[bash]
F95ROOT = /opt/intel
FC = ifort
FFLAGS = -openmp -I$(F95ROOT)/include/ia32 -mkl=parallel

sim: sim.f90 test_module.o
	${FC} ${FFLAGS} -o sim sim.f90 test_module.o

test_module.o: test_module.f90
	${FC} ${FFLAGS} -c test_module.f90

clean:
	@${RM} *.o *.mod *~
[/bash]
When I compile this without MKL, I don't have any problems. When I compile with the makefile above, running the code sometimes works and sometimes gives me the following (which I find strange in itself):
[bash]
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image         PC        Routine   Line      Source
sim           080AA62C  Unknown   Unknown   Unknown
sim           0804A7E8  Unknown   Unknown   Unknown
libiomp5.so   00AC43ED  Unknown   Unknown   Unknown
libiomp5.so   00AA4B7E  Unknown   Unknown   Unknown
libiomp5.so   00AA03F1  Unknown   Unknown   Unknown
libiomp5.so   00A878BA  Unknown   Unknown   Unknown
sim           0804A6C3  Unknown   Unknown   Unknown
sim           0804A644  Unknown   Unknown   Unknown
sim           0804A5C4  Unknown   Unknown   Unknown
libc.so.6     0045AE37  Unknown   Unknown   Unknown
sim           0804A4D1  Unknown   Unknown   Unknown
[/bash]
Sometimes I have to run the code 5 or 6 times before I get the segmentation fault. My versions of ifort and MKL are:
ifort (IFORT) 12.0.0 20101006
mkl: whatever came with composerxe-2011.0.084
Thanks again for your help!
Try replacing
write(key,*) 'hello'
with
WRITE(*,*) omp_get_thread_num(), LEN_TRIM('x'//key//'x')
and see what you get for the length; since the trailing 'x' is non-blank, LEN_TRIM should report LEN(key)+2.
Then try adding COPYIN(KEY) to the !$OMP PARALLEL
And run the LEN_TRIM test again.
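A sketch of that first test (assuming USE OMP_LIB for omp_get_thread_num):
[bash]
!$OMP PARALLEL PRIVATE(key)
!$OMP DO
do i=1,10
   ! the trailing 'x' is non-blank, so LEN_TRIM returns LEN(key)+2,
   ! exposing the length of each thread's private copy of key
   write(*,*) omp_get_thread_num(), len_trim('x'//key//'x')
enddo
!$OMP END DO
!$OMP END PARALLEL
[/bash]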
Jim Dempsey
I finally got the code to work properly. Using the COPYIN clause didn't fix the segmentation faults, but changing PRIVATE(key) to FIRSTPRIVATE(key) did, strangely enough. After some more testing I discovered that while the segmentation fault was gone, my code did not produce correct or reproducible results when compiled with the -openmp flag. Even though "key" should have been private to each thread, I found that this was not the case, and all threads would be writing to the same key. It's as if the compiler ignored the PRIVATE/FIRSTPRIVATE option.
Eventually, I solved all my problems by declaring "key" as allocatable, and allocating/deallocating it within the scope of the OpenMP parallel region:
[bash]
character(len=:), allocatable :: key
...
!$OMP PARALLEL PRIVATE(key)
allocate(character(len=Npart) :: key)
...
deallocate(key)
!$OMP END PARALLEL
[/bash]
This new code works with MKL too (so it probably was never the culprit). I still think there is a problem with the compiler, as this new code seems to me to be completely equivalent to my old one. Thanks for your help!
[bash]
MODULE test_module
contains

! Variant 1: the worksharing DO is an orphaned construct inside the
! routine called from the parallel region
subroutine test_two(Npart)
   implicit none
   integer, intent(in) :: Npart
   integer :: i
   character(len=Npart) :: key
   !$OMP DO
   do i=1,10
      write(key,*)'hello'
   enddo
   !$OMP END DO
end subroutine test_two

subroutine test(Npart)
   implicit none
   integer, intent(in) :: Npart
   !$OMP PARALLEL
   call test_two(Npart)
   !$OMP END PARALLEL
end subroutine test

! ------------- .OR. ---------------

! Variant 2: the parallel DO stays in the caller; each call to test_two
! gets its own automatic-length key on that thread's stack
subroutine test_two(Npart)
   implicit none
   integer, intent(in) :: Npart
   character(len=Npart) :: key
   write(key,*)'hello'
end subroutine test_two

subroutine test(Npart)
   implicit none
   integer, intent(in) :: Npart
   integer :: i
   !$OMP PARALLEL
   !$OMP DO
   do i=1,10
      call test_two(Npart)
   enddo
   !$OMP END DO
   !$OMP END PARALLEL
end subroutine test

END MODULE test_module
[/bash]
Jim Dempsey
I then tried your suggestions with the simple test cases. I installed a trial version of Inspector XE 2011 to see if it could shed some light on the problem. When I have it analyze the simple test code above, it gives me the following:
warning: Cross-thread stack access (it highlights the following code: !$OMP PARALLEL)
error: Data race (it highlights the write statement: write(key,*)'hello')
I don't understand the data race error. If "key" is defined as private, how could a race condition occur? Also, when using your alternate versions, doesn't each thread get its own variable "key", since each thread calls a separate instance of the test_two subroutine?
I also don't understand the cross-thread stack access warning. Should I be concerned?
The way things are going, I'm probably going to completely abandon using OpenMP. While the speed of my code would greatly benefit, it seems like at each step of the way I am getting completely unexplained errors and problems.
The only cross-thread stack access would be for the length variable passed through the subroutines. This is read-only and should hardly warrant a warning.
Can you post your "broken" example code (a complete example)? I'll run it here (on Windows 7 x64).
Jim Dempsey
sim.f90
[bash]
program sim
   use test_module
   implicit none
   integer :: Npart
   Npart=5
   call test(Npart)
end program sim
[/bash]
test_module.f90
[bash]
MODULE test_module
contains
subroutine test_two(Npart)
   implicit none
   integer, intent(in) :: Npart
   character(len=Npart) :: key
   write(key,'(a5)')'hello'
end subroutine test_two

subroutine test(Npart)
   implicit none
   integer, intent(in) :: Npart
   integer :: i
   !$OMP PARALLEL
   !$OMP DO
   do i=1,10
      call test_two(Npart)
   enddo
   !$OMP END DO
   !$OMP END PARALLEL
end subroutine test
END MODULE test_module
[/bash]
Looking back, it seems I had inadvertently changed the cross-thread stack access detection setting to "Show problems/Hide warnings", which is probably why it reported the cross-thread stack access. However, I still get numerous data race errors using the last two prebuilt threading error analysis options.
Thank you again for your time in helping me figure this out.
Alexis
I doubt the data race reports have anything to do with your code. Most likely the OpenMP thread team build-up/tear-down is what triggers the race-condition reports.
Change your "do i=1,10" to "do i=1,100000" (at least 10 seconds of run time) and run a profile.
Jim Dempsey
I think you were right regarding the race condition and the OpenMP thread build-up/tear-down. Further investigation showed me that my new segmentation fault occurred when the OpenMP threads had to create large temporary matrices. Even though I have set up my environment to use an unlimited stack size (via "ulimit -s unlimited" in the terminal on my Linux box), I still got the segmentation fault with large matrices. Looking through an Intel document on debugging segmentation faults (I can't find the page now), I got the idea of compiling with the "-heap-arrays" flag, and it worked. The computation is now marginally slower, but I can live with that. Interestingly, the document did not recommend the -heap-arrays flag in combination with OpenMP. Should I be concerned?
Regardless, I can now move forward. My next problem is optimizing the code a little. It turns out that the biggest bottleneck is re-initializing a temporary array "rowtmp" of dimension "dim", which can get quite large (on the order of a million). This surprised me, since lots of other work happens inside the loop (including other loops), yet this initialization takes about 30-40% of the computational time.
[bash]
do r=1, dim
   rowtmp=0_r8
   ! ... do lots of stuff
enddo
[/bash]
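One common way to cut this cost (shown only as an illustration, not something from this thread; the bookkeeping names are made up) is to record which entries get written and clear just those instead of the whole array:
[bash]
integer :: ntouched            ! number of entries written last iteration
integer :: touched(maxnz)      ! their indices (maxnz is a hypothetical bound)

ntouched = 0
do r=1, dim
   ! clear only what the previous iteration dirtied, not all of rowtmp
   rowtmp(touched(1:ntouched)) = 0_r8
   ntouched = 0
   ! ... do lots of stuff, appending each written index to touched ...
enddo
[/bash]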
Thanks again for all your help!
"ulimit -s unlimited" will set the main (PROGRAM) thread's stack size to unlimited, but not the stacks of the OpenMP worker threads. Use the environment variable KMP_STACKSIZE (or OMP_STACKSIZE), or CALL KMP_SET_STACKSIZE_S(size) prior to your first parallel region.
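A sketch of the run-time call (assuming Intel's omp_lib module, which supplies kmp_set_stacksize_s and the kmp_size_t_kind kind constant):
[bash]
program big_stacks
   use omp_lib
   implicit none
   ! request 16 MB of stack per OpenMP thread,
   ! before the runtime creates the thread team
   call kmp_set_stacksize_s(int(16*1024*1024, kind=kmp_size_t_kind))
   !$omp parallel
   ! ... work that needs large automatic arrays ...
   !$omp end parallel
end program big_stacks
[/bash]
Equivalently, set KMP_STACKSIZE=16m (or OMP_STACKSIZE=16m) in the environment before launching the program.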
Heap arrays may be a better option on an x32 platform... run some tests to verify.
Note, "unlimited" stack size on x32 isn't really unlimited. The virtual address space (4GB) must get partitioned amongst static data, code, the main thread, the heap, and the stacks for each additional thread.
Your application may be constructed to use fewer than all threads to run a loop where each thread consumes large arrays on the stack. When all threads (on x32) are given large stacks, some threads will never use all of the (reserved) virtual address space partitioned out for them as stack.
Using heap arrays and controlling the number of threads taking massive amounts of "scratch" memory may be the only viable solution for running data-hog programs on x32 (MPI or coarrays is an alternative).
On x64 you could partition out tens of thousands of threads' worth of virtual address space for use as stack without much worry. However, using heap arrays (for temporaries used by fewer than all threads) may reduce page file consumption, since returned memory becomes available for use by any subsequent thread.
Note, in a long-running program where you allocate/deallocate these large blocks frequently, consider adding a pool for these blocks, as sketched below. If your application can initially allocate the working set, then you will have no subsequent issue with memory fragmentation (plus allocation/deallocation from your pool can be very low overhead).
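A minimal sketch of the pooling idea (hypothetical module and names; one reusable scratch block per thread, allocated once up front):
[bash]
module scratch_pool
   implicit none
   real(8), allocatable :: scratch(:)   ! one private block per thread
   !$omp threadprivate(scratch)
contains
   subroutine pool_init(n)
      integer, intent(in) :: n
      ! each thread allocates its block once; later parallel regions reuse it
      !$omp parallel
      if (.not. allocated(scratch)) allocate(scratch(n))
      !$omp end parallel
   end subroutine pool_init
end module scratch_pool
[/bash]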
Jim Dempsey