I'm experiencing a rather odd circumstance and I'm looking for any advice on how to diagnose it or fix it. I'm implementing a sparse matrix solver, and I'm dividing up a matrix-vector product over a team of OpenMP threads using a do loop with static scheduling and balanced chunks of my matrix.
The problem is, my loop variable for the OpenMP do loop is getting optimized away when optimizations are turned on (-O1, -O2, -O3) and the loop is being run more times than intended.
In my debugging environment, I can only work with one thread ($OMP_NUM_THREADS=1 by admin), so this "loop" should behave like serial code. However, my debug messages indicate that my loop variable is going beyond 1, and idbc reports when I'm inside the loop
(idb) print i
Info: symbol i is defined but not allocated (optimized away)
Error: no value for symbol i
Cannot evaluate 'i'.
How should I go about figuring out what ifort has done in this optimization? Superficially, this acts like a bug, but I'm uncomfortable making that assertion without seeing exactly what the optimizations have done.
Thanks,
Jonathan
Consider (inside parallel region)
(I is private)
[fortran]
do I = omp_get_thread_num() + 1, yourUpper, omp_get_num_threads()
...
end do
[/fortran]
Jim Dempsey
When debugging with one thread, add a variable
integer, volatile :: iCopy
Then inside your loop add
iCopy = i
If you are debugging with multiple threads, then iCopy can be an array (dimensioned from 0, since thread numbers start at 0), then
iCopy(omp_get_thread_num()) = i
You can use a private variable to keep a copy of omp_get_thread_num()
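A minimal sketch of the volatile-copy trick for the multi-thread case, assuming the code is built with OpenMP enabled. The loop bound of 8 and the name tid are made up for illustration; iCopy is the array suggested above.

```fortran
program volatile_copy_demo
  use omp_lib
  implicit none
  integer :: i, tid
  ! volatile keeps the compiler from optimizing the copies away,
  ! so a debugger can always read them
  integer, volatile, allocatable :: iCopy(:)

  ! one slot per thread, indexed from 0 like omp_get_thread_num()
  allocate(iCopy(0:omp_get_max_threads() - 1))
  iCopy = 0

  !$omp parallel private(i, tid)
  tid = omp_get_thread_num()      ! private copy of the thread number
  !$omp do schedule(static)
  do i = 1, 8
     iCopy(tid) = i               ! inspect iCopy(tid) instead of i in the debugger
  end do
  !$omp end do
  !$omp end parallel

  print *, 'last i seen by each thread:', iCopy
  ! with a static schedule, the thread that ran iteration 8 ran it last
  if (maxval(iCopy) /= 8) error stop 'loop did not reach its upper bound'
end program volatile_copy_demo
```

In the debugger, print iCopy(tid) rather than i.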
Jim Dempsey
Thanks Jim,
The volatile variable helped, and will likely be very useful in the future. What it has revealed is that the optimized instructions ifort created do not check my loop bounds. Debugging statements printing off the value of my loop variable have been modified to print the expected value; the actual value (as reported by the volatile variable) does not change.
Is there a minimal expected number of loop iterations built into -O1, -O2, and -O3 which would remove the first loop bounds check?
Thanks,
Jonathan
Fortran does not perform loops the same way that C/C++ performs loops.
In Fortran, the DO iteration space is examined at entry to DO to produce an iteration count (i.e. it becomes do the loop N times). From that point on the loop control variable might be a) not used, b) registerized, c) in the event of unrolling, advanced by the unrolled count. At loop termination, the value of the loop control variable is the last one used (not next), or the initial setting should the loop not iterate.
If you manage to insert a break point into an optimized loop, the debugger should tell you the loop control variable is not available due to being registerized. If you are not seeing this message then either the debugger figured out the registerization or may be showing you the out of synch value of the non-registerized loop control variable.
Also note, modifying the loop control variable within the loop does not alter the iteration count.
And, inserting the iCopy=i in the loop may interfere with unrolling.
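A small sketch of the "iteration count is fixed at entry" point. The names n and trips are made up for illustration. Changing the bound variable inside the loop is legal, since only the DO variable itself may not be modified, but it has no effect on how many times the body runs.

```fortran
program trip_count_demo
  implicit none
  integer :: i, n, trips

  n = 3
  trips = 0
  do i = 1, n        ! the iteration count (3) is computed here, once
     n = 10          ! changing the bound afterwards does not change the count
     trips = trips + 1
  end do

  print *, 'trips =', trips, ' n =', n
  if (trips /= 3) error stop 'iteration count was re-evaluated'
end program trip_count_demo
```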
Jim Dempsey
jimdempseyatthecove wrote:
Fortran does not perform loops the same way that C/C++ performs loops.
In Fortran, the DO iteration space is examined at entry to DO to produce an iteration count (i.e. it becomes do the loop N times). From that point on the loop control variable might be a) not used, b) registerized, c) in the event of unrolling, advanced by the unrolled count. At loop termination, the value of the loop control variable is the last one used (not next), or the initial setting should the loop not iterate.
If you manage to insert a break point into an optimized loop, the debugger should tell you the loop control variable is not available due to being registerized. If you are not seeing this message then either the debugger figured out the registerization or may be showing you the out of synch value of the non-registerized loop control variable.
Also note, modifying the loop control variable within the loop does not alter the iteration count.
And, inserting the iCopy=i in the loop may interfere with unrolling.
Jim Dempsey
In practice, the situation with C for loops is complicated enough that I've never seen it fully described. If you don't conform with optimizable patterns set by individual compilers, performance will suffer, or, with OpenMP, you don't get parallelization.
As Jim says, in Fortran, the number of iterations (for the case without EXIT or equivalent) is determined before entering the loop. Modifying the loop counter inside the loop is an error (since 1977). Some compilers may permit it in certain contexts (such as an EXIT block), as an extension. In C, with optimization, pre-calculated loop count may also be the case, but due to the standard not requiring it, there are more cases to deal with.
Contrary to what Jim said, a DO loop which terminates normally (not by EXIT...) will set the loop index variable to the next value, analogous to what you expect in C. Parallelization introduces possibilities in both Fortran and C for behavior to change; I've caught myself ignoring this problem.
TimP,
From my: C:\Program Files (x86)\Intel\Composer XE 2011 SP1\Documentation\en_US\compiler_f\cl\index.htm
After termination, the DO variable retains its last value (the one it had when the iteration count was tested and found to be zero).
Is the document wrong?
Apparently so (there may be a compiler switch to alter this behavior)
[fortran]
DO I=1,3
  WRITE(*,*) I
ENDDO
WRITE(*,*) I
! Output:
!   1
!   2
!   3
!   4
[/fortran]
Jim Dempsey
That "last value" is the value that the loop index variable had when the loop count was tested and found to be zero, and not the last value with which the body of the loop was actually executed.
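To make that concrete, here is a short sketch (the variable trips is made up for illustration) covering a strided loop and a zero-trip loop. As I read the standard, the iteration count is max((m2 - m1 + m3) / m3, 0), and the DO variable is incremented after every executed iteration, including the last one.

```fortran
program do_variable_after_loop
  implicit none
  integer :: i, trips

  ! Stride 3: count = max((10 - 1 + 3) / 3, 0) = 4, so the body sees 1, 4, 7, 10
  ! and i is incremented once more to 13 before the count is found to be zero.
  trips = 0
  do i = 1, 10, 3
     trips = trips + 1
  end do
  print *, 'strided:   trips =', trips, ' i after loop =', i
  if (trips /= 4 .or. i /= 13) error stop 'unexpected result for strided loop'

  ! Zero-trip loop: count = max((1 - 5 + 1) / 1, 0) = 0, the body never runs,
  ! and i keeps the initial value 5 it was given at loop entry.
  trips = 0
  do i = 5, 1
     trips = trips + 1
  end do
  print *, 'zero-trip: trips =', trips, ' i after loop =', i
  if (trips /= 0 .or. i /= 5) error stop 'unexpected result for zero-trip loop'
end program do_variable_after_loop
```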
The ambiguity is:
Is the test made at the top or bottom of the loop? Or initial top, then subsequently at the bottom?
Is the loop control variable stride-stepped at the top (after initial test), or at the bottom before test?
Regardless, the document should be clear on what happens (and what happens should be consistent with standards when it addresses the issue).
Jim
Fortran standard is clearer on this point than the ifort document. At the time when the (f77) standard was adopted, compilers varied in whether they tested at the top or bottom, or even switched with optimization level. A compiler I used had 3 different treatments as side effects of other options, one of which conformed with f77.
It took quite a while for some compilers to comply with this, and I still don't count on it for cases involving parallelism, e.g. where the loop induction variable might need to be firstprivate or lastprivate (which aren't allowed).
Thank you all for helping to clarify the loop test. There are a few points I need to make in reference to the past posts.
jimdempseyatthecove wrote:
"If you manage to insert a break point into an optimized loop, the debugger should tell you the loop control variable is not available due to being registerized. If you are not seeing this message then either the debugger figured out the registerization or may be showing you the out of synch value of the non-registerized loop control variable."
The variable is defined, but not allocated. I interpret this to mean that the variable is not being used at all. Copying the value to a volatile variable proved problematic, as the behavior of the program changed (-O0 started experiencing errors later in the code).
My problem is that it seems the loop count is not effectively being tested after the first iteration; it would not matter if the test was at the beginning or end of the code segment. The iteration count is 1 for this test case, and it is continuing to iterations 2 and 3 before causing a segmentation fault when optimization levels -O1, -O2, or -O3 are used. Using -O0 results in functional (though obviously slower) code.
TimP (Intel) wrote:
"As Jim says, in Fortran, the number of iterations (for the case without EXIT or equivalent) is determined before entering the loop."
No EXIT or equivalent is present. Since the number of iterations is dependent on run-time conditions (number of OpenMP threads available), this cannot be pre-calculated by the compiler. Is there a way to access the --actual-- iteration count and current iteration used in the assembly compare instruction? Either the number of iterations is not being calculated correctly at run-time, the current iteration is not correct, or the test is not being performed after completing all instructions corresponding to the code block of the loop (necessary regardless of the position of the test in assembly instructions). I suspect those are the three most likely reasons that the bounds of my do loop are not being respected.
Does this sound reasonable? If so, any suggestions for hunting down the cause? I very much suspect this will lead to a bug report, but since my program has dependencies of MKL, Intel Lapack and BLAS, and a separate sparse matrix solving library, I'd like to get all my ducks in a row to explain the issue. Otherwise, the appropriate development team would not have much to work on.
Thanks,
Jonathan
*Correction: -O1 is not behaving as nicely as -O2 and -O3 are.
Segmentation fault upon entering a single region:
Program received signal SIGSEGV
__kmp_enter_single () in /opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64/libiomp5.so
(idb) backtrace
#0 0x00002af9cc38d71a in __kmp_enter_single () in /opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64/libiomp5.so
#1 0x00002af9cc370e16 in __kmpc_single () in /opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64/libiomp5.so
I thought this had been cleaned up by recompiling my sparse matrix solver library, but it turns out I was wrong. Still, I seriously doubt this issue is related to the original one, so I'll start a thread on it later if it still annoys me. The target optimization for the final program is -O3.
So, let's restrict optimizations considered for this to -O2 and -O3, which experience the same symptoms.
Jonathan
If you intend to run a DO loop over num_threads, would you not use something like
use omp_lib
!$omp parallel private(nt)
nt=omp_get_num_threads()
!$omp do
do i = 1,nt
  ....
end do
!$omp end parallel
Nearly everything in OpenMP depends on loop counts being calculated before entering the loop. So there's no reason here to rebel against the Fortran standard. In fact, if you have an OpenMP loop in which later iterations may have nothing to do, you typically need to let them spin without exit, e.g.
if(nomorework)cycle
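A minimal sketch of that cycle pattern, assuming a made-up cutoff nwork after which iterations have nothing to do. EXIT is not permitted out of an OpenMP worksharing loop, so the idle iterations simply cycle.

```fortran
program cycle_demo
  implicit none
  integer :: i, nwork, done

  nwork = 3          ! hypothetical: only the first 3 iterations have work
  done = 0

  !$omp parallel do reduction(+:done)
  do i = 1, 8
     if (i > nwork) cycle     ! skip the idle iteration instead of leaving the loop
     done = done + 1
  end do
  !$omp end parallel do

  print *, 'iterations that did work:', done
  if (done /= 3) error stop 'unexpected amount of work'
end program cycle_demo
```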
This is an abbreviation of what I'm working with:
[fortran]integer, dimension(:), allocatable :: chunk
...
!$omp parallel shared(chunk)
!$omp master
allocate(chunk(omp_get_num_threads()+1))
...
! find appropriate divisions of sparse
! matrix for omp_get_num_threads()
! pieces
...
!$omp end master
!$omp end parallel
...
!$omp parallel shared(chunk, ...) &
!$omp& private(i,j,k,...)
! allocate local result array
...
!$omp do schedule(static)
do i=1,size(chunk)-1
  do j=1,...
    do k=...
      localResult(j) = localResult(j) + Nonzero(k) * InputVector(k)
    end do
  end do
  !$omp critical
  do j=1,MatrixSize
    FinalResult(j) = FinalResult(j) + localResult(j)
  end do
  !$omp end critical
  ...
end do
!$omp end do
!$omp end parallel
deallocate(chunk)[/fortran]
If I'm forced to write N copies where N is the maximum number of threads per node I anticipate using, I can, but that's not really the cleanest code.
I should also add that this scheme has worked fine in the past - this error only popped up when I took advantage of Hermitian symmetry and added the reduction phase (critical section).
Jonathan,
In the sketch code you showed, localResult would have to be private. This may be an omission in producing the code snip, but it could also be a coding oversight in your current code.
Also, the sketch code does not show how the thread unique "i"'s disambiguate the data. IOW is there some code between "do i" and "do j" that selects the stripe of data unique to each thread?
Jim Dempsey
Hi Jim,
You are correct, localResult must be private, and it is in my actual code; I paraphrased too quickly. And despite the fact that I edited the code example, the j loop edit did not make it into the post.
[fortran]do j=chunk(i)+1,chunk(i+1)[/fortran] are the bounds on the j loop. The chunk array is what disambiguates the data. Also, MatrixSize is a global parameter, Nonzero is a global allocated array (both of which are encapsulated in a module), and finalResult is shared.
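For anyone who wants something to compile, here is a self-contained sketch of the same scheme on a tiny matrix. It is not the actual solver code: the CSR arrays rowPtr, colIdx and val, the vectors x and y, and the fixed two-chunk split are all assumptions made for illustration, and the matrix is stored in full (no Hermitian symmetry).

```fortran
program chunked_spmv
  implicit none
  integer, parameter :: n = 4, nnz = 6, nChunks = 2
  ! 4x4 test matrix in CSR form (hypothetical data):
  !   [2 0 0 1]
  !   [0 3 0 0]
  !   [0 0 4 0]
  !   [1 0 0 5]
  integer :: rowPtr(n+1) = [1, 3, 4, 5, 7]
  integer :: colIdx(nnz) = [1, 4, 2, 3, 1, 4]
  real    :: val(nnz)    = [2., 1., 3., 4., 1., 5.]
  real    :: x(n)        = [1., 1., 1., 1.]
  real    :: y(n), localResult(n)
  ! chunk(i)+1 .. chunk(i+1) are the rows that belong to chunk i
  integer :: chunk(nChunks+1) = [0, 2, 4]
  integer :: i, j, k

  y = 0.0
  !$omp parallel private(i, j, k, localResult) shared(chunk, rowPtr, colIdx, val, x, y)
  !$omp do schedule(static)
  do i = 1, size(chunk) - 1
     localResult = 0.0                       ! private partial product
     do j = chunk(i) + 1, chunk(i+1)
        do k = rowPtr(j), rowPtr(j+1) - 1
           localResult(j) = localResult(j) + val(k) * x(colIdx(k))
        end do
     end do
     !$omp critical
     y = y + localResult                     ! reduction into the shared result
     !$omp end critical
  end do
  !$omp end do
  !$omp end parallel

  print *, y
  if (any(abs(y - [3., 3., 4., 6.]) > 1.0e-6)) error stop 'wrong product'
end program chunked_spmv
```

The loop over i runs once per chunk regardless of the thread count, so with one thread it should simply process the chunks in order.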
Jonathan
>>do j=chunk(i)+1,chunk(i+1)
Insert if(j.eq.chunk(i)+1) write(*,*) omp_get_thread_num(), chunk(i)+1,chunk(i+1)
(insert immediately following do j=...)
Just to verify non-overlapping chunks
Jim Dempsey
Relevant output:
[plain] Time to load hamiltonian into memory: 3.0000000E-03
Chunk:
0
43
Size of Chunk: 2
Time to complete Lanczos iteration: 6.0999999E-03
Dimension of localResult: 43
Dimension of workingArray: 129
Pointer to input vector first element: 87
Pointer to output vector first element: 44
Bounds on loop var i: 1 1
Bounds on 1 th loop var j: 1 43
Bounds on 1 th loop var k: 1 339
Loop var analysis complete.
Made it into i loop. Variable i= 1
Zeroed working array.
0 1 43
Made it into j loop. Variable j= 1
Made it into k loop. Variable k= 1
Made it into k loop. Variable k= 2
Made it into k loop. Variable k= 3
Made it into k loop. Variable k= 4
Made it into k loop. Variable k= 5
Made it into j loop. Variable j= 2
Made it into k loop. Variable k= 6
Made it into k loop. Variable k= 7
Made it into k loop. Variable k= 8
Made it into k loop. Variable k= 9
...
Made it into j loop. Variable j= 43
Made it into k loop. Variable k= 339
Reducing matrix-vector product.
Reduction complete.
Made it into i loop. Variable i= 2
Zeroed working array.
Reducing matrix-vector product.
Reduction complete.
Made it into i loop. Variable i= 3
Zeroed working array.
0 -860295351 56
Made it into j loop. Variable j= -860295351
forrtl: severe (174): SIGSEGV, segmentation fault occurred[/plain]
I just read the section on do constructs in the Fortran 95 spec. I'm not doing anything that is nonstandard. The iteration count can be calculated during runtime prior to entry into the loop block. I even modified the code by using
[fortran]do i=1,omp_get_num_threads()[/fortran]
which produced identical results.
But disturbingly, I just logged into the server again and this output occurred:
[plain] Time to load hamiltonian into memory: 2.4999999E-03
Chunk:
0
43
Size of Chunk: 2
Time to complete Lanczos iteration in ARPACK: 6.0000003E-04
Dimension of prodPart: 43
Dimension of workd: 129
Pointer to input vector first element: 87
Pointer to output vector first element: 44
Bounds on loop var i: 1 1
Bounds on 1 th loop var j: 1 43
Bounds on 1 th loop var k: 1 339
Loop var analysis complete.
Made it into i loop. Variable i= 1
Zeroed working array.
0 1 43
Made it into j loop. Variable j= 1
Made it into k loop. Variable k= 1
Made it into k loop. Variable k= 2
Made it into k loop. Variable k= 3
Made it into k loop. Variable k= 4
Made it into k loop. Variable k= 5
Made it into j loop. Variable j= 2
Made it into k loop. Variable k= 6
Made it into k loop. Variable k= 7
Made it into k loop. Variable k= 8
Made it into k loop. Variable k= 9
Made it into k loop. Variable k= 10
...
Made it into j loop. Variable j= 43
Made it into k loop. Variable k= 339
Reducing matrix-vector product.
Reduction complete.
Made it into i loop. Variable i= 2
Zeroed working array.
0 44 146337608
Made it into j loop. Variable j= 44
Made it into j loop. Variable j= 45
Made it into j loop. Variable j= 46
Made it into k loop. Variable k= 0
Made it into k loop. Variable k= 1
Made it into k loop. Variable k= 2
Made it into k loop. Variable k= 3
Made it into k loop. Variable k= 4
Made it into k loop. Variable k= 5
Made it into k loop. Variable k= 6
Made it into k loop. Variable k= 7
Made it into k loop. Variable k= 8
Made it into k loop. Variable k= 9
Made it into k loop. Variable k= 10
...
Made it into k loop. Variable k= 343
Made it into k loop. Variable k= 344
Made it into k loop. Variable k= 345
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
diagonalize_8 0000000000411360 Unknown Unknown Unknown
libiomp5.so 00002B5ABEFCDFE3 Unknown Unknown Unknown[/plain]
I have no idea why the behavior would vary like this, but at least the continuation of the loop (i=2) is consistent with the previous output.
Jonathan
It turns out that the difference in behavior is because I was testing on a different login node. To the best of my knowledge, the environment is uniform across all nodes in the system, but I'm verifying that with the sysadmins. Still, $OMP_NUM_THREADS is set to 1 for all login nodes, so my executable should not vary due to system load, right?
Anyone know of conditions in the OpenMP library that would cause an executable using only one thread to vary between two systems with identical hardware and software?
Thanks,
Jonathan
In your printout it lists:
Bounds on loop var i: 1 1
Your do i=1,omp_get_num_threads() is producing an i larger than the upper bound of the i index in your array.
Therefore you will need, immediately following the do i=1,omp_get_num_threads(), a statement like
if(i .gt. ubound(...)) exit
Where ... is replaced by the proper reference to obtain the bounds of the array indexed by i (the same way as you obtained the bounds for the above report).
If your i bounds will be small in your production version (IOW smaller than the thread count), then consider moving the parallelization inwards.
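A rough sketch of what moving the parallelization inwards could look like, with the worksharing loop placed on the rows instead of the chunks. The CSR arrays rowPtr, colIdx and val and the test data are assumptions for illustration, and the sketch assumes the full matrix is stored, so each row writes only its own y(j) and no critical section is needed. With Hermitian half-storage a reduction would still be required.

```fortran
program row_parallel_spmv
  implicit none
  integer, parameter :: n = 4, nnz = 6
  ! same hypothetical 4x4 CSR test matrix:
  !   [2 0 0 1]
  !   [0 3 0 0]
  !   [0 0 4 0]
  !   [1 0 0 5]
  integer :: rowPtr(n+1) = [1, 3, 4, 5, 7]
  integer :: colIdx(nnz) = [1, 4, 2, 3, 1, 4]
  real    :: val(nnz)    = [2., 1., 3., 4., 1., 5.]
  real    :: x(n)        = [1., 1., 1., 1.]
  real    :: y(n), acc
  integer :: j, k

  ! the OpenMP runtime divides the rows among the threads
  !$omp parallel do schedule(static) private(k, acc)
  do j = 1, n
     acc = 0.0
     do k = rowPtr(j), rowPtr(j+1) - 1
        acc = acc + val(k) * x(colIdx(k))
     end do
     y(j) = acc        ! each row is written by exactly one thread
  end do
  !$omp end parallel do

  print *, y
  if (any(abs(y - [3., 3., 4., 6.]) > 1.0e-6)) error stop 'wrong product'
end program row_parallel_spmv
```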
Jim Dempsey