Solved: Loop variable optimized away

Jonathan_B_ · ‎11-27-2013

I'm experiencing a rather odd circumstance and I'm looking for any advice on how to diagnose it or fix it. I'm implementing a sparse matrix solver, and I'm dividing up a matrix-vector product over a team of OpenMP threads using a do loop with static scheduling and balanced chunks of my matrix.

The problem is, my loop variable for the OpenMP do loop is getting optimized away when optimizations are turned on (-O1, -O2, -O3) and the loop is being run more times than intended.

In my debugging environment, I can only work with one thread ($OMP_NUM_THREADS=1 by admin), so this "loop" should behave like serial code. However, my debug messages indicate that my loop variable is going beyond 1, and idbc reports the following when I'm inside the loop:

(idb) print i
Info: symbol i is defined but not allocated (optimized away)
Error: no value for symbol i
Cannot evaluate 'i'.

How should I go about figuring out what ifort has done in this optimization? Superficially, this acts like a bug, but I'm uncomfortable making that assertion without seeing exactly what the optimizations have done.

Thanks,
Jonathan

jimdempseyatthecove · ‎12-07-2013

Consider (inside parallel region)

(I is private)

[fortran]
do I = omp_get_thread_num() + 1, yourUpper, omp_get_num_threads()
...
end do
[/fortran]
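
A minimal, self-contained sketch of this manual distribution, assuming a hypothetical upper bound nChunks and a hypothetical owner array that only records which thread handled each iteration (neither name comes from the original code). Each thread starts at its own thread number plus one and strides by the team size, so iterations are dealt out round-robin without a worksharing directive:

[fortran]
program manual_distribution
  use omp_lib
  implicit none
  integer, parameter :: nChunks = 8      ! hypothetical upper bound
  integer :: owner(nChunks)              ! hypothetical: records the handling thread
  integer :: i

  owner = -1
  !$omp parallel private(i) shared(owner)
  ! Thread t handles iterations t+1, t+1+nThreads, t+1+2*nThreads, ...
  do i = omp_get_thread_num() + 1, nChunks, omp_get_num_threads()
     owner(i) = omp_get_thread_num()
  end do
  !$omp end parallel

  do i = 1, nChunks
     write(*,*) 'iteration', i, 'handled by thread', owner(i)
  end do
end program manual_distribution
[/fortran]

With a single thread the loop degenerates to a plain do i = 1, nChunks, which makes it easy to compare against the serial behavior.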

Jim Dempsey

jimdempseyatthecove · ‎11-27-2013

When debugging with one thread, add a variable

integer, volatile :: iCopy

Then inside your loop add

iCopy = i

If you are debugging with multiple threads, then iCopy can be an array (dimensioned from 0, because thread numbers start at 0), then

iCopy(omp_get_thread_num()) = i

You can use a private variable to keep a copy of omp_get_thread_num()
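
As a rough sketch of the whole idea, assuming a hypothetical loop bound n and a hypothetical private variable me (neither is from the original code), the volatile copy could be set up like this. The array is indexed by thread number, which starts at 0:

[fortran]
program volatile_copy
  use omp_lib
  implicit none
  integer, parameter :: n = 4                 ! hypothetical loop bound
  integer, allocatable, volatile :: iCopy(:)  ! one slot per thread, indexed from 0
  integer :: i, me

  allocate(iCopy(0:omp_get_max_threads() - 1))
  iCopy = 0

  !$omp parallel private(me)
  me = omp_get_thread_num()                   ! private copy of the thread number
  !$omp do schedule(static)
  do i = 1, n
     iCopy(me) = i     ! inspect iCopy(me) in the debugger instead of i
  end do
  !$omp end do
  !$omp end parallel

  write(*,*) iCopy
end program volatile_copy
[/fortran]

Because iCopy is volatile, the compiler has to perform each store, so the debugger can read the current value even when i itself lives only in a register.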

Jim Dempsey

Jonathan_B_ · ‎12-02-2013

Thanks Jim,

The volatile variable helped, and will likely be very useful in the future. What it has revealed is that the optimized instructions ifort created do not check my loop bounds. Debugging statements printing off the value of my loop variable have been modified to print the expected value; the actual value (as reported by the volatile variable) does not change.

Is there a minimal expected number of loop iterations built into -O1, -O2, and -O3 which would remove the first loop bounds check?

Thanks,
Jonathan

jimdempseyatthecove · ‎12-02-2013

Fortran does not perform loops the same way that C/C++ performs loops.

In Fortran, the DO iteration space is examined at entry to DO to produce an iteration count (i.e. it becomes do the loop N times). From that point on the loop control variable might be a) not used, b) registerized, c) advanced by the unrolled count in the event of unrolling. At loop termination, the value of the loop control variable is the last one used (not next), or the initial setting should the loop not iterate.

If you manage to insert a break point into an optimized loop, the debugger should tell you the loop control variable is not available due to being registerized. If you are not seeing this message then either the debugger figured out the registerization or may be showing you the out of synch value of the non-registerized loop control variable.

Also note, modifying the loop control variable within the loop does not alter the iteration count.

And, inserting the iCopy=i in the loop may interfere with unrolling.
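
A small sketch of the "iteration count is fixed at entry" rule, using hypothetical names n and trips. Changing the variable that supplied the upper bound is legal inside the loop (unlike changing the loop control variable itself), but it has no effect on how many times the loop runs:

[fortran]
program trip_count
  implicit none
  integer :: i, n, trips

  n = 3
  trips = 0
  do i = 1, n
     n = 10              ! legal, but the iteration count was computed at entry
     trips = trips + 1
  end do
  write(*,*) 'iterations executed:', trips   ! 3, not 10
end program trip_count
[/fortran]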

Jim Dempsey

TimP · ‎12-02-2013

In practice, the situation with C for loops is complicated enough that I've never seen it fully described. If you don't conform with optimizable patterns set by individual compilers, performance will suffer, or, with OpenMP, you don't get parallelization.

As Jim says, in Fortran, the number of iterations (for the case without EXIT or equivalent) is determined before entering the loop. Modifying the loop counter inside the loop is an error (since 1977). Some compilers may permit it in certain contexts (such as an EXIT block), as an extension. In C, with optimization, pre-calculated loop count may also be the case, but due to the standard not requiring it, there are more cases to deal with.

Contrary to what Jim said, a DO loop which terminates normally (not by EXIT...) will set the loop index variable to the next value, analogous to what you expect in C. Parallelization introduces possibilities in both Fortran and C for behavior to change; I've caught myself ignoring this problem.

jimdempseyatthecove · ‎12-02-2013

TimP,

From my: C:\Program Files (x86)\Intel\Composer XE 2011 SP1\Documentation\en_US\compiler_f\cl\index.htm

After termination, the DO variable retains its last value (the one it had when the iteration count was tested and found to be zero).

Is the document wrong?

Apparently so (there may be a compiler switch to alter this behavior)

[fortran]
DO I=1,3
      WRITE(*,*) I
ENDDO
WRITE(*,*) I
! Output:
!           1
!           2
!           3
!           4
[/fortran]

Jim Dempsey

mecej4 · ‎12-02-2013

That "last value" is the value that the loop index variable had when the loop count was tested and found to be zero, and not the last value with which the body of the loop was actually executed.
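
To make that concrete, here is a short sketch with hypothetical bounds m1, m2, m3. The standard's iteration count is MAX(INT((m2 - m1 + m3) / m3), 0), and the DO variable is incremented once per completed iteration, so after normal termination it holds m1 plus the count times m3:

[fortran]
program do_terminal_value
  implicit none
  integer :: i, m1, m2, m3, nIter

  m1 = 1; m2 = 10; m3 = 3
  nIter = max((m2 - m1 + m3) / m3, 0)   ! (10 - 1 + 3) / 3 = 4 iterations: i = 1, 4, 7, 10
  do i = m1, m2, m3
  end do
  write(*,*) 'iteration count:', nIter, ' i after loop:', i   ! i is 13

  ! Zero-trip loop: the count is 0, so i keeps its initial value.
  do i = 5, 1
  end do
  write(*,*) 'zero-trip loop, i after loop:', i               ! i is 5
end program do_terminal_value
[/fortran]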

jimdempseyatthecove · ‎12-03-2013

The ambiguity is:

Is the test made at the top or bottom of the loop? Or initial top, then subsequently at the bottom?
Is the loop control variable stride-stepped at the top (after initial test), or at the bottom before test?

Regardless, the document should be clear on what happens (and what happens should be consistent with standards when it addresses the issue).

Jim

TimP · ‎12-03-2013

The Fortran standard is clearer on this point than the ifort document. At the time when the (f77) standard was adopted, compilers varied in whether they tested at the top or bottom, or even switched with optimization level. A compiler I used had 3 different treatments as side effects of other options, one of which conformed with f77.

It took quite a while for some compilers to comply with this, and I still don't count on it for cases involving parallelism, e.g. where the loop induction variable might need to be firstprivate or lastprivate (which aren't allowed).

Jonathan_B_ · ‎12-03-2013

Thank you all for helping to clarify the loop test. There are a few points I need to make in reference to the past posts.

jimdempseyatthecove wrote:
If you manage to insert a break point into an optimized loop, the debugger should tell you the loop control variable is not available due to being registerized. If you are not seeing this message then either the debugger figured out the registerization or may be showing you the out of synch value of the non-registerized loop control variable.

The variable is defined, but not allocated. I interpret this to mean that the variable is not being used at all. Copying the value to a volatile variable proved problematic, as the behavior of the program changed (-O0 started experiencing errors later in the code).

My problem is that it seems the loop count is not effectively being tested after the first iteration - it would not matter if the test was at the beginning or end of the code segment. The iteration count is 1 for this test case, and it is continuing to iterations 2 and 3 before causing a segmentation fault when optimization levels -O1, -O2, or -O3 are used. Using -O0 results in functional (though obviously slower) code.

TimP (Intel) wrote:
As Jim says, in Fortran, the number of iterations (for the case without EXIT or equivalent) is determined before entering the loop.

No EXIT or equivalent is present. Since the number of iterations is dependent on run-time conditions (number of OpenMP threads available), this cannot be pre-calculated by the compiler. Is there a way to access the --actual-- iteration count and current iteration used in the assembly compare instruction? Either the number of iterations is not being calculated correctly at run-time, the current iteration is not correct, or the test is not being performed after completing all instructions corresponding to the code block of the loop (necessary regardless of the position of the test in assembly instructions). I suspect those are the three most likely reasons that the bounds of my do loop are not being respected.

Does this sound reasonable? If so, any suggestions for hunting down the cause? I very much suspect this will lead to a bug report, but since my program has dependencies of MKL, Intel Lapack and BLAS, and a separate sparse matrix solving library, I'd like to get all my ducks in a row to explain the issue. Otherwise, the appropriate development team would not have much to work on.

Thanks,
Jonathan

Jonathan_B_ · ‎12-03-2013

*Correction: -O1 is not behaving as nicely as -O2 and -O3 are.

Segmentation fault upon entering a single region:

Program received signal SIGSEGV
__kmp_enter_single () in /opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64/libiomp5.so
(idb) backtrace
#0 0x00002af9cc38d71a in __kmp_enter_single () in /opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64/libiomp5.so
#1 0x00002af9cc370e16 in __kmpc_single () in /opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64/libiomp5.so

I thought this had been cleaned up by recompiling my sparse matrix solver library, but it turns out I was wrong. Still, I seriously doubt this issue is related to the original one, so I'll start a thread on it later if it still annoys me. The target optimization for the final program is -O3.

So, let's restrict optimizations considered for this to -O2 and -O3, which experience the same symptoms.

Jonathan

TimP · ‎12-03-2013

If you intend to run a DO loop over num_threads, would you not use something like

[fortran]
use omp_lib
!$omp parallel private(nt)
nt=omp_get_num_threads()
!$omp do
do i = 1,nt
....
end do
!$omp end parallel
[/fortran]

Nearly everything in OpenMP depends on loop counts being calculated before entering the loop. So there's no reason here to rebel against the Fortran standard. In fact, if you have an OpenMP loop in which later iterations may have nothing to do, you typically need to let them spin without exit, e.g.

if(nomorework)cycle
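
A minimal sketch of that pattern, assuming a hypothetical work count nWork that may be smaller than the team size and a hypothetical counter processed. Iterations with nothing to do skip their body with cycle, since branching out of an OpenMP do loop (for example with EXIT) is not permitted:

[fortran]
program cycle_in_omp_do
  use omp_lib
  implicit none
  integer, parameter :: nWork = 2   ! hypothetical: possibly fewer work items than threads
  integer :: i, nt, processed

  processed = 0
  !$omp parallel private(nt) shared(processed)
  nt = omp_get_num_threads()
  !$omp do
  do i = 1, nt
     if (i > nWork) cycle           ! nothing to do for this iteration
     !$omp atomic
     processed = processed + 1
  end do
  !$omp end do
  !$omp end parallel

  write(*,*) 'items processed:', processed   ! min(team size, nWork)
end program cycle_in_omp_do
[/fortran]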

Jonathan_B_ · ‎12-03-2013

This is an abbreviation of what I'm working with:

[fortran]integer, dimension(:), allocatable :: chunk
...
!$omp parallel shared(chunk)
!$omp master
allocate(chunk(omp_get_num_threads()+1))
...
! find appropriate divisions of sparse
! matrix for omp_get_num_threads()
! pieces
...
!$omp end master
!$omp end parallel

...

!$omp parallel shared(chunk, ...)
!$omp          private(i,j,k,...)
! allocate local result array
...
!$omp do schedule(static)
do i=1,size(chunk)-1
    do j=1,...
      do k=...
        localResult(j) = localResult(j) + Nonzero(k) * InputVector(k)
      end do
    end do

    !$omp critical
      do j=1,MatrixSize
        FinalResult(j) = FinalResult(j) + localResult(j)
      end do
    !$omp end critical
...
end do
!$omp end do
!$omp end parallel

deallocate(chunk)[/fortran]

If I'm forced to write N copies where N is the maximum number of threads per node I anticipate using, I can, but that's not really the cleanest code.

Jonathan_B_ · ‎12-03-2013

I should also add that this scheme has worked fine in the past - this error only popped up when I took advantage of Hermitian symmetry and added the reduction phase (critical section).

jimdempseyatthecove · ‎12-03-2013

Jonathan,

In the sketch code you showed, localResult would have to be private. This may be an omission in producing the code snip, but it could also be a coding oversight in your current code.

Also, the sketch code does not show how the thread unique "i"'s disambiguate the data. IOW is there some code between "do i" and "do j" that selects the stripe of data unique to each thread?

Jim Dempsey

Jonathan_B_ · ‎12-03-2013

Hi Jim,

You are correct, localResult must be private, and it is in my actual code; I paraphrased too quickly. And despite the fact that I edited the code example, the j loop edit did not make it into the post.

[fortran]do j=chunk(i)+1,chunk(i+1)[/fortran] are the bounds on the j loop. The chunk array is what disambiguates the data. Also, MatrixSize is a global parameter, Nonzero is a global allocated array (both of which are encapsulated in a module), and finalResult is shared.
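
For readers following along, here is a self-contained sketch of a chunked symmetric matrix-vector product. It is not the code from my program: it assumes a small real symmetric test matrix stored as its upper triangle in CSR form (hypothetical arrays rowPtr, colIdx, val), an even split of rows into chunks, and it swaps the private array plus critical section for an OpenMP reduction clause on the result vector:

[fortran]
program symmetric_matvec
  use omp_lib
  implicit none
  integer, parameter :: n = 4, nnz = 7
  ! Upper triangle of the tridiagonal matrix with 2 on the diagonal and 1 beside it
  integer :: rowPtr(n+1) = [1, 3, 5, 7, 8]
  integer :: colIdx(nnz) = [1, 2, 2, 3, 3, 4, 4]
  real    :: val(nnz)    = [2., 1., 2., 1., 2., 1., 2.]
  real    :: x(n), y(n)
  integer, allocatable :: chunk(:)
  integer :: nChunks, i, j, k

  x = 1.0
  y = 0.0
  nChunks = omp_get_max_threads()
  allocate(chunk(nChunks + 1))
  do i = 1, nChunks + 1
     chunk(i) = ((i - 1) * n) / nChunks   ! chunk(1) = 0, chunk(nChunks+1) = n
  end do

  ! Each thread accumulates into its own copy of y; the copies are summed at the end.
  !$omp parallel do schedule(static) private(j, k) reduction(+:y)
  do i = 1, nChunks
     do j = chunk(i) + 1, chunk(i + 1)
        do k = rowPtr(j), rowPtr(j + 1) - 1
           y(j) = y(j) + val(k) * x(colIdx(k))
           ! Mirror the off-diagonal entry into the transposed position
           if (colIdx(k) /= j) y(colIdx(k)) = y(colIdx(k)) + val(k) * x(j)
        end do
     end do
  end do
  !$omp end parallel do

  write(*,*) y   ! 3, 4, 4, 3 for x = 1
end program symmetric_matvec
[/fortran]

The reduction is needed only because the mirrored update writes to rows outside the thread's own chunk; without symmetry each row is written by exactly one thread.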

Jonathan

jimdempseyatthecove · ‎12-04-2013

>>do j=chunk(i)+1,chunk(i+1)

Insert if(j.eq.chunk(i)+1) write(*,*) omp_get_thread_num(), chunk(i)+1,chunk(i+1)
(insert immediately following do j=...)

Just to verify non-overlapping chunks

Jim Dempsey

Jonathan_B_ · ‎12-04-2013

Relevant output:

[plain] Time to load hamiltonian into memory: 3.0000000E-03
Chunk:
           0
          43
Size of Chunk:           2
Time to complete Lanczos iteration: 6.0999999E-03
Dimension of localResult:          43
Dimension of workingArray:         129
Pointer to input vector first element:          87
Pointer to output vector first element:          44
Bounds on loop var i:           1           1
Bounds on            1 th loop var j:           1          43
Bounds on            1 th loop var k:           1         339
Loop var analysis complete.
Made it into i loop. Variable i=           1
Zeroed working array.
           0           1          43
Made it into j loop. Variable j=           1
Made it into k loop. Variable k=           1
Made it into k loop. Variable k=           2
Made it into k loop. Variable k=           3
Made it into k loop. Variable k=           4
Made it into k loop. Variable k=           5
Made it into j loop. Variable j=           2
Made it into k loop. Variable k=           6
Made it into k loop. Variable k=           7
Made it into k loop. Variable k=           8
Made it into k loop. Variable k=           9
...
Made it into j loop. Variable j=          43
Made it into k loop. Variable k=         339
Reducing matrix-vector product.
Reduction complete.
Made it into i loop. Variable i=           2
Zeroed working array.
Reducing matrix-vector product.
Reduction complete.
Made it into i loop. Variable i=           3
Zeroed working array.
           0 -860295351          56
Made it into j loop. Variable j= -860295351
forrtl: severe (174): SIGSEGV, segmentation fault occurred[/plain]

I just read the section on do constructs in the Fortran 95 spec. I'm not doing anything that is nonstandard. The iteration count can be calculated during runtime prior to entry into the loop block. I even modified the code by using

[fortran]do i=1,omp_get_num_threads()[/fortran]

which produced identical results.

Jonathan_B_ · ‎12-04-2013

But disturbingly, I just logged into the server again and this output occurred:

[plain] Time to load hamiltonian into memory: 2.4999999E-03
Chunk:
           0
          43
Size of Chunk:           2
Time to complete Lanczos iteration in ARPACK: 6.0000003E-04
Dimension of prodPart:          43
Dimension of workd:         129
Pointer to input vector first element:          87
Pointer to output vector first element:          44
Bounds on loop var i:           1           1
Bounds on            1 th loop var j:           1          43
Bounds on            1 th loop var k:           1         339
Loop var analysis complete.
Made it into i loop. Variable i=           1
Zeroed working array.
           0           1          43
Made it into j loop. Variable j=           1
Made it into k loop. Variable k=           1
Made it into k loop. Variable k=           2
Made it into k loop. Variable k=           3
Made it into k loop. Variable k=           4
Made it into k loop. Variable k=           5
Made it into j loop. Variable j=           2
Made it into k loop. Variable k=           6
Made it into k loop. Variable k=           7
Made it into k loop. Variable k=           8
Made it into k loop. Variable k=           9
Made it into k loop. Variable k=          10
...
Made it into j loop. Variable j=          43
Made it into k loop. Variable k=         339
Reducing matrix-vector product.
Reduction complete.
Made it into i loop. Variable i=           2
Zeroed working array.
           0          44   146337608
Made it into j loop. Variable j=          44
Made it into j loop. Variable j=          45
Made it into j loop. Variable j=          46
Made it into k loop. Variable k=           0
Made it into k loop. Variable k=           1
Made it into k loop. Variable k=           2
Made it into k loop. Variable k=           3
Made it into k loop. Variable k=           4
Made it into k loop. Variable k=           5
Made it into k loop. Variable k=           6
Made it into k loop. Variable k=           7
Made it into k loop. Variable k=           8
Made it into k loop. Variable k=           9
Made it into k loop. Variable k=          10
...
Made it into k loop. Variable k=         343
Made it into k loop. Variable k=         344
Made it into k loop. Variable k=         345
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
diagonalize_8      0000000000411360 Unknown               Unknown Unknown
libiomp5.so        00002B5ABEFCDFE3 Unknown               Unknown Unknown[/plain]

I have no idea why the behavior would vary like this, but at least the continuation of the loop (i=2) is consistent with the previous output.

Jonathan

Jonathan_B_ · ‎12-06-2013

It turns out that the difference in behavior is because I was testing on a different login node. To the best of my knowledge, the environment is uniform across all nodes in the system, but I'm verifying that with the sysadmins. Still, $OMP_NUM_THREADS is set to 1 for all login nodes, so my executable should not vary due to system load, right?

Anyone know of conditions in the OpenMP library that would cause an executable using only one thread to vary between two systems with identical hardware and software?

Thanks,
Jonathan

jimdempseyatthecove · ‎12-06-2013

In your printout it lists:

Bounds on loop var i: 1 1

Your do i=1,omp_get_num_threads() is producing an i larger than the upper bound of the i index in your array.

Therefore you will require, immediately following the do i=1,omp_get_num_threads(), a statement like

if(i .gt. ubound(...)) cycle

Where ... is replaced by the proper reference to obtain the bounds of the array indexed by i (same way as you obtained bounds for above report). Use cycle rather than exit here, because branching out of an OpenMP do loop is not permitted.

If your i bounds will be small in your production version (IOW smaller than thread count), then consider moving the parallelization inwards.
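
A rough sketch of such a guard, assuming the two-element chunk array from the printout (0 and 43) and hypothetical names nIter and visited. It skips any iteration whose i+1 would index past the end of chunk, and it uses cycle because an OpenMP do loop may not be left early:

[fortran]
program guard_chunk_index
  use omp_lib
  implicit none
  integer :: chunk(2) = [0, 43]   ! boundaries as listed in the printout
  integer :: i, nIter, visited

  nIter = omp_get_max_threads()   ! hypothetical: loop over the maximum team size
  visited = 0
  !$omp parallel do reduction(+:visited)
  do i = 1, nIter
     if (i + 1 > ubound(chunk, 1)) cycle   ! no chunk(i+1): nothing to do
     visited = visited + (chunk(i + 1) - chunk(i))
  end do
  !$omp end parallel do

  write(*,*) 'rows visited:', visited      ! 43 regardless of thread count
end program guard_chunk_index
[/fortran]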

Jim Dempsey