DO loop bug?

dajum · ‎03-22-2009

The following line causes lots of problems when using 11.0.066 (64 bit version). Using 11.0.061 it seems to work fine. It seems like it should generate an error in all versions but doesn't.

DO 10, JTEST=1001,1872

It seems like the comma should not be there after the 10. I have a program that corrupts a lot of memory when the loop is processed with the comma in place. Adding write statements, or changing the code in almost any way, changes the problem, and most of the time nothing incorrect is generated once I modify my original code. Elements computed in the loop get nonsense numbers stored in their locations. Is this bad input? Why doesn't it generate any warnings or errors? Run with 9.0 or 11.0.061, the code runs fine. Run in the debugger, it also runs fine. It seems that only the optimized code runs incorrectly. So my question really is: is the code with a comma wrong?

Dave

Les_Neilson · ‎03-23-2009

A quick look at the help shows :

[cpp]A DO construct takes one of the following forms:
Syntax
Block Form:

[name:] DO [label [, ] ] [loop-control]
     block
[label] term-stmt
[/cpp]

So the comma after the label is optional and therefore permissible; your code, as shown, is correct.
If you can provide a small example that shows the problem, you should report it to Premier Support.

Les
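
To illustrate the syntax quoted from the help, here is a minimal sketch (not taken from Dave's program) in which the same labeled DO loop is written with and without the optional comma. The program name and the counter variables are made up for the example; only the loop bounds are borrowed from the original post.

```fortran
program do_comma_demo
  implicit none
  integer :: j, with_comma, without_comma

  ! Labeled DO with the optional comma after the label
  with_comma = 0
  do 10, j = 1001, 1872
    with_comma = with_comma + 1
10 continue

  ! The same loop without the comma
  without_comma = 0
  do 20 j = 1001, 1872
    without_comma = without_comma + 1
20 continue

  ! Both counters should hold the same trip count
  print *, with_comma, without_comma
end program do_comma_demo
```

Both loops run 872 times (1872 - 1001 + 1), so a conforming compiler should print the same count twice.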

Brian_Francis · ‎03-23-2009

Hi Dave,

Try compiling the file with and without the "," but with assembly output enabled (see the /asmattr compiler option). If the assembly output is the same, then you've most likely got a memory overwrite issue caused by the use of uninitialized data.

Good luck (sounds like you need some),

Brian.

gib · ‎03-23-2009

The first debugging step is to turn on run-time bounds checking.

dajum · ‎03-23-2009

Thanks for the suggestions. I did try the assembly output comparison, and it shows them being the same. I don't know the impact of turning on that output, but I suspect it may change optimization switches, since the problem goes away when the assembly is written out. I can't use bounds checking because this is built with a library. All the common blocks compiled in the library declare their arrays with a dimension of (1); the true size is set in the main routine, which is compiled and linked with the library. So just about every array reference is seen as out of bounds. It would be nice if the error checking were advanced enough to realize that this feature of the language is being used, but so far I haven't seen any way to get it to use the linked sizes of common blocks that store an array.
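
For readers unfamiliar with this pattern, here is a minimal sketch of it. The block name /work/, the array name, and the routine fill are hypothetical; only the idea of a placeholder dimension of (1) on the library side and the real size in the main program comes from the description above. Strictly speaking, the Fortran standard expects a named common block to have the same size in every program unit, although many toolchains accept the mismatch, possibly with a warning.

```fortran
program main_sizes_common
  implicit none
  real :: xk
  common /work/ xk(100)        ! true size is set by the main program
  call fill(100)
  print *, xk(100)
end program main_sizes_common

! Stand-in for a routine compiled separately into the library
subroutine fill(n)
  implicit none
  integer, intent(in) :: n
  integer :: i
  real :: xk
  common /work/ xk(1)          ! placeholder dimension on the library side
  do i = 1, n
    xk(i) = real(i)            ! subscript exceeds the declared bound for i > 1
  end do
end subroutine fill
```

With run-time bounds checking enabled, the store to xk(i) would typically be reported as soon as i exceeds 1, which is why the check is of little use on code written this way.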

I also do not believe it is uninitialized data. The routine does compute the array elements, but the values stored are not the same as those computed in a subfunction called in that routine. I can write out the computed values and they are correct in the subfunction, but the value stored is corrupted.

I believe this is a compiler bug similar to one I found in 10.1 that was fixed in 11.0. I got around it by using the switch -mP2OPT_hlo_distribution=0. But that time I was able to submit a sample of the problem. This time I'm afraid I will not be able to submit it because of the proprietary nature of the code and data. Trying to reduce the code has made the problem go away as well. That puts me in a no-win situation, since it appears the bug may be related to the size of the source, but I'm just guessing.

Thanks,

Dave

gib · ‎03-23-2009

As you are probably aware, bugs caused by memory access violations, either through bounds being exceeded or through stack corruption, typically appear and disappear randomly when code changes are made.

Do you have the ability to recompile the library code?

dajum · ‎03-24-2009

Quoting - gib

As you are probably aware, bugs caused by memory access violations, either through bounds being exceeded or through stack corruption, typically appear and disappear randomly when code changes are made.

Do you have the ability to recompile the library code?

I am also aware that compiler bugs randomly appear and disappear when code changes are made. That was the case with the last bug I identified.

Yes, I can recompile the library, but that isn't necessary. I can write the value that is being stored in the array in the subfunction where it is computed. It is passed back as the function return value and stored. The subelements of the array are the loop index. The value stored isn't what is written out in the subfunction. But it writes into every element it should, just with the wrong number. It's a compiler bug. I just can't send it in with the entire program, and it doesn't seem to be there when I reduce the problem.

Dave

jimdempseyatthecove · ‎03-25-2009

Dave,

A couple of things come to mind; you will have to do some detective work, as I am not present at your site.

1) The data size or structure alignment/packing is different between your code's usage and what the library expects. If structures are involved, you may be required to use the SEQUENCE statement in your type declaration. Sizes would be an issue for the size of integer or real.

2) If you are not using the same calling convention, in particular who is responsible for stack cleanup, you may experience such problems.

For 2), this is NOT a fix, rather a diagnostic.

For all the arguments passed into the library that are declared in the caller's subroutine, temporarily add the SAVE attribute. If this makes the junked-up data appear OK, it would indicate that the calling convention is incorrect. Do NOT ASSUME that you fixed the code; you haven't. Remove the temporary SAVE attribute and fix the calling convention. Not fixing the calling convention will cause a harder-to-locate problem later on.
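
As a sketch of what the temporary change might look like (all names here are hypothetical; lib_routine merely stands in for a routine from the library, and it is compiled in the same file only to keep the example self-contained):

```fortran
program save_diagnostic
  implicit none
  call caller()
end program save_diagnostic

subroutine caller()
  implicit none
  ! TEMPORARY diagnostic: SAVE gives the variable static storage
  ! instead of a stack slot. Remove it again after the test.
  real, save :: value
  call lib_routine(value)
  print *, value
end subroutine caller

! Hypothetical stand-in for a library routine that fills its argument
subroutine lib_routine(x)
  implicit none
  real, intent(out) :: x
  x = 42.0
end subroutine lib_routine
```

If the garbage only disappears while SAVE is present, that points at the stack handling between caller and library rather than at the arithmetic.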

Jim Dempsey

gib · ‎03-25-2009

Quoting - dajum

I am also aware that compiler bugs randomly appear and disappear when code changes are made. That was the case with the last bug I identified.

Yes, I can recompile the library, but that isn't necessary. I can write the value that is being stored in the array in the subfunction where it is computed. It is passed back as the function return value and stored. The subelements of the array are the loop index. The value stored isn't what is written out in the subfunction. But it writes into every element it should, just with the wrong number. It's a compiler bug. I just can't send it in with the entire program, and it doesn't seem to be there when I reduce the problem.

Dave

If you are right, and it is a compiler bug, then nothing I say will be useful. But in my experience what I think are compiler bugs almost always turn out to be errors in the code (note I did say "almost").

I'm having trouble understanding your description of what is happening. Can you please provide some code (or pseudo-code) that explains how the function is called, how the array elements are defined, what values you are writing out, etc.?

Brian_Francis · ‎03-26-2009

Quoting - dajum

snip...

I also do not believe it is uninitialized data. The routine does compute the array elements, but the values stored are not the same as those computed in a subfunction called in that routine. I can write out the computed values and they are correct in the subfunction, but the value stored is corrupted.

snip...

Your mention of common blocks and corrupted values got me thinking...

You should check that you are not passing a variable that is a member of a common block to a function or subprogram that also includes that common block. You will end up with an unintended (or intended) equivalence that is bound to cause problems unless you declare the argument as VOLATILE.

I'm looking forward to hearing the solution on this issue.

Brian.

Edit: Jim makes a valid point below. You must declare both the argument and the common block (or a portion of it) VOLATILE. By the way, I don't condone this practice, but sometimes you're given a monolithic piece of code to maintain, and it's simpler to roll with it than to rewrite it.

jimdempseyatthecove · ‎03-27-2009

Under that circumstance, even if the variable(s) is(are) declared as VOLATILE the compiler will not know that they are aliases.

call foo(TEMP)
...
subroutine foo(A)
...
TEMP = something
(now use A)

Where TEMP is in a COMMON block or MODULE that is in scope of both caller and callee.
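
A self-contained version of the outline above might look like the following sketch. The module name shared_data and the values are invented for the example. The code is deliberately non-conforming: the variable is changed through its module name while it is also associated with the dummy argument A, so no particular result is guaranteed.

```fortran
module shared_data
  implicit none
  real :: temp = 1.0           ! visible to both caller and callee
end module shared_data

program alias_demo
  use shared_data
  implicit none
  call foo(temp)               ! actual argument is the module variable
end program alias_demo

subroutine foo(a)
  use shared_data
  implicit none
  real :: a
  temp = 2.0                   ! changes A's storage through another name
  ! The compiler may assume A is unchanged here, so an optimized build
  ! is allowed to use a stale copy of A.
  print *, a
end subroutine foo
```

Whether the old or the new value is printed can differ between optimization levels, which is the kind of behavior that appears and disappears as unrelated code changes.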

Jim Dempsey

dajum · ‎03-27-2009

I don't think any of the above problems apply. Here is the offending code:

      IF ( XK( 132) .EQ. 0.0000000 ) THEN
         SNAME = 'ENVMNT'
         CALL CRYTRN(SNAME, 200, ITEST)
         FNAME = TRIM(UCA(ITEST))
         SNAME = 'US_BODY'
         DO 10, JTEST = 1001,1872
            ETEST = GET_AREA(FNAME,SNAME,JTEST)
            CALL NUMTRN(SNAME, JTEST, ITEST)
            XK(ITEST) = ETEST
   10    CONTINUE
         XK( 132) = 1.0000000
      ENDIF
      CALL QVTIME('US_BODY ')
      CALL GVTIME('US_BODY ')
      CALL CVTIME('US_BODY ')
      RETURN
      END

This is the entire code in the routine. The variables are all declared and are all in common blocks. The subroutines and functions called do not contain the same common blocks. The arguments passed to the routines are not modified in any of the routines, with the exception of ITEST, which is computed in the routine NUMTRN. Writing out values in the called subroutines and functions shows that the function return values and the computed ITEST index are all correct. But the values stored in XK are not the same as the returned values, and in all cases are garbage: some very small (E-4) and others large (E26), whereas the computed values are about 400 to 500. Removing the comma after the label 10 in the DO statement makes the code run correctly. Adding a write statement in the loop also corrects the problem.

As I said, I had a similar error in version 10, where a DO loop was executed an extra 5-20 times, but only for certain values of the indexes. That bug was fixed in version 11.

Thanks,

Dave

gib · ‎03-28-2009

Have you tried modifying GET_AREA() or NUMTRN to make them do something trivial, or replacing the library calls with calls to local subprograms? These are the kinds of debugging steps I would take.
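
Such a test could be sketched as follows. The argument types, the local XK array, and the index mapping in the stub NUMTRN are assumptions made for the example, since the real declarations are not shown in the thread; the stubs just return values that are easy to recognize.

```fortran
program stub_check
  implicit none
  real :: xk(2000), etest      ! local stand-in for the COMMON array
  real, external :: get_area
  integer :: jtest, itest, nbad
  character(len=8)  :: sname
  character(len=32) :: fname

  xk = 0.0
  fname = 'dummy'
  sname = 'US_BODY'
  nbad = 0
  do 10, jtest = 1001, 1872
    etest = get_area(fname, sname, jtest)
    call numtrn(sname, jtest, itest)
    xk(itest) = etest
    ! The stub returns real(jtest), so anything else is a bad store
    if (xk(itest) /= real(jtest)) nbad = nbad + 1
10 continue
  print *, 'mismatches:', nbad
end program stub_check

! Trivial stub: returns a value that identifies the loop index
real function get_area(fname, sname, jtest)
  implicit none
  character(len=*), intent(in) :: fname, sname
  integer, intent(in) :: jtest
  get_area = real(jtest)
end function get_area

! Trivial stub: hypothetical mapping from loop index to array index
subroutine numtrn(sname, jtest, itest)
  implicit none
  character(len=*), intent(in) :: sname
  integer, intent(in)  :: jtest
  integer, intent(out) :: itest
  itest = jtest - 800          ! 1001..1872 maps to 201..1072
end subroutine numtrn
```

If the stubbed version reports zero mismatches under the same compiler options that produce the corruption, attention shifts to the real library routines; if it still fails, the small program may itself be the reproducer that Premier Support needs.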

dajum · ‎04-01-2009

What I'm really puzzled about, and what nobody seems to have commented on, is what difference a comma in a DO loop should make to the code produced and its execution. It obviously produces different code when optimized, with and without the comma, but why? Nothing else changes, so why would it produce different code? I could track down all the other problems listed, but they don't happen. The compiler is doing something bad with the comma in place, but not when it is missing. Why should that matter? It shouldn't change stack dependencies or anything else.

Brian_Francis · ‎04-02-2009

Dave,

It's not that we don't believe you, but almost every time I've gotten to the eureka! moment with a problem such as yours, I've found it to be pilot error and not compiler error. Not that there aren't compiler errors (a lot of topics on this forum attest to that), but most of us can't do anything about those, so we've been pulling things out of our "been there, done that" box in the hope that it will help you.

Just be thankful you've got a simple solution: get rid of that pesky comma!

:)

Brian.

Kevin_D_Intel · ‎04-03-2009

For the code snippet provided earlier, with all required variables defined in COMMON, I see no assembly differences with or without the comma at optimization levels /O2 and /O3 using the 11.0 compiler. There are also no differences with or without the comma between 11.0.061 and 11.0.066. Maybe I did not match all the options you used?

As Les posted earlier, the comma is optional and the compiler appears to treat the DO construct and label accordingly with or without the comma.

Generating the assembly listing suppresses creation of an object file so generation of the assembly listing could not have led the program to work. It also does not suppress any optimizations. The generated asm is reflective of the optimized code. A previous object file must have existed and was used during your earlier testing.

gib · ‎04-03-2009

Quoting - Brian Francis

Dave,

It's not that we don't believe you, but almost every time I've gotten to the eureka! moment with a problem such as yours, I've found it to be pilot error and not compiler error. Not that there aren't compiler errors (a lot of topics on this forum attest to that), but most of us can't do anything about those, so we've been pulling things out of our "been there, done that" box in the hope that it will help you.

Just be thankful you've got a simple solution: get rid of that pesky comma!

:)

Brian.

Getting rid of the comma would solve the immediate problem, but might leave a hidden cause lurking, waiting to leap out and bite someone when they least expect it.

Brian_Francis · ‎04-03-2009

Quoting - gib

Quoting - Brian Francis

snip...

Just be thankful you've got a simple solution: get rid of that pesky comma!

Getting rid of the comma would solve the immediate problem, but might leave a hidden cause lurking, waiting to leap out and bite someone when they least expect it.

Good point. What would you have Dave do?

gib · ‎04-03-2009

Quoting - Brian Francis

Good point. What would you have Dave do?

I wouldn't presume to tell him what to do. In his situation I'd pursue all debugging options (he probably feels that he has done this) and try to provide buggy sample code to the Intel team.