writing all bits to output file?

Ralph_Nelson · ‎04-10-2012

I'm working with a nonlinear problem, and a restart doesn't yield the same answer as an uninterrupted run. This can be typical of nonlinear problems, so I wanted to check how to write restart info to an output file to ensure I'm getting all the bits at the stopping point.

I've tried
write(unit, *) variable

Is that correct?

Is there a reasonable way to do a memory dump at the ending time step and restart from that?

If that is the correct way, then I have another problem.

Thanks.

mecej4 · ‎04-10-2012

If you want to restart with all bits intact, a better approach would be to use UNFORMATTED files, i.e.,

READ/WRITE (UNIT=nn)I/O_list

If you use formatted I/O, conversion between IEEE-binary and decimal is not fully reversible.
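A minimal sketch of the unformatted round trip described above. The file name `restart_demo.bin` and the variable names are assumptions made for illustration; the program writes one double-precision value, reads it back, and compares the raw bit patterns rather than the numeric values.

```fortran
program unformatted_roundtrip
   use, intrinsic :: iso_fortran_env, only: dp => real64, int64
   implicit none
   real(dp) :: x, y
   integer  :: lun

   x = 4.0_dp*atan(1.0_dp)      ! a value with no short decimal representation

   ! Unformatted: FORM='UNFORMATTED' on OPEN, no format specifier on WRITE/READ.
   open(newunit=lun, file='restart_demo.bin', form='unformatted', &
        status='replace', action='write')
   write(lun) x
   close(lun)

   open(newunit=lun, file='restart_demo.bin', form='unformatted', &
        status='old', action='read')
   read(lun) y
   close(lun, status='delete')  ! remove the scratch file used by this demo

   ! Compare the stored bit patterns, not just the values.
   if (transfer(x, 0_int64) == transfer(y, 0_int64)) then
      print '(a)', 'bit-for-bit identical'
   else
      print '(a)', 'bits differ'
   end if
end program unformatted_roundtrip
```

Because no decimal conversion takes place, the value read back should carry exactly the bits that were written, provided the file is read by a program built with a compatible compiler and record layout.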

TimP · ‎04-10-2012

As mecej4 said, unformatted files are a likely solution. If you use formatted I/O, you must specify a format wide enough, such as g15.9 for single or g25.17 for double, and this will be even slower than free (list-directed) format.
You will likely want to set buffered_io if you don't need to see the data before the buffer flushes.
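The effect of format width can be sketched as follows. An internal file (the character variable `buf`) stands in for a formatted restart file, and the `f10.4` format is an assumed example of a format that is too narrow. The sketch assumes IEEE double precision and correctly rounded decimal conversion, under which 17 significant digits are enough to recover the original value.

```fortran
program formatted_roundtrip
   use, intrinsic :: iso_fortran_env, only: dp => real64
   implicit none
   real(dp) :: x, y_short, y_wide
   character(len=32) :: buf     ! internal file standing in for a formatted file

   x = 4.0_dp*atan(1.0_dp)

   ! Too few digits: the value read back is only an approximation of x.
   write(buf, '(f10.4)') x
   read(buf, *) y_short

   ! 17 significant digits, as suggested for double precision.
   write(buf, '(g25.17)') x
   read(buf, *) y_wide

   print '(a,l1)', 'f10.4  round-trips: ', y_short == x
   print '(a,l1)', 'g25.17 round-trips: ', y_wide == x
end program formatted_roundtrip
```

Even when the wide format round-trips, each value costs a binary-to-decimal conversion on output and the reverse on input, which is the speed penalty mentioned above.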

Ralph_Nelson · ‎04-10-2012

Thanks for the replies folks. How does READ/WRITE (UNIT=nn) I/O_list differ from READ/WRITE (UNIT=nn,*)I/O_list?

What about a complete memory dump, saving worrying about if you've saved all the needed info? An easy way to do that?

Paul_Curtis · ‎04-10-2012

"What about a complete memory dump, saving worrying about if you've saved all the needed info? An easy way to do that?"

Yes, the easiest way is to use the Win32 API functions (ie, ReadFile(), WriteFile()), which enable transfer of a defined number of bytes starting from a defined address (ie, the beginning of your array), and can be used for reading as well as writing so you can re-initialize your internal structures with the exact content as when they were saved. This is a direct memory transfer operation, completely unlinked to any Fortran variable definitions, and there is no hidden or implicit internal processing.

Ralph_Nelson · ‎04-10-2012

Thanks Paul.

I would think Win32 API functions would not be universally applicable to other OS configurations? I need a general capability if one exists.

IanH · ‎04-10-2012

Unformatted output (the file is opened with the FORM='UNFORMATTED' specifier in the OPEN statement, subsequent READ or WRITE statements do not have a format specifier) is as close as you can get using standard Fortran to a "memory dump" for a list of variables. Unformatted here means that the representation of the data in the file will not be formatted in a way that humans can easily comprehend - it is just whatever bits the compiler needs to store the values in the data.

The * in WRITE (unit,*) list is a special form of format specifier (hence it is only used with formatted IO). It means that the formatting of the items in list is determined by the type of the items in list (hence so called list-directed formatting), with the details of that formatting left up to the compiler. As it is formatted IO the representation in the file (or on the screen, or whatever) should be easy enough for a human to comprehend.

(Other forms of format specifier include the classic reference to the label of a format statement or a character expression that is equivalent to the content of a format statement.)

For your needs unformatted IO seems to me to be a very good fit.

Ralph_Nelson · ‎04-10-2012

Thanks for the clarification IanH. Looks like unformatted I/O it is.

JVanB · ‎04-10-2012

Yeah, but Win32 API functions are not required to perform this task. You can use C_LOC to get the address of your data, then C_F_POINTER to create a Fortran pointer that can access all the data, then stream I/O to spit out/suck up the data. The trouble is that you don't know for sure how all your data is laid out in memory.

[bash]
module mykinds
   implicit none
   integer, parameter :: dp = kind([DOUBLE PRECISION::])
end module mykinds

recursive subroutine dumpme(way,data)
   use mykinds
   use,intrinsic :: ISO_C_BINDING
   implicit none
   integer way
   real(dp) data
   real(dp), target :: item1(4)
   integer item2
   type(C_PTR) data_start
   integer(C_INT8_T), pointer :: data_image(:)
   integer lun
   character(*), parameter :: filename = 'data_dump.bin'

   if(way == 1) then
      item1(1) = data*4*atan(1.0_dp)
      item1(2) = sqrt(abs(data))
      item1(3) = transfer("Hello, w",data)
      item1(4) = transfer("orld. ",data)
      item2 = data**2
      data_start = C_LOC(item1)
      call C_F_POINTER(data_start,data_image,[4*8+4])
      open(file=filename,newunit=lun, &
         access='stream',status='replace')
      write(lun) data_image
      write(*,'(a)') 'Data written to '//filename
      write(*,'(a,g24.17)') 'item1(1) = ', item1(1)
      write(*,'(a,g24.17)') 'item1(2) = ', item1(2)
      write(*,'(17a)') 'item1(3:4) = ', &
         transfer(item1(3:4),['A'])
      write(*,'(a,i0)') 'item2 = ', item2
   else
      data_start = C_LOC(item1)
      call C_F_POINTER(data_start,data_image,[4*8+4])
      open(file=filename,newunit=lun, &
         access='stream',status='old')
      read(lun) data_image
      write(*,'(a)') 'Data read from '//filename
      write(*,'(a,g24.17)') 'item1(1) = ', item1(1)
      write(*,'(a,g24.17)') 'item1(2) = ', item1(2)
!      write(*,'(a)') 'item1(3:4) = '// &
!         transfer(item1(3:4),repeat('A',16))
      write(*,'(17a)') 'item1(3:4) = ', &
         transfer(item1(3:4),['A'])
      write(*,'(a,i0)') 'item2 = ', item2
   end if
end subroutine dumpme

program test
   use mykinds
   implicit none
   real(dp) data
   integer way

   write(*,'(a)',advance='no') 'Enter the REAL data:> '
   read(*,*) data
   write(*,'(a)',advance='no') 'Enter 1 to write, other to read:> '
   read(*,*) way
   call dumpme(way,data)
end program test
[/bash]
On the first run we write data to data_dump.bin.

[bash]
Enter the REAL data:> 4.789
Enter 1 to write, other to read:> 1
Data written to data_dump.bin
item1(1) = 15.045087218041518
item1(2) = 2.1883783950679097
item1(3:4) = Hello, world.
item2 = 22
[/bash]
On the second run we read it back out.

[bash]
Enter the REAL data:> 17.22
Enter 1 to write, other to read:> 0
Data read from data_dump.bin
item1(1) = 15.045087218041518
item1(2) = 2.1883783950679097
item1(3:4) = Hello, world.
item2 = 0
[/bash]
Notice that item2 was not recovered correctly because we weren't guaranteed that it followed item1 in memory. BTW, the above was with gfortran; you may get different results with ifort.

IanH · ‎04-10-2012

There are requirements on the arguments to C_F_POINTER that make this approach non-conforming (though I guess likely to still "work", given other platform assumptions). I'm not sure I see the point though - if you just want a stream of bytes then why not just write each item out using stream IO?
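The alternative of writing each item directly with stream I/O might look like the sketch below. The names `item1` and `item2` follow the earlier example, while the file name `stream_demo.bin` and the test values are assumptions. No assumption about how the two variables are laid out relative to each other in memory is needed, because each is transferred from its own storage.

```fortran
program stream_items
   use, intrinsic :: iso_fortran_env, only: dp => real64
   implicit none
   real(dp) :: item1(4), copy1(4)
   integer  :: item2, copy2, lun

   item1 = [1.5_dp, 2.25_dp, -3.0_dp, 4.0_dp*atan(1.0_dp)]
   item2 = 22

   ! Stream access: the file is a plain sequence of bytes, no record markers.
   open(newunit=lun, file='stream_demo.bin', access='stream', &
        form='unformatted', status='replace')
   write(lun) item1, item2
   close(lun)

   ! Read the items back in the same order they were written.
   open(newunit=lun, file='stream_demo.bin', access='stream', &
        form='unformatted', status='old')
   read(lun) copy1, copy2
   close(lun, status='delete')  ! remove the scratch file used by this demo

   print '(a,l1)', 'item1 recovered: ', all(copy1 == item1)
   print '(a,l1)', 'item2 recovered: ', copy2 == item2
end program stream_items
```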

(For the OP's situation I'd be tempted to stick with record based IO - as a partial check against the writing and reading procedures getting out of whack.

In other situations there might be merit in using the Windows API directly (to write contiguous chunks of data, this excludes things like array and pointer descriptors that won't have meaning across (or sometimes even within) invocations of a program) if you needed the additional control over the nature of the IO that the lower level API gives you. But in that case you are into platform specific territory anyway.)
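The "partial check" that record-based I/O offers can be sketched as follows. The module, procedure, and variable names (`checkpoint_io`, `save_ckpt`, `load_ckpt`, `ckpt_version`) are hypothetical and only illustrate the idea: a header record describes the layout, and `iostat` catches a reader that disagrees with the writer.

```fortran
module checkpoint_io
   use, intrinsic :: iso_fortran_env, only: dp => real64
   implicit none
   integer, parameter :: ckpt_version = 1   ! hypothetical layout version tag
contains
   subroutine save_ckpt(fname, step, t, u)
      character(*), intent(in) :: fname
      integer,      intent(in) :: step
      real(dp),     intent(in) :: t, u(:)
      integer :: lun
      open(newunit=lun, file=fname, form='unformatted', status='replace')
      write(lun) ckpt_version, size(u)   ! record 1: header
      write(lun) step, t                 ! record 2: scalars
      write(lun) u                       ! record 3: array
      close(lun)
   end subroutine save_ckpt

   subroutine load_ckpt(fname, step, t, u, ok)
      character(*), intent(in)  :: fname
      integer,      intent(out) :: step
      real(dp),     intent(out) :: t, u(:)
      logical,      intent(out) :: ok
      integer :: lun, ver, n, ios
      ok = .false.
      open(newunit=lun, file=fname, form='unformatted', status='old', iostat=ios)
      if (ios /= 0) return
      read(lun, iostat=ios) ver, n
      ! Only read the payload if the header matches what this reader expects.
      if (ios == 0 .and. ver == ckpt_version .and. n == size(u)) then
         read(lun, iostat=ios) step, t
         if (ios == 0) read(lun, iostat=ios) u
         ok = (ios == 0)
      end if
      close(lun)
   end subroutine load_ckpt
end module checkpoint_io

program ckpt_demo
   use checkpoint_io
   implicit none
   real(dp) :: u(5), v(5), w(4), t
   integer  :: step
   logical  :: ok

   call random_number(u)
   call save_ckpt('ckpt_demo.bin', 42, 1.25_dp, u)

   call load_ckpt('ckpt_demo.bin', step, t, v, ok)
   print '(a,l1)', 'matching layout accepted: ', ok .and. all(v == u)

   call load_ckpt('ckpt_demo.bin', step, t, w, ok)
   print '(a,l1)', 'mismatched size rejected: ', .not. ok
end program ckpt_demo
```

With sequential unformatted files, a READ that asks for more data than the record holds reports an error through `iostat`, so a writer and reader that drift apart tend to fail loudly instead of silently loading shifted data.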

Ralph_Nelson · ‎04-11-2012

"I'm not sure I see the point though"

The point is that with large codes with many variables, identifying every variable needed to restart is error-prone. If you forget a flag or some other quantity that the simulation depends on, the restart can be inaccurate. Tracing all these variables is difficult, particularly later in a code's lifecycle.

Thus a core dump you read back in to restart the simulation would be nice. But I realize I've not seen this general capability in a long time. I figured a general cross-platform capability didn't exist, but it never hurts to ask.

jimdempseyatthecove · ‎04-12-2012

>>Thus a core dump you read back in to restart the simulation would be nice.

This may not be sufficient since the application state may also be held in the operating system. In particular open files, event handles etc... These application states would not be restored upon reload of core image.

Additionally, the open file states (contents of files) would have to be restored as well for complete restoration.

Creating a program checkpoint facility (that correctly operates) is usually non-trivial. I know, as I have written several.

Jim Dempsey

Ralph_Nelson · ‎04-12-2012

Jim, I do appreciate your thoughts. The points you raise are certainly valid but for me are easier to determine than a large number of variables that might be needed. And yes, I've written or been involved with a number of codes that use restarts (checkpoints), which is why I was curious about the possibility.

With the use of unformatted dumps, writes and reads as suggested earlier, of the needed transient variables I've been able to do the exact restart needed.

Thanks to all!!

John_Campbell · ‎04-13-2012

I think you are now getting closer to the main point, as the difference between unformatted I/O, stream I/O or API routines is not the issue. You must save all the key in-memory database variables and arrays for a correct restart. This will include any non-linear history that needs to be retained for future response calculation. Your idea of a core dump appears to be a lazy approach, as you should plan for restart so that all the key data structures are being either re-calculated or restored from the backup file. If you are introducing a restart capability, you should have access to this information.
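One way to plan for restart is to gather the restart-critical quantities in a single derived type, so that one WRITE saves them all and a newly added variable is hard to forget. The sketch below uses hypothetical names (`sim_state`, `u_prev`, and so on) and fixed-size components; a type with allocatable or pointer components cannot be written as a whole in unformatted I/O without defined I/O procedures, so such components would need to be written individually.

```fortran
module sim_state_mod
   use, intrinsic :: iso_fortran_env, only: dp => real64
   implicit none
   integer, parameter :: n = 3            ! hypothetical problem size
   type :: sim_state
      integer  :: step      = 0
      logical  :: converged = .false.     ! flags are part of the state too
      real(dp) :: time      = 0.0_dp
      real(dp) :: u(n)      = 0.0_dp      ! current solution
      real(dp) :: u_prev(n) = 0.0_dp      ! non-linear history from the last step
   end type sim_state
end module sim_state_mod

program state_demo
   use sim_state_mod
   implicit none
   type(sim_state) :: s, r
   integer :: lun
   logical :: same

   s%step = 100
   s%converged = .true.
   s%time = 0.1_dp
   call random_number(s%u)
   call random_number(s%u_prev)

   ! The whole state goes out as a single unformatted record.
   open(newunit=lun, file='state_demo.bin', form='unformatted', status='replace')
   write(lun) s
   close(lun)

   open(newunit=lun, file='state_demo.bin', form='unformatted', status='old')
   read(lun) r
   close(lun, status='delete')            ! remove the scratch file used by this demo

   same = r%step == s%step .and. (r%converged .eqv. s%converged) &
          .and. r%time == s%time .and. all(r%u == s%u)           &
          .and. all(r%u_prev == s%u_prev)
   print '(a,l1)', 'state restored exactly: ', same
end program state_demo
```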

John

jimdempseyatthecove · ‎04-13-2012

I would like to add to John's comments:

In the simulation work I do (Space Elevators, tethered satellite systems, etc...), simulations can run a very long time (weeks). Checkpoints are made periodically with the understanding that the simulation will often crash. When it does crash, you (I) want to roll back the simulation state and then resume while looking at the conditions (in finer step detail) that led up to the crash. Additionally, should it be discovered that the cause was due to a repairable condition that could be installed either by code change or by state change, then the change can be applied and the run resumed at the checkpoint (with the changes in place) thus saving a week of run time. An example of this is altering the time or duration or impulse of a thruster application or, time or rate of tether deployment or retrieval (and many other dynamic control events). A properly designed checkpoint system can save you significant time.

Jim Dempsey

Ralph_Nelson · ‎04-13-2012

Jim and John,

I do greatly appreciate your comments.

One is always faced with the problem of posing a subsequent question in forums, so I haven't included some details related to my original question in this thread. Thus, there is no way you could know the following info since I've not stated it.

I'm trying to breathe life back into a code written in the early 1990s for Cray supercomputers by one of my post-docs. Reason being the subject interests me even though I'm retired.

Thus my chance to devise a well-constructed checkpoint is long past by a couple of decades. Add to that, one particular simulation displayed non-stationary (potentially chaotic) behavior and took 35 hrs of Cray time to run, thus restarts were a necessity. Yet another factor is that the Cray word length was different from that of today's machines (assuming double precision in the Intel compiler), so the bifurcations for period doubling are further changed. Changes due to integer and real word lengths I can accept, but those dealing with checkpoints lacking the complete variable state due to inadequate precision are not. All that led to the original question here.

And yes there were machines in the past where you could do core dumps and restarts for certain situations. That's a dead horse on today's machines so let's drop the subject.

One final note: it has been several years since I've coded so I'm near to starting over on the learning curve so some of my questions on this forum will reflect that. Not an excuse, just the way it is.

Again, I greatly appreciate your thoughts and suggestions. Hopefully I've not said anything in an offensive manner in this post. If so, I apologize.

-ralph nelson

TimP · ‎04-13-2012

Interest in automatic check-pointing (e.g. BLCR) is returning, as bigger and faster systems are making bigger jobs economically feasible while introducing more points of failure.