We have a strange I/O problem and are hoping someone here has information that will help us. Unfortunately, we haven't succeeded in making a small reproducer yet. We are hoping that someone has an idea of what the underlying problem is, which would help us build a small reproducer to test and debug with, and ideally to post to this forum for feedback and as a reference for others. We wonder whether we are encountering a compiler or system bug (and whether anyone else has experienced similar issues), but we don't feel comfortable concluding that until we can present a small reproducer.
Synopsis of problem: Data written does not match data that is read later. In one particular instance, the data read back was missing a contiguous chunk from the middle of what should have been written, and the length of the dropped segment did not appear to be a significant number (not a multiple of 512 or anything similar). In another case, the data was written to the wrong location. These I/O problems are rare (they only happen on a few test cases), but they are a serious issue because the program results must be reliable. Unfortunately, the affected program is very large and has proprietary elements, so it cannot be posted in full.
Some excerpts of the IO-related code are listed at the end, for those of you who want to jump there.
Operating systems: Red Hat and SUSE
Fortran compilers: Intel versions 10.1 and 11.1
Program characteristics that I suspect you experts care about:
1. Parallel program using MPI (MPICH 2 or SGI MPT, problem was recreated with both). Would this introduce additional complexity (multithreading, etc.) that the ifort compiler does not expect?
2. Each process has its own dedicated OOC file (so parallel access should not be an issue).
3. (Perhaps this is key to the problem, just a hunch.) Reads and writes occur from contained routines within recursively called subroutines.
4. It happens most often when the interior of the file is written to, but has also (very rarely) happened when the file is only being appended to. It happens much more often when data is written to the same location twice in rapid succession (which we did while trying to make a smaller reproducer).
5. We do not have this problem when we use C for the I/O instead. (We were hoping to stay with pure Fortran, but may have to use C too if we can't solve this.)
Other observations: Making small I/O-related changes alters how the bug manifests. For example, adding a flush after the write eliminated the problem in some test cases, but not in others. Changing BUFFERCOUNT in the file open() call, to either a smaller or a larger integer, can mask or unmask the problem.
=== IO-related code excerpts ===
Example of open
open(OOCUnit, FILE=OOCFileName, FORM='UNFORMATTED', &
ACCESS='STREAM', ACTION='READWRITE', STATUS='OLD', &
BUFFERED='YES', BUFFERCOUNT=BufferCount)
Example of write
write(OOCUnit, POS=OOCPos) ContiguousMData
Example of read
read(OOCUnit, POS=OOCPos) F%Matrix%MData
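As a starting point for a reproducer, the access pattern described above (append blocks, overwrite an interior block twice in rapid succession, then read everything back) can be sketched as a standalone program. This is only a sketch under assumptions: the unit number, file name, block size, and block count are hypothetical, it assumes a Fortran 2008 capable compiler (for storage_size and int64), and it uses standard Fortran only, so the Intel-specific BUFFERED= and BUFFERCOUNT= specifiers from the open above are left out and would need to be added back when hunting this particular problem.

```fortran
program ooc_roundtrip
  use iso_fortran_env, only: int64, file_storage_size
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: u = 20          ! hypothetical unit number
  integer, parameter :: n = 1000        ! values per block (hypothetical)
  integer, parameter :: nblocks = 8     ! blocks in the file (hypothetical)
  integer, parameter :: target_blk = 4  ! interior block to overwrite
  real(dp) :: wbuf(n), rbuf(n), expected
  integer(int64) :: blk_units, pos
  integer :: blk, ios, nbad

  ! Block length in file storage units, the unit that POS= counts in.
  blk_units = int(n, int64) * storage_size(wbuf) / file_storage_size

  open(u, file='ooc_test.bin', form='unformatted', access='stream', &
       action='readwrite', status='replace', iostat=ios)
  if (ios /= 0) stop 1

  ! Phase 1: append blocks; block b holds the value b everywhere.
  do blk = 1, nblocks
    wbuf = real(blk, dp)
    pos = 1 + (blk - 1) * blk_units
    write(u, pos=pos) wbuf
  end do

  ! Phase 2: overwrite one interior block twice in rapid succession.
  pos = 1 + (target_blk - 1) * blk_units
  wbuf = -1.0_dp
  write(u, pos=pos) wbuf
  wbuf = 42.0_dp
  write(u, pos=pos) wbuf

  ! Phase 3: read every block back and compare with what was written last.
  nbad = 0
  do blk = 1, nblocks
    pos = 1 + (blk - 1) * blk_units
    read(u, pos=pos) rbuf
    expected = real(blk, dp)
    if (blk == target_blk) expected = 42.0_dp
    if (any(rbuf /= expected)) then
      nbad = nbad + 1
      print *, 'mismatch in block', blk, 'at POS', pos
    end if
  end do

  close(u, status='delete')
  if (nbad /= 0) stop 2
  print *, 'all blocks read back as written'
end program ooc_roundtrip
```

Varying n, the number of overwrites, and the buffering options one at a time would show which of them, if any, the failure depends on.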
Parallel access within the same process may be an issue.
In addition to MPI, is each process itself parallel (multithreaded)? If so, you may need a critical section around the code that writes each data blob to the file (not just a critical section around the WRITE, since multiple WRITEs may be required to output each blob).
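A minimal sketch of that idea, assuming the threading is done with OpenMP (the thread does not say which threading model, if any, is in use). The routine name write_blob, the critical-section name ooc_file_io, the header/body blob layout, and the unit number are all hypothetical. The point it illustrates is that the critical section spans both WRITE statements that make up one blob, so another thread cannot reposition the unit between them.

```fortran
program blob_critical
  use iso_fortran_env, only: int64, file_storage_size
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: u = 21, nblob = 16, n = 256   ! hypothetical sizes
  integer :: i, hdr(2), ios
  real(dp) :: payload(n)
  integer(int64) :: blob_units
  logical :: ok

  ! Size of one blob (header + body) in file storage units.
  blob_units = (int(size(hdr), int64) * storage_size(hdr) + &
                int(n, int64) * storage_size(payload)) / file_storage_size

  open(u, file='blob_test.bin', form='unformatted', access='stream', &
       action='readwrite', status='replace', iostat=ios)
  if (ios /= 0) stop 1

  ! Preallocate the file serially so every later POS= lies inside it.
  hdr = 0
  payload = 0.0_dp
  do i = 1, nblob
    write(u) hdr, payload
  end do

  ! Threads write blobs in arbitrary order; each blob takes two WRITEs.
  !$omp parallel do private(hdr, payload)
  do i = 1, nblob
    hdr = [i, n]
    payload = real(i, dp)
    call write_blob(u, 1 + (i - 1) * blob_units, hdr, payload)
  end do
  !$omp end parallel do

  ! Serial read-back to confirm that no blob was interleaved with another.
  ok = .true.
  do i = 1, nblob
    read(u, pos=1 + (i - 1) * blob_units) hdr, payload
    if (hdr(1) /= i .or. hdr(2) /= n .or. any(payload /= real(i, dp))) ok = .false.
  end do
  close(u, status='delete')
  if (.not. ok) stop 2
  print *, 'all blobs intact'

contains

  subroutine write_blob(unit, pos, header, body)
    integer, intent(in) :: unit
    integer(int64), intent(in) :: pos
    integer, intent(in) :: header(:)
    real(dp), intent(in) :: body(:)

    ! One named critical section covers the whole blob, not each WRITE.
    !$omp critical (ooc_file_io)
    write(unit, pos=pos) header
    write(unit) body            ! continues directly after the header
    !$omp end critical (ooc_file_io)
  end subroutine write_blob

end program blob_critical
```

Compiled without OpenMP enabled, the directives are plain comments and the program runs serially, which gives a baseline to compare against.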
Jim Dempsey
>>
Example of write
write(OOCUnit, POS=OOCPos) ContiguousMData
Example of read
read(OOCUnit, POS=OOCPos) F%Matrix%MData
<<
Can you eliminate the POS=OOCPos?
If not, then insert diagnostic sanity checks to assert that OOCPos is correct,
i.e. is the sequence of OOCPos values on writes the same as on reads?
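One way such checks might look is sketched below. The module and routine names (ooc_checks, checked_write, checked_read, blob_span) are hypothetical, and the sketch assumes a compiler that accepts a 64-bit integer in INQUIRE(POS=), which Fortran 2008 permits. Each wrapper does the transfer at the requested position and then asks the runtime where the unit ended up; if the position did not advance by exactly the size of the data, it reports the values and stops.

```fortran
module ooc_checks
  use iso_fortran_env, only: int64, file_storage_size, error_unit
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains

  ! Number of file storage units (the POS= unit) occupied by buf.
  function blob_span(buf) result(units)
    real(dp), intent(in) :: buf(:)
    integer(int64) :: units
    units = size(buf, kind=int64) * storage_size(buf) / file_storage_size
  end function blob_span

  subroutine checked_write(unit, pos, buf)
    integer, intent(in) :: unit
    integer(int64), intent(in) :: pos
    real(dp), intent(in) :: buf(:)
    integer(int64) :: after
    if (pos < 1) call fail('write', pos, pos, 1_int64)
    write(unit, pos=pos) buf
    inquire(unit=unit, pos=after)
    if (after /= pos + blob_span(buf)) &
      call fail('write', pos, after, pos + blob_span(buf))
  end subroutine checked_write

  subroutine checked_read(unit, pos, buf)
    integer, intent(in) :: unit
    integer(int64), intent(in) :: pos
    real(dp), intent(out) :: buf(:)
    integer(int64) :: after
    if (pos < 1) call fail('read', pos, pos, 1_int64)
    read(unit, pos=pos) buf
    inquire(unit=unit, pos=after)
    if (after /= pos + blob_span(buf)) &
      call fail('read', pos, after, pos + blob_span(buf))
  end subroutine checked_read

  subroutine fail(op, pos, after, want)
    character(*), intent(in) :: op
    integer(int64), intent(in) :: pos, after, want
    write(error_unit, '(3a,i0,a,i0,a,i0)') 'bad position after ', op, &
      ': POS=', pos, ' now=', after, ' expected=', want
    stop 1
  end subroutine fail

end module ooc_checks

program demo_checks
  use ooc_checks
  implicit none
  integer, parameter :: u = 22      ! hypothetical unit number
  real(dp) :: a(100), b(100)

  a = 3.5_dp
  open(u, file='ooc_check.bin', form='unformatted', access='stream', &
       action='readwrite', status='replace')
  call checked_write(u, 1_int64, a)
  call checked_write(u, 1_int64 + blob_span(a), a)
  call checked_read(u, 1_int64, b)
  if (any(a /= b)) stop 2
  close(u, status='delete')
  print *, 'positions and data consistent'
end program demo_checks
```

To compare the sequence of positions on writes against the sequence on reads, the same wrappers could also append each position to a log and the two logs could be diffed afterwards.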
Jim Dempsey
Add diagnostic code to verify the assumption "thus it should be correct".
When (if) you find something incorrect, then this may point to a programming problem or a bug.
Note, if I understand what you are trying to do, each MPI process will inquire(iolength...) for the record sizes of the other MPI processes (at least those which may have data preceding the current MPI process's data in the file). Also pay attention to the value of the IOLENGTH unit size (it may be 4 bytes or 1 byte or something else). All processes must be using the same IOLENGTH unit size. And RECL= on the OPEN may interfere with the position assumption made with inquire(iolength...).
For a formatted file, the file storage unit is an eight-bit byte. For an unformatted file, the file storage unit is an eight-bit byte (if option assume byterecl is specified) or a 32-bit word (if option assume nobyterecl, the default, is specified).
Depending on other factors, assert that all MPI processes are using the same POS unit size.
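A small probe along these lines can show which unit sizes a given build actually uses. It is only a sketch: the array x and the program name are hypothetical, and it assumes a Fortran 2008 capable compiler. It prints the size of the file storage unit reported by iso_fortran_env and derives the size of the unit that INQUIRE(IOLENGTH=) counts in, so the two can be compared across compilers, option sets (for example with and without assume byterecl), and MPI processes.

```fortran
program unit_sizes
  use iso_fortran_env, only: file_storage_size
  implicit none
  real(kind(1.0d0)) :: x(4)
  integer :: len_units

  ! Length of the output list x, in the processor's IOLENGTH units.
  inquire(iolength=len_units) x

  print *, 'bits per file storage unit :', file_storage_size
  print *, 'IOLENGTH of 4 doubles      :', len_units
  print *, 'implied bits per IOLENGTH unit:', &
           size(x) * storage_size(x) / len_units
end program unit_sizes
```

If the two unit sizes differ, a length obtained from inquire(iolength...) has to be converted before it is used to compute a POS= value.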
Jim
