Severe(59) when reading a file ANSI as UTF-8

Sebastiano_P_ · ‎09-01-2017

Hello,

My code reads a very simple file such as

1.000 2.000 3.000
11.000 12.000 13.000

However, when the file is edited by third parties, a severe (59) error often occurs. This regularly happens when the file to be read is not in UTF-8 format but in another format, e.g., ANSI.

Is there a way to read the file irrespective of its format? At the moment the read command is simply

OPEN(unit=1,file='filename.txt',status='old',action='read')
READ(1,*) (upper(i),i=1,n_par)
READ(1,*) (lower(i),i=1,n_par)
CLOSE(1)

Thank you.

PS: THE ORIGINAL PROBLEM REPORT IS INCORRECT AND MISLEADING. Please note that there is a mistake in my original question above. The code worked properly when reading ANSI formatted files and did not work when reading UTF-8 formatted files.

mecej4 · ‎09-01-2017

There are almost no easy-to-use facilities for reading arbitrary encodings such as UTF-8 and multi-byte character data in Fortran (at least not yet). You have to decide in advance what type of files are to be read, and write your program to check the format of the file (e.g., if it is UTF-8 with BOM) and then read the file accordingly.

My recommendation is that you use utility programs to convert your data files to ASCII/ANSI, process those converted files in your Fortran programs, and use another utility to convert the output back to UTF-8, etc., as needed.

If you allow third parties to edit your files and mangle them, you will have the same kind of problems that some beginners run into when they use a word processor to create Fortran source files and find that the compiler is unable to digest those source files.

Sebastiano_P_ · ‎09-01-2017

Thank you for the fast reply.
At the moment the code reads UTF-8 files but not ASCII/ANSI files. Actually, this is the first time I have encountered this problem. I've always written my input files using a simple text editor (Notepad++). Do you have any idea where I could find a good tutorial for fixing the problem, or an overview of the different ways to read files? I've always simply coded this part as above.

Thank you.

gib · ‎09-01-2017

I'm puzzled. I created a text file zzz.txt in notepad++ (ANSI encoding) with these lines:

1.00 2.00 3.00
10.0 11.0 12.0

and read it with this program:

program main
   implicit none

   character(len=32) :: fname
   integer :: n=3
   real :: x(10)

   write(*,*) 'Enter file name:'
   read(*,'(a)') fname
   open(10,file=fname,status='old')

   read(10,*) x(1:n)
   write(*,*) x(1:n)
   read(10,*) x(1:n)
   write(*,*) x(1:n)

   close(10)

end program main

I then opened zzz.txt with notepad++, changed the encoding to "Encode in UTF-8", and saved it as zzz-utf8.txt. When I run the program and I try to read this file, I get an error:

D:\Fortran>test
Enter file name:
zzz-utf8.txt
forrtl: severe (59): list-directed I/O syntax error, unit 10, file D:\Fortran\zzz-utf8.txt

It looks to me as if you have it back-to-front - you usually create ANSI text files, but sometimes other people change the encoding to UTF-8, which can't be read. Just as mecej4 assumed.

mecej4 · ‎09-01-2017

From what you have said, I see no reason why you should do anything more than exerting control over what the "third parties" are allowed to do. Tell them to use the same editor that you use and to save the modified file in the format that your program expects.

Editors such as Notepad++ are capable of detecting the type of the file, so you can run a check on the data file before letting your program read it.

jimdempseyatthecove · ‎09-01-2017

I suggest you examine the problematic UTF-8 encoded file with a Hex Dump editor to see what else is in the file.

One problem you may have (just guessing) is Notepad was doing you a "favor" by not providing just the plain ASCII code, but rather assuming that you wanted to set the character set to that of the default font. IOW, instead of 7-bit ASCII, you actually have 11-bit, 16-bit, or 21-bit characters (with flag bits), stored in 2, 3 or 4 bytes.

When I use Notepad to create your two-line file, the UTF-8 copy has two additional bytes at the start of the file.

In decimal: 239, 187

binary: 1110 1111, 1011 1011

If you can be sure that the UTF-8 files are essentially ASCII (with a few extraneous UTF-8 multi-byte characters), then open the file in stream mode, read a line of text into a character variable, and process the line character by character, removing (squishing out) any character with a binary code greater than 127. You can then use a READ statement with the character variable name in place of the unit number.
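A minimal sketch of that approach, not tested against the original poster's data. It assumes lines of at most 256 characters, three values per line, and the file name zzz-utf8.txt from gib's test above. The program name and the helper ascii_only are hypothetical names chosen for illustration.

```fortran
program strip_non_ascii
   implicit none
   integer, parameter :: maxlen = 256      ! assumed maximum line length
   character(len=maxlen) :: line, clean
   real :: x(3)                            ! assumed: three values per line
   integer :: u, ios

   ! Formatted stream access, so each READ with '(a)' returns one line of text.
   open(newunit=u, file='zzz-utf8.txt', status='old', action='read', &
        access='stream', form='formatted', iostat=ios)
   if (ios /= 0) stop 'cannot open input file'

   do
      read(u, '(a)', iostat=ios) line
      if (ios /= 0) exit                   ! end of file or read error
      clean = ascii_only(line)
      ! Internal read: the character variable takes the place of the unit number.
      read(clean, *, iostat=ios) x
      if (ios /= 0) then
         write(*,*) 'skipping unreadable line'
         cycle
      end if
      write(*,*) x
   end do
   close(u)

contains

   ! Copy s, dropping every character whose code is outside 0..127.
   function ascii_only(s) result(t)
      character(len=*), intent(in) :: s
      character(len=len(s)) :: t
      integer :: i, k, code
      t = ' '
      k = 0
      do i = 1, len(s)
         code = ichar(s(i:i))
         if (code >= 0 .and. code <= 127) then
            k = k + 1
            t(k:k) = s(i:i)
         end if
      end do
   end function ascii_only

end program strip_non_ascii
```

Dropping the high bytes removes a leading BOM along with any other multi-byte sequence, so this is only safe when the numbers themselves are written in plain ASCII digits.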

Jim Dempsey

Sebastiano_P_ · ‎09-01-2017

Thank you all for the useful replies.
Actually I accidentally inverted the files: I confirm that the ANSI does not give any problem, while the UTF-8 cannot be read.

Thank you all.

gib · ‎09-01-2017

mecej4, the OP is under the impression that he is normally working with UTF-8 files. I think he is confused, and in fact he normally works with ANSI files. Certainly his read statements are not able to read a text file created by notepad++ with numbers encoded in UTF-8.

andrew_4619 · ‎09-02-2017

To quote from Wikipedia: "Many Windows programs (including Windows Notepad) add the bytes 0xEF, 0xBB, 0xBF at the start of any document saved as UTF-8. This is the UTF-8 encoding of the Unicode byte order mark (BOM), and is commonly referred to as a UTF-8 BOM, even though it is not relevant to byte order."

1] open the file.

2] read the first three bytes.

3] if you have a UTF-8 BOM give the user a message to go away and provide a proper input file.

That will fix most of your problems.....
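The three steps above could look roughly like the following sketch. It assumes a compiler with Fortran 2003 stream access and reuses the file name zzz-utf8.txt from gib's test. The program name and the messages are made up for illustration.

```fortran
program check_bom
   implicit none
   character(len=*), parameter :: fname = 'zzz-utf8.txt'  ! assumed file name
   character(len=1) :: b(3)
   integer :: u, ios

   ! 1] open the file as a raw byte stream
   open(newunit=u, file=fname, status='old', action='read', &
        access='stream', form='unformatted', iostat=ios)
   if (ios /= 0) stop 'cannot open input file'

   ! 2] read the first three bytes (fails if the file is shorter than that)
   read(u, iostat=ios) b
   close(u)

   ! 3] compare against the UTF-8 BOM: 0xEF 0xBB 0xBF = 239 187 191
   if (ios == 0) then
      if (ichar(b(1)) == 239 .and. ichar(b(2)) == 187 .and. &
          ichar(b(3)) == 191) then
         write(*,*) 'File starts with a UTF-8 BOM; please supply an ANSI/ASCII file.'
         stop
      end if
   end if

   write(*,*) 'No UTF-8 BOM found.'
end program check_bom
```

The absence of a BOM does not prove the file is ANSI, since UTF-8 files can be saved without one; the check only catches the case described in the quote.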

mecej4 · ‎09-02-2017

Someone who reads this thread in the future is likely to be thoroughly confused because:

THE ORIGINAL PROBLEM REPORT IS INCORRECT AND MISLEADING.

It has become clear that the program expects ANSI data files and works correctly only with files in that format. The program failures were caused when third parties were allowed to modify/create data files in/to UTF-8 format and those UTF-8 files were fed to the program.

I request Sebastiano to add "PS:" notes to his messages to mitigate the confusion.

Sebastiano_P_ · ‎09-02-2017

I modified the original question.

andrew_4619, thank you for the suggestion!