Solved: What should ENCODING='UTF-8' do?

Mark_Lewy · ‎10-20-2023

The Fortran standard states:

12.5.6.9 ENCODING= specifier in the OPEN statement
The scalar-default-char-expr shall evaluate to UTF-8 or DEFAULT. The ENCODING= specifier is permitted
only for a connection for formatted input/output. The value UTF-8 specifies that the encoding form of the file
is UTF-8 as specified in ISO/IEC 10646. Such a file is called a Unicode file, and all characters therein are of ISO
10646 character kind. The value UTF-8 shall not be specified if the processor does not support the ISO 10646
character kind. The value DEFAULT specifies that the encoding form of the file is processor dependent. If this
specifier is omitted in an OPEN statement that initiates a connection, the default value is DEFAULT.

As Intel Fortran doesn't support an ISO 10646 character KIND, AFAIK:

1) Should the compiler diagnose the use of 'UTF-8' as a standards violation?

2) Is it (as I assume) behaving as if the value was 'DEFAULT'?

Is there any likelihood that Intel Fortran will support Unicode in the near future?

Steve_Lionel · ‎10-20-2023

1) No, this is not something the compiler is required to diagnose, though I think it would be good if it did.

2) The standard does not specify what the behavior should be, so it is implementation-dependent. My guess is that this value is ignored, though the compiler does complain if you say something other than DEFAULT or UTF-8.
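Because the compiler is not required to diagnose this, a program can make the check itself. The following is a minimal sketch, not anything taken from Intel's documentation: it relies on the standard intrinsic SELECTED_CHAR_KIND, which returns -1 when the processor has no ISO_10646 kind, and only specifies ENCODING='UTF-8' when the kind exists. The file name out.txt is a hypothetical placeholder.

```fortran
program check_utf8_support
  implicit none
  ! -1 here means the processor has no ISO 10646 character kind.
  integer, parameter :: ucs4 = selected_char_kind('ISO_10646')
  integer :: u, ios
  character(len=256) :: msg

  if (ucs4 < 0) then
    print '(a)', 'No ISO_10646 kind: opening with the default encoding.'
    ! out.txt is a hypothetical file name used only for illustration.
    open (newunit=u, file='out.txt', action='write', &
          iostat=ios, iomsg=msg)
  else
    print '(a,i0)', 'ISO_10646 kind value: ', ucs4
    open (newunit=u, file='out.txt', action='write', &
          encoding='UTF-8', iostat=ios, iomsg=msg)
  end if

  if (ios /= 0) then
    print '(a)', trim(msg)
    stop 1
  end if

  write (u, '(a)') 'hello'
  close (u)
end program check_utf8_support
```

With this pattern the program never specifies UTF-8 on a processor where the standard says it shall not be specified, whatever the compiler would otherwise do with the value.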

Mark_Lewy · ‎10-25-2023

Thanks Steve, that's what I thought.

As a follow-up, this is the issue we have: our simulation engines process (text) job files that look like Windows INI files and contain paths. For example:

[Files]
2DF=C:\ProgramData\Innovyze\InfoWorksAgent\SA_14D4C022-9EB8-4CD4-B7F4-F4C67614F748\iwswnet2#4.2df
2DZ=C:\ProgramData\Innovyze\InfoWorksAgent\SA_14D4C022-9EB8-4CD4-B7F4-F4C67614F748\iwswnet2#4.2dz
IWR=C:\Users\lewym\AppData\Local\Innovyze\Results Folder\14D4C022-9EB8-4CD4-B7F4-F4C67614F748\iwswsim12.iwr
QIN=C:\ProgramData\Innovyze\InfoWorksAgent\SA_14D4C022-9EB8-4CD4-B7F4-F4C67614F748\iwswsim12eventiid81229.qin
RunStatistics=C:\Users\lewym\AppData\Local\Innovyze\Results Folder\14D4C022-9EB8-4CD4-B7F4-F4C67614F748\iwswsim12.analytics
RPT=C:\Users\lewym\AppData\Local\Innovyze\Results Folder\14D4C022-9EB8-4CD4-B7F4-F4C67614F748\iwswsim12.rpt
INP=C:\ProgramData\Innovyze\InfoWorksAgent\SA_14D4C022-9EB8-4CD4-B7F4-F4C67614F748\sa_14d4c022-9eb8-4cd4-b7f4-f4c67614f748_218_12_lewym_20231011_095044sim_job.inp

---END---

What happens if the username contains non-ASCII characters or (as users can) someone chooses a non-default location for results?

The documentation for FILE= is not very forthcoming:

FILE = name

name	Is a character or numeric expression. The name can be any pathname allowed by the operating system. Any trailing blanks in the name are ignored.

What does "allowed by the operating system" mean? Does it assume that the characters are encoded for the current code page on Windows? What about Linux? Encoding the paths as UTF-8 does not appear to work on Windows.

andrew_4619 · ‎10-25-2023

What does "allowed by the operating system" mean? Well, for example, a Windows file name cannot contain *, ? and a list of other characters, but the Fortran runtime uses other libraries lower down for the detailed file handling, so that is where an error is likely to be generated. That aspect is implementation/OS specific and not part of the Fortran standard.

The upper layer of the Intel Fortran runtime assumes that names are ASCII character strings (someone correct me if I am out of date here). If you want to do file I/O using Unicode names that the OS supports, you will need to use other utilities (such as the Windows SDK) for file management and I/O.

It is not clear to me from the question what your specific problem is. Is it only reading files that are UTF-8 encoded, or is it creating/writing files based on what is read?

Mark_Lewy · ‎10-25-2023

The problem, I believe, is that the file entries are being written to the job file as UTF-8, and the Fortran code that reads the job file reads them into character variables. These are subsequently used as FILE specifiers in OPEN statements, which can fail. I suspect that the underlying implementation of OPEN on Windows is a call to OpenFile, whose documentation says: "The string must consist of characters from the 8-bit Windows character set. The OpenFile function does not support Unicode file names or opening named pipes." So, unsurprisingly, passing a UTF-8 encoded string fails. In most cases this is a non-issue, as the process that normally creates the job file writes paths that are ASCII, but there is the potential for oddities in the cases I mentioned above and when other applications create the job file.

I suppose the answer is to encode the paths for the current code page on Windows when creating the job file.

In the longer term, if Intel Fortran had the Unicode character kind “ISO_10646”, as GNU Fortran (for example) does, we could use UTF-8 encoding.
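For reference, this is roughly what that looks like on a compiler that does provide the kind. It is a sketch assuming GNU Fortran or another processor where SELECTED_CHAR_KIND('ISO_10646') is non-negative; it will not compile where the kind is missing. The file name job_utf8.txt and the path text are hypothetical.

```fortran
program utf8_roundtrip
  implicit none
  ! Requires a processor that supports the ISO 10646 character kind.
  integer, parameter :: ucs4 = selected_char_kind('ISO_10646')
  character(kind=ucs4, len=32) :: path
  character(kind=ucs4, len=64) :: line
  integer :: u

  ! Build a string containing U+00E9 (e with acute accent) from its
  ! code point, so the source file itself can stay pure ASCII.
  path = ucs4_'ren' // char(int(z'E9'), ucs4) // ucs4_'/job.inp'

  ! Write the string; the runtime encodes it as UTF-8 in the file.
  open (newunit=u, file='job_utf8.txt', encoding='UTF-8', &
        action='write', status='replace')
  write (u, '(a)') trim(path)
  close (u)

  ! Read it back; the runtime decodes UTF-8 into the ucs4 variable.
  open (newunit=u, file='job_utf8.txt', encoding='UTF-8', &
        action='read', status='old')
  read (u, '(a)') line
  close (u)

  if (line == path) print '(a)', 'round trip ok'
end program utf8_roundtrip
```

Note that this only covers the contents of the file. The FILE= specifier is still a default-kind character expression, so a path held in an ISO_10646 variable would need converting before it could be passed to OPEN.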

jimdempseyatthecove · ‎10-25-2023

This article might be informative.

And this thread might be informative (see post at 01-16-2012 03:16 AM by Karanta__Antti)

Jim Dempsey

Mark_Lewy · ‎10-25-2023

Thanks Jim, that confirms my thinking.

Barbara_P_Intel · ‎10-25-2023

Here's a shot in the dark! I'm no expert by far, but will this UNICODE routine help? MBConvertUnicodeToMB. Here's a summary of routines for National Language Support. These are Intel specials.

I investigated something similar for a customer a few years ago.

Their question:

If we READ a Unicode, multibyte file path (as a CHAR*) and pass it untouched to an OPEN statement, will it work on a Japanese OS or Korean OS or Chinese/Mandarin OS?

The solution:

subroutine test_file_open(filename,len)
  USE IFNLS
  integer :: len
  !DIR$ ATTRIBUTES VALUE :: len
  integer(2)::filename(len) ! array contains the Unicode file name
  integer(4):: res
  character*100:: ffname
  res = MBConvertUnicodeToMB(filename,ffname) ! do the conversion, return the result string length
  write(*,*) ffname(1:res)
  open (8, file=ffname(1:res), action='WRITE') ! pass result MB string to OPEN statement
  write (8,*) 'Testing file writing'
  close (8)
end subroutine

andrew_4619 · ‎10-25-2023

I played with National Language Support a few years back and found it was broken. I would bet no one has worked on it since that time...

Mark_Lewy · ‎10-26-2023

Thanks Barbara, that's one way of doing this.

I found some other code of ours that was using the WideCharToMultiByte Windows API from the kernel32 module to convert Unicode (wchar_t) to multibyte. This was using CP_UTF8 (UTF-8) except for strings that were going to be used as FILE specifiers in OPEN statements, which are converted with CP_ACP (ANSI code page). So, I think I've answered my own question.
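That approach can be sketched for the job-file case, where the path arrives as UTF-8 bytes rather than wchar_t. This is an illustration only and not the code referred to above: it goes from UTF-8 to UTF-16 with MultiByteToWideChar and then to the ANSI code page with WideCharToMultiByte. The interfaces are hand-written with BIND(C) rather than taken from Intel's kernel32 module, which assumes 64-bit Windows (32-bit builds would need the STDCALL calling convention), and the module name, function name utf8_to_acp and buffer sizing are all hypothetical choices.

```fortran
module utf8_to_acp_mod
  use, intrinsic :: iso_c_binding
  implicit none
  private
  public :: utf8_to_acp

  ! Code page identifiers from the Windows SDK.
  integer(c_int), parameter :: CP_ACP = 0, CP_UTF8 = 65001

  ! Hand-written interfaces (assumption: 64-bit Windows, kernel32 linked).
  interface
    function MultiByteToWideChar(cp, flags, src, nsrc, dst, ndst) &
        bind(C, name='MultiByteToWideChar') result(n)
      import :: c_int, c_char, c_int16_t
      integer(c_int), value :: cp, flags, nsrc, ndst
      character(kind=c_char), intent(in) :: src(*)
      integer(c_int16_t) :: dst(*)
      integer(c_int) :: n
    end function MultiByteToWideChar

    function WideCharToMultiByte(cp, flags, src, nsrc, dst, ndst, &
        defchar, used) bind(C, name='WideCharToMultiByte') result(n)
      import :: c_int, c_char, c_int16_t, c_ptr
      integer(c_int), value :: cp, flags, nsrc, ndst
      integer(c_int16_t), intent(in) :: src(*)
      character(kind=c_char) :: dst(*)
      type(c_ptr), value :: defchar, used
      integer(c_int) :: n
    end function WideCharToMultiByte
  end interface

contains

  ! Returns the path re-encoded for the ANSI code page, or a
  ! zero-length string if either conversion fails.
  function utf8_to_acp(utf8) result(acp)
    character(len=*), intent(in) :: utf8
    character(len=:), allocatable :: acp, buf
    integer(c_int16_t), allocatable :: wide(:)
    integer(c_int) :: nw, nb

    acp = ''
    if (len(utf8) == 0) return

    ! UTF-16 never needs more code units than UTF-8 needs bytes.
    allocate (wide(len(utf8)))
    nw = MultiByteToWideChar(CP_UTF8, 0_c_int, utf8, &
                             int(len(utf8), c_int), wide, &
                             int(size(wide), c_int))
    if (nw <= 0) return

    ! Generous bound (assumption): at most 4 bytes per UTF-16 unit.
    allocate (character(len=4*nw) :: buf)
    nb = WideCharToMultiByte(CP_ACP, 0_c_int, wide, nw, buf, &
                             int(len(buf), c_int), &
                             c_null_ptr, c_null_ptr)
    if (nb > 0) acp = buf(1:nb)
  end function utf8_to_acp

end module utf8_to_acp_mod
```

A caller would then write something like open (newunit=u, file=utf8_to_acp(trim(path)), ...). Characters that the current code page cannot represent are replaced by the system default character, so such paths will still fail to open; only a wide-character API avoids that.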

Barbara_P_Intel · ‎10-25-2023

A few years ago this sample worked.

Steve_Lionel · ‎10-26-2023

I've successfully used a USEROPEN routine (DEC/Intel extension) to open a file using UTF-8 encoding in the file path. Sorry, I no longer have the example (unless it's in this forum somewhere), but it did work. I do wish Intel would get UTF-8 supported in Fortran, as other compilers already have.