Intel® Fortran Compiler

Unicode characters in file name in OPEN statement

Karanta__Antti
New Contributor I
2,106 Views

While debugging I noticed that the OPEN statement fails to open a file when the file name contains special characters (in this case Scandinavian) encoded as UTF-8. This is natural, as the statement tries to interpret the string as ASCII and the byte sequence does not make sense that way.

Here's the statement:

[fortran]         OPEN(FU,FILE=FILNAM,STATUS='NEW',ERR=10,IOSTAT=IOST)
[/fortran]

The error raised is 29 (file not found).

The directory in which the file is to be created exists, but its name contains a special character.

Now the question is, how do I handle file names with special characters? I receive the file name from C code, encoded as UTF-8.
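
To make the mismatch concrete, here is a minimal C sketch (not part of the original code; the helper name utf8_encode is made up for illustration) that encodes a code point as UTF-8. For "ä" (U+00E4) it produces the two bytes C3 A4, whereas Windows-1252 stores the same letter as the single byte E4, so a routine expecting one encoding cannot make sense of the other.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: encode one Unicode code point (up to U+FFFF,
   surrogates not checked) as UTF-8. Returns the number of bytes written. */
size_t utf8_encode(unsigned int cp, unsigned char out[3])
{
    assert(out != NULL && cp <= 0xFFFFu);
    if (cp < 0x80u) {                 /* ASCII: one byte, unchanged */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800u) {                /* two bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0u | (cp >> 6));
        out[1] = (unsigned char)(0x80u | (cp & 0x3Fu));
        return 2;
    }
    /* three bytes: 1110xxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xE0u | (cp >> 12));
    out[1] = (unsigned char)(0x80u | ((cp >> 6) & 0x3Fu));
    out[2] = (unsigned char)(0x80u | (cp & 0x3Fu));
    return 3;
}
```

Calling utf8_encode(0xE4, buf) fills buf with C3 A4 and returns 2, while plain ASCII letters come back as a single unchanged byte, which is why only paths with non-ASCII characters are affected.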

13 Replies
tropfen
New Contributor I
Hello,

I am not sure, but encoding='UTF-8' might help.

[fortran]OPEN(FU,FILE=FILNAM,ENCODING='UTF-8',STATUS='NEW',ERR=10,IOSTAT=IOST)
[/fortran]
Frank
rase
New Contributor I
The specifier "encoding='UTF-8'" refers to the content of the file, not the file name. I had a similar problem with German umlauts a long time ago (on Windows XP; maybe things have changed). As far as I remember, there was no other solution than to avoid the strange characters.
Steven_L_Intel1
Employee
Sorry, we do not yet support UTF-8 in character values.
anthonyrichards
New Contributor III
Why not just get the C code to rename the file to something acceptable?

There may be a multilingual code page that you can set Windows to use as well.
Maybe the Latin (Western European) code page 1252?

See http://en.wikipedia.org/wiki/Windows-1252

To find out your system's active code page, use

chcp

from a DOS prompt

You can change the code page using

chcp 1252

from the same DOS prompt.
Karanta__Antti
New Contributor I

Renaming the file is not an option. The problem is in the directory name, and I can't tell the user not to use directories with letters from his native language. The directory with which this first came up is the Swedish Windows XP equivalent of Local Settings, something like "Lokala inställningar".

Speaking of code pages, does OPEN accept a string encoded using the local Windows code page? That would be a fairly straightforward string conversion.

anthonyrichards
New Contributor III
So why not try changing the system code page to 1252, which contains a-umlaut (ä), and see what happens?
You can always change it back.
Karanta__Antti
New Contributor I

OK, I got it to work by converting the UTF-8 string to the local Windows code page (I did not need to change the code page with chcp).

I used the Windows function MultiByteToWideChar to first convert the UTF-8 string to UTF-16, and then WideCharToMultiByte to convert the UTF-16 string to the local Windows code page. The OPEN statement seems to handle the file name with special characters just fine now.

anthonyrichards
New Contributor III
Well done.
Is it permissible for you to post the essential details of the code that you developed successfully?
It would be much appreciated I'm sure.
Karanta__Antti
New Contributor I

Sure, this code is in no way a trade secret, just "normal" fiddling with string conversions. I hope this saves someone else some time:

Here are the C functions:

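[cpp]#include <windows.h>  /* MultiByteToWideChar, WideCharToMultiByte, SetLastError */
#include <assert.h>
#include <wchar.h>    /* wcslen */

/* OSInt, nosCAllocate and nosFree are project-specific (see the note after the Fortran code). */

extern wchar_t* nosUTF8ToUTF16( const char * utf8string, OSInt nchars ) {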
  int requiredSize, writtenSize ;
  wchar_t * result ;

  SetLastError( 0 ) ;

  requiredSize = 1 +
    MultiByteToWideChar( CP_UTF8, 
                         0, 
                         utf8string, 
                         (int)nchars, 
                         NULL, 
                         0 // if 0, func returns the size of the required buffer (in wchar_t)
                         ) ;
    
  result = nosCAllocate( requiredSize, sizeof( wchar_t ) ) ;
                       
  writtenSize = 
    MultiByteToWideChar( CP_UTF8, 
                         0, 
                         utf8string, 
                         (int)nchars, 
                         result, 
                         requiredSize
                         ) ;

  assert( requiredSize == writtenSize + 1 ) ;

  result[ writtenSize ] = 0 ;
  assert( writtenSize == wcslen( result ) ) ;

  return result ;
}

extern char* nosUTF16ToLocalCodePage( const wchar_t * utf16string, OSInt nchars ) {

  int requiredSize, writtenSize ;
  char * result ;

  requiredSize = 1 +
    WideCharToMultiByte( CP_ACP,
                         0, 
                         utf16string, 
                         (int)nchars, 
                         NULL, 
                         0, 
                         NULL, 
                         NULL
                         ) ;

  result = nosCAllocate( requiredSize, 1 ) ;

  writtenSize =
    WideCharToMultiByte( CP_ACP,
                         0, 
                         utf16string, 
                         (int)nchars, 
                         result, 
                         requiredSize, 
                         NULL,
                         NULL
                         ) ;

  assert( writtenSize + 1 == requiredSize ) ;

  result[ writtenSize ] = 0 ;

  return result ;
}

extern char* nosUTF8ToLocalCodePage( const char * utf8string, OSInt nchars ) {

  char *local_string = NULL ;
  wchar_t* utf16tmp ;

  utf16tmp = nosUTF8ToUTF16( utf8string, nchars ) ;

  local_string = nosUTF16ToLocalCodePage( utf16tmp, -1 ) ;

  nosFree( utf16tmp ) ;

  return local_string ;
}
[/cpp]
And here's the Fortran binding:

[fortran]MODULE XXXX

   INTERFACE 
      
      ! length of null terminated string (for c interop)
      PURE INTEGER( KIND = C_SIZE_T ) FUNCTION OS_STRLEN( STR ) BIND( C, NAME = "strlen" )
         USE, INTRINSIC :: ISO_C_BINDING
         TYPE( C_PTR ), INTENT(IN), VALUE :: STR
      END FUNCTION

      SUBROUTINE NOS_FREE( C_POINTER ) BIND( C, NAME = "nosFree" )
         USE, INTRINSIC :: ISO_C_BINDING
         TYPE( C_PTR ), INTENT(IN), VALUE :: C_POINTER
      END SUBROUTINE

      FUNCTION NOS_UTF8_TO_LOCAL_CODE_PAGE( UTF8_STRING, NCHARS ) RESULT( RESULT_STRING_PTR ) BIND( C, NAME = "nosUTF8ToLocalCodePage" )
         USE, INTRINSIC :: ISO_C_BINDING
         
         CHARACTER( KIND = C_CHAR ), INTENT(IN) :: UTF8_STRING(*)
         INTEGER, INTENT(IN), VALUE :: NCHARS
         TYPE( C_PTR ) :: RESULT_STRING_PTR
      END FUNCTION

      TYPE(C_PTR) FUNCTION OS_STRNCPY( TARGET_STRING, C_POINTER, N ) BIND( C, NAME = "memcpy" )
         USE, INTRINSIC :: ISO_C_BINDING
         ! destination buffer: memcpy writes to it, so it must not be INTENT(IN)
         CHARACTER( KIND = C_CHAR ), INTENT(OUT) :: TARGET_STRING(*)
         TYPE( C_PTR ), INTENT(IN), VALUE :: C_POINTER
         INTEGER( KIND = C_SIZE_T ), VALUE :: N
      END FUNCTION

   END INTERFACE

CONTAINS
   FUNCTION OS_UTF8_TO_LOCAL_CODE_PAGE( STR ) RESULT( RESULT_STR )

      USE, INTRINSIC :: ISO_C_BINDING

      CHARACTER( LEN = * ), INTENT(IN) :: STR

      CHARACTER( LEN = : ), ALLOCATABLE :: RESULT_STR

      TYPE( C_PTR ) :: CHAR_PTR, IGNORED
      INTEGER( KIND = C_INT ) :: STR_LENGTH

      IF ( STR == '' ) THEN
         RESULT_STR = STR
      ELSE
         CHAR_PTR = NOS_UTF8_TO_LOCAL_CODE_PAGE( STR, LEN( STR ) )
         STR_LENGTH = OS_STRLEN( CHAR_PTR )
         ALLOCATE( CHARACTER( LEN = STR_LENGTH ) :: RESULT_STR )

         IGNORED = OS_STRNCPY( RESULT_STR, CHAR_PTR, INT( STR_LENGTH, C_SIZE_T ) )
         ASSERT( C_ASSOCIATED( IGNORED, C_LOC( RESULT_STR ) ), '' )
         CALL NOS_FREE( CHAR_PTR )
      END IF
   END FUNCTION
END MODULE XXXX
[/fortran]
Except for our custom wrappers around calloc and free, the typedef OSInt (which matches the default integer size in Fortran), and our port of the C-style ASSERT to Fortran, the code does not use anything application-specific and should be quite easily reusable by others.

anthonyrichards
New Contributor III
Thanks.
However, I was hoping it would all be in Fortran (i.e. use the Fortran wrappers for the multi-byte Windows API functions)!

Can you please explain, for a non-C programmer, how this solved your failure to OPEN with a Fortran file name string containing characters such as a-umlaut? You did not appear to change the code page, but used CP_ACP in the C code, which appears to just refer to an "ANSI code page", which I presume means the one already in use. What code page is set for your system?

What would happen if your code was used on a system set to use, say, Code page 850 if a file was found/supplied with a name containing a umlaut or similar?
Karanta__Antti
New Contributor I

I did not use all Fortran as I find playing around with C functions a lot easier to do in C.

From http://support.microsoft.com/kb/108450

"CP_ACP instructs the API to use the currently set default Windows ANSI codepage."

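My current code page as reported by chcp (note that chcp shows the console's OEM code page, which can differ from the ANSI code page that CP_ACP selects):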

C:\Users\nak>chcp
Active code page: 437

I worked under the assumption that OPEN wants the file name encoded using the current Windows code page. It seems to work, although I have only tested it with a few unproblematic cases and a few that did not work previously.

If the local Windows code page were such that the special characters in the original string could not be represented, they would be replaced with a "default" character, making the path different from the intended one. However, in my case it is enough that I can handle the paths on the user's file system, and I am assuming those paths contain only characters that can be represented in the local Windows code page. That seems like a reasonable assumption, although I'm not sure how Windows represents file names internally.
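
The replacement behaviour described above can be detected rather than silently accepted. On Windows, the last parameter of WideCharToMultiByte (lpUsedDefaultChar) reports whether any character had to be substituted. The following portable C sketch mimics that idea under simplifying assumptions: it targets ISO 8859-1 (Latin-1) rather than the real ANSI code page, it handles only 1- to 3-byte UTF-8 sequences, and the function name utf8_to_latin1 is invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: convert NUL-terminated UTF-8 to Latin-1.
   Code points above U+00FF and unsupported or truncated sequences
   become '?', and *used_default is set to 1 when that happens
   (0 otherwise). Returns the number of bytes written, excluding
   the terminating NUL. Output is truncated if out_size is too small. */
size_t utf8_to_latin1(const char *in, char *out, size_t out_size,
                      int *used_default)
{
    const unsigned char *p = (const unsigned char *)in;
    size_t n = 0;

    assert(in != NULL && out != NULL && out_size > 0 && used_default != NULL);
    *used_default = 0;

    while (*p != 0 && n + 1 < out_size) {
        unsigned int cp;
        size_t extra, i;

        if (*p < 0x80u)                 { cp = *p;         extra = 0; }
        else if ((*p & 0xE0u) == 0xC0u) { cp = *p & 0x1Fu; extra = 1; }
        else if ((*p & 0xF0u) == 0xE0u) { cp = *p & 0x0Fu; extra = 2; }
        else                            { cp = 0x100u;     extra = 0; } /* unsupported lead byte */
        p++;

        for (i = 0; i < extra; i++) {
            if ((*p & 0xC0u) != 0x80u) { cp = 0x100u; break; } /* truncated sequence */
            cp = (cp << 6) | (*p & 0x3Fu);
            p++;
        }

        if (cp > 0xFFu) {               /* not representable in Latin-1 */
            cp = '?';
            *used_default = 1;
        }
        out[n++] = (char)cp;
    }
    out[n] = '\0';
    return n;
}
```

With the UTF-8 bytes for "ä" the flag stays 0 and the output holds the single byte E4. With the euro sign (U+20AC) the output is "?" and the flag is set, which is the point at which a caller could refuse to open the file instead of silently using a different path.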

anthonyrichards
New Contributor III
From your posts it would appear that you are receiving a string of bytes (UTF-8) that is not an ANSI character string and is therefore not acceptable to Fortran as a valid file name, so clearly you have to convert it to an ANSI string before Fortran can use it. I misunderstood: I thought you were receiving a character string such as "c:\ardvrk2.txt" and were unable to open it because of the "ä" characters, which you clearly should be able to do with most code pages, since they do contain "ä".
Paul_B_4
Beginner

I successfully opened (and wrote to) files whose paths contain Unicode characters on Windows by using the USEROPEN callback of the OPEN statement. The callback invokes the Windows function CreateFileW (as suggested here). I expect reading files to be similar.

 
