Unicode characters in file name in OPEN statement

Karanta__Antti · ‎01-11-2012

While debugging I noticed that the OPEN statement fails to open a file when the file name contains special characters (in this case Scandinavian) encoded as UTF-8. This is natural, as the statement tries to interpret the string as ASCII and the byte sequence does not make sense that way.

Here's the statement:

[fortran]         OPEN(FU,FILE=FILNAM,STATUS='NEW',ERR=10,IOSTAT=IOST)
[/fortran]

The error raised is 29 (file not found).

The directory in which the file should be created exists, but its name contains a special character.

Now the question is, how do I handle file names with special characters? I receive the file name from C code, encoded as UTF-8.

tropfen · ‎01-12-2012

Hello,

I am not sure, but encoding='UTF-8' might help.

[fortran]OPEN(FU,FILE=FILNAM,encoding='UTF-8',STATUS='NEW',ERR=10,IOSTAT=IOST)
[/fortran]

Frank

rase · ‎01-12-2012

The specifier "encoding='UTF-8'" refers to the content of the file, not the file name. I had a similar problem with German umlauts a long time ago (Windows XP, maybe things have changed). As far as I remember, there was no other solution than to avoid the strange characters.

Steven_L_Intel1 · ‎01-12-2012

Sorry, we do not yet support UTF-8 in character values.

anthonyrichards · ‎01-12-2012

Why not just get the C code to rename the file to something acceptable?

There may be a multi-lingual code page that you can set Windows to use as well.
Maybe the Latin (Western European) code page 1252?

See http://en.wikipedia.org/wiki/Windows-1252

To find out your system's active code page use

chcp

from a DOS prompt

You can change the code page using

chcp 1252

from the same DOS prompt.

Karanta__Antti · ‎01-13-2012

Renaming the file is not an option. The problem is in the directory name, and I can't tell users not to use directories with letters from their native language. The directory with which this first came up is the Swedish Windows XP equivalent of Local Settings, something like "Lokala inställningar".

Speaking of code pages, does OPEN accept a string encoded using the local Windows code page? That would be a fairly straightforward string conversion.

anthonyrichards · ‎01-14-2012

So try changing the system code page to 1252, which contains a-umlaut, and see what happens?
You can always change it back.

Karanta__Antti · ‎01-16-2012

OK, got it to work by converting the UTF-8 string to the local Windows code page (did not need to change it with chcp).

I used the Windows function MultiByteToWideChar to first convert the UTF-8 string to UTF-16, and then WideCharToMultiByte to convert the UTF-16 string to the local Windows code page. The OPEN statement seems to handle the file name with special characters just fine now.

anthonyrichards · ‎01-16-2012

Well done.
Is it permissible for you to post the essential details of the code that you developed successfully?
It would be much appreciated I'm sure.

Karanta__Antti · ‎01-16-2012

Sure, this code is in no way a trade secret, just "normal" fiddling with string conversions. I hope this saves someone else some time:

Here are the C functions:

[cpp]#include <assert.h>
#include <wchar.h>
#include <windows.h>

extern wchar_t* nosUTF8ToUTF16( const char * utf8string, OSInt nchars ) {
  int requiredSize, writtenSize ;
  wchar_t * result ;

  SetLastError( 0 ) ;

  requiredSize = 1 +
    MultiByteToWideChar( CP_UTF8, 
                         0, 
                         utf8string, 
                         (int)nchars, 
                         NULL, 
                         0 // if 0, func returns the size of the required buffer (in wchar_t)
                         ) ;
    
  result = nosCAllocate( requiredSize, sizeof( wchar_t ) ) ;
                       
  writtenSize = 
    MultiByteToWideChar( CP_UTF8, 
                         0, 
                         utf8string, 
                         (int)nchars, 
                         result, 
                         requiredSize
                         ) ;

  assert( requiredSize == writtenSize + 1 ) ;

  result[ writtenSize ] = 0 ;
  assert( writtenSize == wcslen( result ) ) ;

  return result ;
}

extern char* nosUTF16ToLocalCodePage( const wchar_t * utf16string, OSInt nchars ) {

  int requiredSize, writtenSize ;
  char * result ;

  requiredSize = 1 +
    WideCharToMultiByte( CP_ACP,
                         0, 
                         utf16string, 
                         (int)nchars, 
                         NULL, 
                         0, 
                         NULL, 
                         NULL
                         ) ;

  result = nosCAllocate( requiredSize, 1 ) ;

  writtenSize =
    WideCharToMultiByte( CP_ACP,
                         0, 
                         utf16string, 
                         (int)nchars, 
                         result, 
                         requiredSize, 
                         NULL,
                         NULL
                         ) ;

  assert( writtenSize + 1 == requiredSize ) ;

  result[ writtenSize ] = 0 ;

  return result ;
}

extern char* nosUTF8ToLocalCodePage( const char * utf8string, OSInt nchars ) {

  char *local_string = NULL ;
  wchar_t* utf16tmp ;

  utf16tmp = nosUTF8ToUTF16( utf8string, nchars ) ;

  local_string = nosUTF16ToLocalCodePage( utf16tmp, -1 ) ;

  nosFree( utf16tmp ) ;

  return local_string ;
}
[/cpp]

And here's the Fortran binding:

[fortran]MODULE XXXX

   INTERFACE 
      
      ! length of null terminated string (for c interop)
      PURE INTEGER( KIND = C_SIZE_T ) FUNCTION OS_STRLEN( STR ) BIND( C, NAME = "strlen" )
         USE, INTRINSIC :: ISO_C_BINDING
         TYPE( C_PTR ), INTENT(IN), VALUE :: STR
      END FUNCTION

      SUBROUTINE NOS_FREE( C_POINTER ) BIND( C, NAME = "nosFree" )
         USE, INTRINSIC :: ISO_C_BINDING
         TYPE( C_PTR ), INTENT(IN), VALUE :: C_POINTER
      END SUBROUTINE

      FUNCTION NOS_UTF8_TO_LOCAL_CODE_PAGE( UTF8_STRING, NCHARS ) RESULT( RESULT_STRING_PTR ) BIND( C, NAME = "nosUTF8ToLocalCodePage" )
         USE, INTRINSIC :: ISO_C_BINDING
         
         CHARACTER( KIND = C_CHAR ), INTENT(IN) :: UTF8_STRING(*)
         INTEGER, INTENT(IN), VALUE :: NCHARS
         TYPE( C_PTR ) :: RESULT_STRING_PTR
      END FUNCTION

      ! copies N bytes from C_POINTER into TARGET_STRING (bound to C memcpy)
      TYPE(C_PTR) FUNCTION OS_STRNCPY( TARGET_STRING, C_POINTER, N ) BIND( C, NAME = "memcpy" )
         USE, INTRINSIC :: ISO_C_BINDING
         CHARACTER( KIND = C_CHAR ), INTENT(OUT) :: TARGET_STRING(*)
         TYPE( C_PTR ), INTENT(IN), VALUE :: C_POINTER
         INTEGER( KIND = C_SIZE_T ), VALUE :: N
      END FUNCTION

   END INTERFACE

CONTAINS
   FUNCTION OS_UTF8_TO_LOCAL_CODE_PAGE( STR ) RESULT( RESULT_STR )

      USE, INTRINSIC :: ISO_C_BINDING

      CHARACTER( LEN = * ), INTENT(IN) :: STR

      CHARACTER( LEN = : ), ALLOCATABLE :: RESULT_STR

      TYPE( C_PTR ) :: CHAR_PTR, IGNORED
      INTEGER( KIND = C_INT ) :: STR_LENGTH

      IF ( STR == '' ) THEN
         RESULT_STR = STR
      ELSE
         CHAR_PTR = NOS_UTF8_TO_LOCAL_CODE_PAGE( STR, LEN( STR ) )
         STR_LENGTH = OS_STRLEN( CHAR_PTR )
         ALLOCATE( CHARACTER( LEN = STR_LENGTH ) :: RESULT_STR )

         IGNORED = OS_STRNCPY( RESULT_STR, CHAR_PTR, INT( STR_LENGTH, C_SIZE_T ) )
         ASSERT( C_ASSOCIATED( IGNORED, C_LOC( RESULT_STR ) ), '' )
         CALL NOS_FREE( CHAR_PTR )
      END IF
   END FUNCTION
END MODULE XXXX
[/fortran]

Apart from our custom wrappers around calloc and free (nosCAllocate and nosFree), the OSInt typedef that matches the default integer size in Fortran, and our port of the C-style ASSERT to Fortran, the code does not use anything application-specific and should be fairly easy for others to reuse.

anthonyrichards · ‎01-16-2012

Thanks.
However, I was hoping it would all be in FORTRAN (i.e. use the Fortran wrappers for the multi-byte Windows API functions)!

However, can you please explain for a non-C programmer how it solved your failure to 'OPEN' with a Fortran filename string containing characters such as a-umlaut? You did not appear to change the code page, but used CP_ACP in the C code, which appears to just refer to an 'ANSI code page', which I presume means the one already in use. What code page is set for your system?

What would happen if your code were used on a system set to use, say, code page 850, and a file was found or supplied with a name containing a-umlaut or similar?

Karanta__Antti · ‎01-16-2012

I did not use all Fortran as I find playing around with C functions a lot easier to do in C.

From http://support.microsoft.com/kb/108450

"CP_ACP instructs the API to use the currently set default Windows ANSI codepage."

My current code page:

C:\Users\nak>chcp
Active code page: 437

I worked under the assumption that OPEN wants the file name encoded using the current Windows code page. It seems to work, although I only tested a few unproblematic cases and a few that did not work previously.

If the local Windows code page could not represent the special characters in the original string, they would be replaced with a "default" character, making the path different from the intended one. However, in my case it is enough that I can handle the paths on the user's file system, and I am assuming those paths contain only characters that can be represented using the local Windows code page. That seems like a reasonable assumption, although I'm not sure how Windows represents file names internally.
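
The replacement behaviour described above can be illustrated with a small portable sketch. It is not the Windows conversion itself: `utf8_to_latin1` is a hypothetical helper that targets ISO-8859-1 as a simple stand-in for a single-byte code page, and it counts the characters it had to replace with `?`. On Windows, WideCharToMultiByte reports the same condition through its final lpUsedDefaultChar argument, which the functions posted earlier pass as NULL. Note also that CP_ACP selects the ANSI code page, which can differ from the console code page that chcp reports.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the code posted in this thread.
 * Converts a NUL-terminated UTF-8 string to ISO-8859-1 (Latin-1).
 * Code points above U+00FF and malformed sequences become '?'.
 * Returns the number of replaced characters, or -1 if dst is too small. */
int utf8_to_latin1(const char *src, char *dst, size_t dst_size)
{
    const unsigned char *s = (const unsigned char *)src;
    size_t out = 0;
    int replaced = 0;

    if (dst_size == 0)
        return -1;

    while (*s != 0) {
        unsigned long cp = 0;   /* decoded code point */
        unsigned long min = 0;  /* smallest code point valid for this length */
        int extra = 0;          /* continuation bytes still expected */
        int ok = 1;

        if (*s < 0x80)                { cp = *s; }
        else if ((*s & 0xE0) == 0xC0) { cp = *s & 0x1F; extra = 1; min = 0x80; }
        else if ((*s & 0xF0) == 0xE0) { cp = *s & 0x0F; extra = 2; min = 0x800; }
        else if ((*s & 0xF8) == 0xF0) { cp = *s & 0x07; extra = 3; min = 0x10000; }
        else                          { ok = 0; }
        s++;

        while (ok && extra > 0) {
            /* A missing continuation byte (including the NUL) ends the sequence. */
            if ((*s & 0xC0) != 0x80) { ok = 0; break; }
            cp = (cp << 6) | (unsigned long)(*s & 0x3F);
            s++;
            extra--;
        }
        if (cp < min)
            ok = 0;             /* reject overlong encodings */

        if (out + 1 >= dst_size)
            return -1;          /* keep room for the terminating NUL */
        if (ok && cp <= 0xFF) {
            dst[out++] = (char)(unsigned char)cp;
        } else {
            dst[out++] = '?';
            replaced++;
        }
    }
    dst[out] = '\0';
    return replaced;
}
```

A caller could treat any non-zero return value as a sign that the converted path no longer names the intended file, and report an error instead of passing it to OPEN.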

anthonyrichards · ‎01-16-2012

From your posts it would appear that you are receiving a string of bytes (UTF-8) which is not equal to an ANSI character string and is therefore not acceptable to FORTRAN as a valid string, so clearly you have to convert it to an ANSI string before FORTRAN can use it as a filename. I misunderstood, and thought you were receiving a character string such as "c:\ardvrk2.txt" and were unable to open it because of the "" characters, which you clearly should be able to do with most code pages, which do contain "".

Paul_B_4 · ‎06-10-2014

I successfully opened (and wrote to) file paths with Unicode characters on Windows using the USEROPEN call-back of the OPEN function. The call-back invokes Windows CreateFileW (as suggested here). I expect reading files to be similar.
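
A minimal C sketch of the core step this approach relies on, converting a UTF-8 path to UTF-16 and handing it to CreateFileW, might look as follows. It is not the poster's code, and the USEROPEN glue on the Fortran side is omitted. The function name `write_file_utf8_path` is hypothetical, and the non-Windows branch exists only so the sketch builds on other systems, where file names are plain byte strings and UTF-8 passes through unchanged.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper: creates (or truncates) the file named by a
 * NUL-terminated UTF-8 path and writes `text` to it.
 * Returns 0 on success, -1 on failure. */
int write_file_utf8_path(const char *utf8_path, const char *text)
{
#ifdef _WIN32
    /* First call: query the buffer size in wchar_t units. Passing -1 as
     * the input length makes the count include the terminating NUL. */
    int wlen = MultiByteToWideChar(CP_UTF8, 0, utf8_path, -1, NULL, 0);
    if (wlen == 0)
        return -1;

    wchar_t *wpath = malloc((size_t)wlen * sizeof *wpath);
    if (wpath == NULL)
        return -1;

    /* Second call: perform the UTF-8 to UTF-16 conversion. */
    if (MultiByteToWideChar(CP_UTF8, 0, utf8_path, -1, wpath, wlen) == 0) {
        free(wpath);
        return -1;
    }

    /* The wide-character API takes the UTF-16 path directly, so the
     * ANSI code page is never involved. */
    HANDLE h = CreateFileW(wpath, GENERIC_WRITE, 0, NULL,
                           CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    free(wpath);
    if (h == INVALID_HANDLE_VALUE)
        return -1;

    DWORD len = (DWORD)strlen(text);
    DWORD written = 0;
    BOOL ok = WriteFile(h, text, len, &written, NULL);
    CloseHandle(h);
    return (ok && written == len) ? 0 : -1;
#else
    /* Assumption for non-Windows builds: the path is used as-is. */
    FILE *f = fopen(utf8_path, "wb");
    if (f == NULL)
        return -1;
    int ok = (fputs(text, f) >= 0);
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
#endif
}
```

Unlike the code-page conversion earlier in the thread, this route does not depend on the characters being representable in the local Windows code page.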