Unicode and ASCII characters

bbeyer · ‎04-24-2012

Hello!
I have a question concerning the use of characters in a Fortran write statement, such as "²" in the statements:
cfacu = "(in²) "
write(6,'(A)') cfacu(1:5)

I have used this type of character from the Courier New font for many years in CVF and they worked fine. I found (probably out of ignorance!) that if the characters would show correctly in the CVF text editor, they would print OK as well and read correctly in editors such as Notepad. I assume that this character is outside of the standard 128-character ASCII set. Up until recently, these characters worked fine in the Intel Composer with Visual Studio 2010, but I recently started seeing this text printing as: (inÂ²). This appears to be using two characters (2 bytes?) instead of the single character for ². Is this related to a transition to Unicode, and is there a setting to make the write statement function as it did previously? I had experimented with Greek symbols in a write statement previously and they worked fine, although as I recall I was asked if I wanted to save the file in Unicode; I did not see this question on this file, and the text appears correct in the IFC/Visual Studio text editor. Thanks for any help on this.

SergeyKostrov · ‎04-24-2012

Quoting bbeyer

...I assume that this character is outside of the standard 128 ASCII set...

You can use the CRT function 'isascii' in a 3-line C test application to verify it.
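For readers without a C compiler to hand, the same check can be sketched in Python instead of the CRT call (a substitution, not Sergey's code). It assumes the character in question is the superscript two, U+00B2, as identified later in the thread; `is_ascii` is a hypothetical helper name.

```python
def is_ascii(ch: str) -> bool:
    """Return True if the single character ch lies in the 7-bit ASCII range 0..127."""
    return ord(ch) < 128


superscript_two = "\u00b2"  # assumption: the character the question is about

print(is_ascii("n"))              # an ordinary letter
print(is_ascii(superscript_two))  # the superscript two
```

A result of False for the second call would confirm that the character lies outside the 128-character ASCII set.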

Best regards,
Sergey

IanH · ‎04-24-2012

Your CVF file was probably in the single byte Windows-1252 encoding, where the superscript two is at position Z'B2' (using the Fortran syntax for hexadecimal numbers).

What may have happened is that your file has been switched inadvertently to the multibyte UTF-8 encoding, perhaps because characters not in Windows-1252 got entered or copied into it temporarily and Visual Studio has "done the right thing" to avoid information loss. The Unicode codepoint for superscript 2 is still Z'B2', but in the UTF-8 encoding that gets represented in two bytes - Z'C2' Z'B2', the reappearance of Z'B2' being somewhat of an accidental consequence of the way UTF-8 works for codepoints from Z'80' to Z'BF'.
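The byte values described above can be checked with a few lines of Python, shown here as a rough sketch using only the built-in "cp1252" and "utf-8" codecs. The variable names and the helper `c2_prefix_holds` are made up for illustration.

```python
superscript_two = "\u00b2"  # Unicode codepoint U+00B2

cp1252_bytes = superscript_two.encode("cp1252")  # single-byte Windows-1252 form
utf8_bytes = superscript_two.encode("utf-8")     # multibyte UTF-8 form

print(cp1252_bytes.hex())
print(utf8_bytes.hex())


def c2_prefix_holds() -> bool:
    """Check that every codepoint from 0x80 to 0xBF encodes as 0xC2 plus the same value."""
    return all(chr(cp).encode("utf-8") == bytes([0xC2, cp]) for cp in range(0x80, 0xC0))


print(c2_prefix_holds())
```

The helper covers the whole range Z'80' to Z'BF', so the repeated Z'B2' is a property of that range rather than of this one character.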

If that multibyte sequence is interpreted back in the Windows-1252 codepage then the Z'C2' corresponds to the A with caret. The compiler is probably just passing the character literal straight through from the source and I don't think ifort's runtime knows about UTF-8 - ergo you may see strange output (details of which might depend on the application and font being used to show the output). If the application consuming the output knew that it was dealing with UTF-8, then it could reassemble the correct Unicode character. That might be tricky for the console*, but Notepad can certainly do this.
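As a small illustration of that misreading (a sketch only, assuming the literal is "(in²)" and that the bytes pass through the compiler unchanged), the same six bytes can be decoded both ways in Python:

```python
# Bytes a UTF-8 source file would hold for the literal (assumed to be "(in²)").
utf8_bytes = "(in\u00b2)".encode("utf-8")

misread = utf8_bytes.decode("cp1252")   # viewer that assumes Windows-1252
recovered = utf8_bytes.decode("utf-8")  # viewer that knows the data is UTF-8

# ascii() keeps the output printable on any console codepage.
print(len(utf8_bytes), ascii(misread), ascii(recovered))
```

The Windows-1252 reading contains one extra character, U+00C2 (the A with caret), ahead of the superscript two, while the UTF-8 reading gives back the original five characters.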

You can inspect/change the encoding of a file in VS2010 using File > Advanced Save Options. Alternatively, use File > Save xxx As, select the little down arrow next to the Save button in the following dialog, and select "Save with Encoding".

You can use the inbuilt VS binary editor to see what the in-file byte representation of a character is - File > Open > File, select the filename, then click on the little down arrow next to Open, select Open With, then select Binary Editor.

(*Edit: actually just worked out that it is easy - type CHCP 65001 before you run your program...)

mecej4 · ‎04-25-2012

If your code is using only the restricted set of single-byte characters with codes from 128 to 255 (beyond the 7-bit ASCII range of 0 to 127), it will run fine with Intel Fortran, too. If, on the other hand, you do want to use Unicode in one of its encodings (UTF-8, etc.), the code and the computer environment will have to follow certain protocols to have the results come out correctly. Perhaps you should state how you compile, what version you are using, etc.

Running the code with IFort 12.1.3, I find that the single byte character set is used, with the 'superscript-2' represented by 0xFD, and the result is as given below.
[fortran]
program sqo
character*5 cfacu
cfacu = "(in²) "
write(6,'(A)') cfacu(1:5)
end program sqo
[/fortran]

00000020 6163 7520 3d20 2228 696e fd29 2020 2020 acu = "(in})

T:\lang>sqo
(in²)
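One possible reading of the 0xFD byte, offered as an assumption rather than something stated in the post: text entered at a CMD prompt is normally stored in the console's OEM codepage, and in codepage 437 the superscript two sits at 0xFD. A quick Python check of how that single byte decodes under two codepages:

```python
byte = b"\xfd"  # the byte shown in the hex dump

as_cp437 = byte.decode("cp437")    # OEM codepage 437 (assumed console codepage)
as_cp1252 = byte.decode("cp1252")  # Windows-1252, for comparison

# ascii() keeps the output printable on any console codepage.
print(ascii(as_cp437), ascii(as_cp1252))
```

Under that assumption the file is still single-byte, just in a different codepage from the Windows-1252 file discussed earlier, which would explain why it prints correctly in the same console.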

bbeyer · ‎04-25-2012

Thanks for the replies. I am running Version 12.1.1106.2010 update 8. I figure that something was inadvertently changed as Ian indicates. How can I check what encoding is being used? Does the encoding apply to all files and projects within a solution?

Bill

bbeyer · ‎04-25-2012

Thanks Ian. I found that the Advanced Save Options did not always appear on the menu and that I had to select the tab before the options would be available, rather than selecting the file in the solution explorer. I also found that when selecting a project in the solution explorer, three options are available under save-as for the *.vcproj: UTF-8, ASCII, Unicode. My projects were all set to UTF-8; I tried changing them to ASCII but this did not fix the problem. I also found the "auto detect UTF-8 encoding without signature" option within the text editor and tried turning it off but this did not fix the problem either...

mecej4 · ‎04-25-2012

If you are going to work with Unicode files, there are several issues that you need to be aware of. There are different flavors of encoding schemes, such as UTF-8, UCS-2 (little and big-endian variants), etc. UTF-8 files may or may not have a BOM (byte order mark) at the beginning of the file. If the mark is not there and the program using the file (or the OS) does not know the correct type, things can go wrong.

The Cygwin port of the Unix/Linux file utility is useful in this context.

T:\>file temp.txt
temp.txt: Little-endian UTF-16 Unicode English text, with very long lines, with CRLF, CR line terminators
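Where the file utility is not available, the byte order mark check can be roughly approximated in Python. This is a minimal sketch, not a replacement for `file`: it only looks at the first few bytes, `sniff_bom` is a hypothetical helper name, and a result of "unknown" says nothing about which single-byte codepage a BOM-less file uses.

```python
import codecs


def sniff_bom(data: bytes) -> str:
    """Guess an encoding from a leading byte order mark; return 'unknown' if none is found."""
    # UTF-32 is tested before UTF-16 because the UTF-32 LE mark
    # begins with the same two bytes as the UTF-16 LE mark.
    candidates = [
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
    ]
    for bom, name in candidates:
        if data.startswith(bom):
            return name
    return "unknown"


print(sniff_bom(codecs.BOM_UTF8 + b"write(6,'(A)')"))  # UTF-8 with signature
print(sniff_bom(b"(in\xb2)"))                          # single-byte text, no mark
```

To apply it to a real file, read the first four bytes in binary mode and pass them to the function.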

IanH · ‎04-25-2012

*.vcproj? Anyway, it sounds like you are saving the project settings file with different encoding. Change tracks! You want to play games with the source files.

Open the source file (*.f90; or *.f/*.for if at one stage you thought wearing flares was cool) in Visual Studio. I don't think you even need a solution loaded. Have the editor pane for that file active (as if you were going to edit it). Then go looking for File > Advanced Save Options - you should get a list of about two million different encoding options. Whatever is initially presented as the encoding is what VS has used to load that file (note that you can explicitly nominate the encoding when you open a file if you select the little down arrow next to the Open button in the File > Open ... dialog, choose Open With, then choose "Source Code (Text) Editor with Encoding").

If your native tongue was strongly influenced in the past by the language of the Roman empire, then "Western European (Windows) - Codepage 1252" is probably what you want. I suspect VS remembers your choice for that particular file. Failing that, a file saved in 1252 with characters at positions 128 and above is unlikely to look like valid UTF-8, so the auto-detection logic would then probably work OK.
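Why a 1252 file with such characters tends not to pass as UTF-8 can be sketched in Python. This is not Visual Studio's actual detection logic, which is not documented in this thread; `looks_like_utf8` is a hypothetical helper that simply tries a strict UTF-8 decode.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


saved_as_1252 = "(in\u00b2)".encode("cp1252")  # contains a lone byte 0xB2
saved_as_utf8 = "(in\u00b2)".encode("utf-8")   # contains the pair 0xC2 0xB2

print(looks_like_utf8(saved_as_1252), looks_like_utf8(saved_as_utf8))
```

In UTF-8 a byte in the range 0x80 to 0xBF is only valid as a continuation of a multibyte sequence, so the lone 0xB2 from the 1252 file is rejected. Pure ASCII files pass either way, since the two encodings agree on codes 0 to 127.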

(mecej4 - out of curiosity - did you create the source file for your sqo example program using VS or some other editor?)

mecej4 · ‎04-25-2012

> mecej4 - out of curiosity - did you create the source file for your sqo example program using VS or some other editor?

In #3, I simply cut and pasted the text given by the original poster, at a CMD window.

In #6, the file used was a known Unicode file containing English and a non-European language, having no relation to Fortran.