Beginner
7 Views

ifort 17 beta: coarray image-to-image transfer of character data

I have a somewhat subtle problem with the ifort 17 beta when doing image-to-image data transfer of character data using derived type coarrays. The subtle part of this problem: the problem only occurs with a slightly sophisticated code structure. I tried some simple example programs but none of them raised any problem.
Just to give a first coarray example program:

  • you may download the .f90 source code files, the start.txt file, as well as the start folder (containing further required text files) from the GitHub repository at: https://github.com/MichaelSiehl/MPMD-with-Fortran-2008-Coarrays/tree/master/src
  • open the start.txt file with an editor and replace the directory path therein with your actual path to the downloaded 'start' folder
  • compile the program using ifort 17 beta (on Linux): ifort -coarray -coarray-num-images=13 OOOGglob_Globals.f90 OOOEerro_admError.f90 OOOPstpa_admStartPath.f90 OOOPimsc_admImageStatus_CA.f90 OOOPtmec_admTeamMember_CA.f90 OOOPtemc_admTeamManager_CA.f90 OOOPimmc_admImageManager_CA.f90 OOOPinmc_admInitialManager_CA.f90 OOOPtmem_admTeamMember.f90 OOOPtema_admTeamManager.f90 OOOPinma_admInitialManager.f90 OOOPimma_admImageManager.f90 Main_Sub.f90 Main.f90 -o a.out
  • run the compiled program from the command line: a.out
  • the program contains a primitive sequential error handler (which may raise an error itself due to a read statement). The error messages are not helpful, since they point to a 'File-Open-error', which is not the cause of the error.
  • instead, the real source of the error is the coarray image-to-image data transfer of character data: the character data transfer itself does happen, but on the remote image the first letter of the character data turns into some undefined character.
  • This error occurs only at a later point during program execution: preceding image-to-image character data transfers (with very similar source code) do not raise any problem.

I remember a similar problem with ifort 13 or 14, and also a simple workaround for it: change the scalar character (derived type coarray component) into a (one-element) character array (derived type coarray component). I will try this trick shortly, just to check whether it works with the ifort 17 beta.
Nevertheless, it currently seems to be a bug in the ifort 17 beta.
best regards
Michael
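
The workaround described above (scalar character component changed into a one-element character array) might look roughly like the following. This is only a minimal sketch: the type names, the integer component, the component names, and the file name are hypothetical and are not taken from the repository, and a program this small may well not trigger the reported problem at all.

```fortran
! Sketch only: all type and component names here are hypothetical.
module example_types
  implicit none

  ! Variant of the kind reported to misbehave: scalar character component.
  type :: scalar_char_t
    integer :: m_intCount = 0
    character(len=40) :: m_chrName = ''
  end type scalar_char_t

  ! Workaround variant: the same component declared as a one-element array.
  type :: array_char_t
    integer :: m_intCount = 0
    character(len=40), dimension(1) :: m_chrName = ''
  end type array_char_t
end module example_types

program workaround_sketch
  use example_types
  implicit none
  type(array_char_t), save :: obj[*]
  character(len=40) :: local_name

  local_name = 'TeamMembers.txt'   ! illustrative value
  if (this_image() == 1 .and. num_images() >= 2) then
    ! Remote store into image 2 through the one-element array component.
    obj[2] % m_chrName(1) = local_name
  end if
  sync all
  if (this_image() == 2) then
    ! Image 2 prints what arrived in its local copy.
    write (*, '(a)') trim(obj % m_chrName(1))
  end if
end program workaround_sketch
```

It could be built with the same `-coarray` and `-coarray-num-images` options used in the compile command above, with at least two images.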

0 Kudos
13 Replies
Highlighted
7 Views

I am looking at this now.

I did notice an unrelated coding error in OOOGglob_subSetProcedures - you need to copy the test (OOOGglob_intStackTraceCounter > 1) from OOOGglob_subResetProcedures.

I am aware of a bug (which should be fixed in the beta update) related to a concatenation of a coindexed character reference with a local value, but I don't see any uses of that here.

0 Kudos
Highlighted
7 Views

I can build and run the program, but haven't found anything similar to what you describe. There are some additional array bounds errors in the procedure stack code, but they appear harmless. The routine that prompts for an action after an error might run on an image other than 1, which can't read from unit *. The various start files assume at least 13 images, a constraint not noted in the description.

When I run with 13 images all the team managers start and the program exits without any error messages. (If I run with more than 13 images then the program hangs as the other images appear to be waiting for something to do.)

What exactly goes wrong for you and how did you conclude it is a character value issue?
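
As an aside on the point above about reading from unit *: in a coarray program, standard input is preconnected only on image 1, so an interactive prompt is usually guarded with `this_image()`. The following is a minimal sketch of that guard, not the repository's error handler; the program name, variable names, and messages are hypothetical.

```fortran
program guarded_prompt_sketch
  implicit none
  character(len=1) :: chrAnswer
  integer :: intStatus

  if (this_image() == 1) then
    ! Only image 1 is connected to standard input, so only it may prompt.
    write (*, '(a)') 'Error reported. Continue? (y/n)'
    read (*, '(a)', iostat=intStatus) chrAnswer
    ! Treat a failed read (for example, end of file) as "no".
    if (intStatus /= 0) chrAnswer = 'n'
  else
    ! Other images report the condition without prompting.
    write (*, '(a,i0)') 'Error reported on image ', this_image()
  end if
end program guarded_prompt_sketch
```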

0 Kudos
Highlighted
Beginner
7 Views

Thanks very much for investigating.
It is good news if you can run the example program 'as it is' without any error message with 13 coarray images using ifort 17 beta. Then the bug could be related (and hopefully limited) only to my own current systems (Linux Ubuntu 14.04 LTS). Nevertheless, I will do some further investigations, just to be sure.

|I did notice an unrelated coding error in OOOGglob_subSetProcedures - you need to copy the test (OOOGglob_intStackTraceCounter > 1) from OOOGglob_subResetProcedures.
This is indeed unrelated to the current problem. The subSetProcedures routine is very primitive and just to maintain a rudimentary stack trace for the purpose of this example program. I plan to develop a parallel stack trace in the near future.

|I am aware of a bug (which should be fixed in the beta update) related to a concatenation of a coindexed character reference with a local value, but I don't see any uses of that here.
From my current investigation I am somewhat confident that the problem on my system is related to the image-to-image data transfer only of the character data.

|I can build and run the program, but haven't found anything similar to what you describe.
The example program raises the error but, you are right, it does not show it directly on screen. I made some modifications to make it visible, but these are not online yet. (see below)

|The routine that prompts for an action after an error might run from an image other than 1, which can't read from unit *.
The (sequential) error handler is extremely primitive indeed and should not contain a read statement with a coarray program. Nevertheless, it's easy to uncomment and the read statement is still useful in some situations (mostly when the error occurs on image1). I also plan to develop a parallel error handler in the near future. This is unrelated to the current problem.

|The various start files assume at least 13 images, a constraint not noted in the description.
|(If I run with more than 13 images then the program hangs as the other images appear to be waiting for something to do.)
In their original state, the files in the start folder require (exactly) 13 coarray images for the execution of the example program. Nevertheless, the files' content can be altered so that the example program runs with more or fewer images. (This is briefly explained in section 3.1 of the accompanying PDF document: https://github.com/MichaelSiehl/MPMD-with-Fortran-2008-Coarrays/blob/master/MPMD_with_Coarray_Fortra...) This is also unrelated to the current problem.

|When I run with 13 images all the team managers start and the program exits without any error messages.
Great. This would mean that the coarray character data transfer was successful and that the bug does not occur on your system. (You should also see start messages from the 9 team member images on your screen).

|What exactly goes wrong for you and how did you conclude it is a character value issue?
I will shortly give you a few code changes (write statements) that make the image-to-image character data transfer failure visible on screen. This is possible because the data transfer takes place, but single letters of the character variables turn into something else after the remote transfer. Also, if you can't reproduce the error on your own system, I could give you a screenshot from my own system. (I hope there is some screenshot facility for Linux Ubuntu.)
Best Regards
Michael S.

0 Kudos
Highlighted
7 Views

A screenshot will help me understand what you are seeing, but if I can't reproduce it there's little I can do to investigate. 

0 Kudos
Highlighted
Beginner
7 Views

| A screenshot will help me understand what you are seeing, but if I can't reproduce it there's little I can do to investigate. 

Indeed. As a coarray developer, I have a strong personal interest in further investigation myself. Thus, as a next step, I will first try out other common Linux distributions (Ubuntu 16.04, for example) myself, just to be sure the 'bug' is real. If so, a follow-up step would be to open another GitHub repository with an adapted example program (reporting explicitly whether the bug occurs) and better explanations (pointing directly to the source code in question) to make it easier for others to investigate this. I will inform you shortly.

0 Kudos
Highlighted
Beginner
7 Views

Update:
In my own further testing, the problem with the image-to-image transfer of character data remains a real one. I believe it is a bug in the ifort 17 beta (on my Linux computer). I used the ifort 17 beta update 2 on Linux Ubuntu 14.04 LTS and 16.04 LTS (the latter being an unsupported OS, but this does not make a difference in practice) on a shared memory computer (laptop).
I streamlined the original program and opened a new GitHub repository with two (nearly identical) test programs: one with a derived type coarray containing a scalar character component (raising a run-time error due to an image-to-image transfer failure of the character data), and another one using a derived type coarray containing a (one-element) character array component (running without any problems). See my explanations at:
https://github.com/MichaelSiehl/Coarray_Test_Program_for_Character_Data
The critical code is in the source code file OOOPimmc_admImageManager_CA.f90: within the functioning test program (using an array character component) only four source code lines are different from the error-raising version. These source code lines are marked with the date stamp '160726'. I hope this helps.
Best Regards
 

0 Kudos
Highlighted
Black Belt
7 Views

If the code on the github repository is compiled with runtime checks, then out of bounds array access errors are reported very early in the life of the program.  Memory corruption is a plausible symptom of this sort of programming error.
 

0 Kudos
Highlighted
Beginner
7 Views

Hi Ian,
thanks very much for investigating and pointing that out. As Steve has already pointed out, the array bounds errors are due to a coding error in subroutine OOOGglob_subResetProcedures (a very rudimentary stack trace). I forgot to comment out this code but have done so now. The coarray error is not related to that. Please download the newly corrected code from the GitHub repository for further investigation. When using ifort's -check option, you should no longer see any array bounds errors, but also no coarray error at all. The error occurs only when the code is compiled without the -check option:

ifort -coarray -coarray-num-images=3 OOOGglob_Globals.f90 OOOEerro_admError.f90 OOOPstpa_admStartPath.f90 OOOPimsc_admImageStatus_CA.f90 OOOPtmec_admTeamMember_CA.f90 OOOPtemc_admTeamManager_CA.f90 OOOPimmc_admImageManager_CA.f90 OOOPinmc_admInitialManager_CA.f90 OOOPtmem_admTeamMember.f90 OOOPtema_admTeamManager.f90 OOOPinma_admInitialManager.f90 OOOPimma_admImageManager.f90 Main_Sub.f90 Main.f90 -o a.out

Best Regards

0 Kudos
Highlighted
7 Views

Ok, I can reproduce this and will investigate further. Thanks for your effort to narrow it down.

0 Kudos
Highlighted
7 Views

What is happening is that the store of the coindexed component in:

Object_CA[intImageNumber] % m_chrTeamMembersFileName = chr_TeamMembersFileName

is picking up data at the start of the derived type object rather than at offset 4 where the character field lives. If you swap the component order in the declaration of OOOPimmc_adtImageManager_CA so that the character value comes first, it works correctly. I also confirmed that 16.0.3 is ok but 17.0 beta is not.

It might be that it's the coindexed fetch that is wrong instead.

This is clearly a bug and I will now send it on to the developers. Unfortunately it's too late to get a fix in for the initial 17.0 product release but we'll try to get it fixed for Update 1. The issue ID is DPD200413120.
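
For readers who want to see the shape of the failing statement in isolation, here is a minimal sketch under stated assumptions. Only `Object_CA`, `intImageNumber`, `m_chrTeamMembersFileName`, and `chr_TeamMembersFileName` come from the statement quoted above; the type name, the integer component placed ahead of the character component (assumed to be 4 bytes, which would put the character data at offset 4), and the file name are hypothetical. The original poster reported that simple programs did not show the problem, so this sketch may not reproduce it.

```fortran
program offset_check_sketch
  implicit none

  ! Assumed layout: a default integer followed by the character component.
  ! The integer component name is hypothetical; the real type
  ! may contain other components.
  type :: image_manager_t
    integer :: m_intHypothetical = 0
    character(len=40) :: m_chrTeamMembersFileName = ''
  end type image_manager_t

  type(image_manager_t), save :: Object_CA[*]
  character(len=40) :: chr_TeamMembersFileName
  integer :: intImageNumber

  chr_TeamMembersFileName = 'TeamMembers_1.txt'   ! illustrative value
  intImageNumber = num_images()                   ! store to the last image

  if (this_image() == 1) then
    ! The coindexed store of the character component.
    Object_CA[intImageNumber] % m_chrTeamMembersFileName = chr_TeamMembersFileName
  end if
  sync all

  if (this_image() == intImageNumber) then
    ! Compare the received value with the value every image set locally.
    if (Object_CA % m_chrTeamMembersFileName == chr_TeamMembersFileName) then
      write (*, '(a)') 'transfer ok'
    else
      write (*, '(a)') 'transfer corrupted: ' // Object_CA % m_chrTeamMembersFileName
    end if
  end if
end program offset_check_sketch
```

Swapping the two component declarations so that the character component comes first corresponds to the reordering described above.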

0 Kudos
Highlighted
Beginner
7 Views

ok, thanks very much.

0 Kudos
Highlighted
7 Views

I expect this to be fixed in Update 2 to Parallel Studio XE 2017.

0 Kudos
Highlighted
Beginner
7 Views

Thanks for the effort on this. In the meantime, a simple workaround is to compile with ifort's -check option.

0 Kudos