Solved: intrinsic function "ptr1(:,:,:) = ptr2(:,:,:)" segfaults -- ifort bug

Jens_Henrik_Goebbert · ‎10-07-2008

Short Description:

=> Loop over i,j,k ("ptr1(i,j,k) = ptr2(i,j,k)") works fine, but "ptr1(:,:,:) = ptr2(:,:,:)" segfaults

The attached program segfaults if it uses the intrinsic function ptr1(:,:,:)=ptr2(:,:,:) to copy data of one pointer-array to another pointer-array. The number of array-elements must exceed a certain number - in my case it segfaults with 129*128*256 but works fine with 65*64*128.

I cannot see any Fortran-Style-Violation, that's why I am sure it is a bug of the Intel Fortran Compiler 10.1.011

Long Description:

The attached program is doing the following:

allocate one big chunk of memory - "allocate(mem3d(msize(1),msize(2),msize(3),2), stat=ierr)"
associating two 3d-pointer with parts of allocated memory:
ptr1(msize(1),msize(2),msize(3)) => mem3d(msize(1),msize(2),msize(3),1)
ptr2(msize(1),msize(2),msize(3)) => mem3d(msize(1),msize(2),msize(3),2)
fill ptr1 and ptr2 with some data (no segfault)
ptr1(:,:,:) = 1
ptr2(:,:,:) = 2
loop over i,j,k (no segfault)
ptr1(i,j,k) = ptr2(i,j,k)")
use intrinsic function to copy data (SEGFAULT)
ptr1(:,:,:) = ptr2(:,:,:)

Observances:

It seems not to be a direct problem of the size of memory in bytes, but of the number of elements. If I use complex type instead of real it segfaults occurrs with the same number of elements even though complex takes double memory.

Assumption:

I noticed that memory-address of mem3d and the ptrs are marked with "Sparse" if I debug using TotalView. I wonder if the intrinsic function accesses memory, which is not yet allocated because of the compressed data-format (Sparse) in memory.

System:

Dell PowerEdge 1950, 16 GByte RAM, 4x Intel Xeon CPU 5160 @ 3.00GHz
Fedora 6 (2.6.22.9-61.fc6 #1 SMP Thu Sep 27 18:07:59 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux)
Intel Fortran Compiler 10.1.011 and 9.1 tested
no optimisation -g -check all -traceback -fp-stack-check
mem(129,128,256) => segfault
mem( 65, 64,128) => NO segfault

==============================================================================

Source:

program ptrbug

implicit none

integer :: mem_id, msize(3)
parameter(msize = (/129,128,256/)) !=> segfault
! parameter(msize = (/65,64,128/)) !=> no segfault

real(kind=8),dimension(:,:,:,:), allocatable, target :: mem3d
real(kind=8), dimension(:,:,:), pointer :: ptr1, ptr2

! complex(kind=8),dimension(:,:,:,:), allocatable, save, target :: mem3d
! complex(kind=8), dimension(:,:,:), pointer :: ptr1, ptr2

integer :: i,j,k,ierr
ierr = 0

! ---------------------------
! allocate dynamic memory and set pointer
! ---------------------------
write(*,*) 'allocate memory'

allocate(mem3d(msize(1),msize(2),msize(3),2), stat=ierr)
if(ierr.ne.0) write(*,*) 'cannot allocate mem for mem3d'
mem3d(:,:,:,:) = 0.d0

do mem_id=1, 2
call set_ptr( mem3d(1,1,1,mem_id), mem_id)
enddo

! ---------------------------
! print infos
! ---------------------------
write(*,*) 'ptr1-dims: '
write(*,*) ' lbound: ', lbound(ptr1,1), lbound(ptr1,2), lbound(ptr1,3)
write(*,*) ' ubound: ', ubound(ptr1,1), ubound(ptr1,2), ubound(ptr1,3)
write(*,*) ' asize: ', size(ptr1,1), size(ptr1,2), size(ptr1,3)

write(*,*) 'ptr2-dims: '
write(*,*) ' lbound: ', lbound(ptr2,1), lbound(ptr2,2), lbound(ptr2,3)
write(*,*) ' ubound: ', ubound(ptr2,1), ubound(ptr2,2), ubound(ptr2,3)
write(*,*) ' asize: ', size(ptr2,1), size(ptr2,2), size(ptr2,3)

! ---------------------------
! tests
! ---------------------------
write(*,*) 'test 1 (fill ptr1)'
ptr1(:,:,:) = 1

write(*,*) 'test 2 (fill ptr2)'
ptr2(:,:,:) = 2

write(*,*) 'test 3 (copy data using do-loops)'
do k=1,msize(3)
do j=1,msize(2)
do i=1,msize(1)
ptr1(i,j,k) = ptr2(i,j,k)
enddo
enddo
enddo

write(*,*) 'test 4 (copy data using intrinsic function)'
ptr1(:,:,:) = ptr2(:,:,:)

stop

contains

!========================================
!
! assign pointer to allocated dynamic 3d memory
!
!========================================
subroutine set_ptr(ref_mem3d, mem_id)
implicit none

! function args
real(kind=8),dimension( msize(1),msize(2),msize(3)), target :: ref_mem3d
! complex(kind=8), dimension( msize(1),msize(2),msize(3)), target :: ref_mem3d
integer, intent(in) :: mem_id

if(mem_id .eq. 1) then
ptr1 => ref_mem3d
else if(mem_id .eq. 2) then
ptr2 => ref_mem3d
endif

end subroutine set_ptr

end

Kevin_D_Intel · ‎10-07-2008

The compiler uses stack temporaries to accomplish the assignment of the form:

ptr1(:,:,:) = ptr2(:,:,:)

The program runs successfully when compiled with: -heap-arrays

Or if one increases the shell stack limit via:

For Bash/sh/ksh use: ulimit -s unlimited

For Csh use: limit stacksize unlimited

I do not believe this represents a bug with the compiler, but if it does I will post again.

View solution in original post

Kevin_D_Intel · ‎10-07-2008

The compiler uses stack temporaries to accomplish the assignment of the form:

ptr1(:,:,:) = ptr2(:,:,:)

The program runs successfully when compiled with: -heap-arrays

Or if one increases the shell stack limit via:

For Bash/sh/ksh use: ulimit -s unlimited

For Csh use: limit stacksize unlimited

I do not believe this represents a bug with the compiler, but if it does I will post again.

Jens_Henrik_Goebbert · ‎10-07-2008

Thanks for your fast answer. It works fine now. :) Great!

You are right, I am not sure to call that "bug", too. I would call it than a "feature request" to make the compiler check things like that and advice to use -heap-arrays, if the program was compiled with -check-all. It was very very tricky to track this, because even debugging did not help and I wasn't expecting an intrinsic function to fail silently.

And of course: thanks to Intel for the compiler

jimdempseyatthecove · ‎10-07-2008

Quoting - Kevin Davis (Intel)

The compiler uses stack temporaries to accomplish the assignment of the form:

ptr1(:,:,:) = ptr2(:,:,:)

The program runs successfully when compiled with: -heap-arrays

Or if one increases the shell stack limit via:

For Bash/sh/ksh use: ulimit -s unlimited

For Csh use: limit stacksize unlimited

I do not believe this represents a bug with the compiler, but if it does I will post again.

In this particular assignment stack temporaries are not required and should not be used.

What happens if you use

ptr1 = ptr2

i.e. remove both(:,:,:) from statement use "=" do not use "=>" on the aboveassignment

Jim Dempsey

Jens_Henrik_Goebbert · ‎10-08-2008

Quoting - jimdempseyatthecove

Quoting - Kevin Davis (Intel)

The compiler uses stack temporaries to accomplish the assignment of the form:

ptr1(:,:,:) = ptr2(:,:,:)

The program runs successfully when compiled with: -heap-arrays

Or if one increases the shell stack limit via:

For Bash/sh/ksh use: ulimit -s unlimited

For Csh use: limit stacksize unlimited

I do not believe this represents a bug with the compiler, but if it does I will post again.

In this particular assignment stack temporaries are not required and should not be used.

What happens if you use

ptr1 = ptr2

i.e. remove both (:,:,:) from statement use "=" do not use "=>" on the above assignment

Jim Dempsey

I removed (:,:,:) and compiled with "-g -check all -traceback -fp-stack-check" using 10.1.011, but

ptr1 = ptr2 segfaults silently, too.

Greetings
Jens

Kevin_D_Intel · ‎10-08-2008

The guidance from the Fortran front-end developer is that because the pointers (ptr1, ptr2) are of the same type they can easily be setup to point to overlapping memory, therefore from the front-end point of view the temp is required. While they said one might argue the back-end should be able to figure out no overlap exists and remove the temp creation and use, they doubted such was likely.

jimdempseyatthecove · ‎10-08-2008

Quoting - Kevin Davis (Intel)

The guidance from the Fortran front-end developer is that because the pointers (ptr1, ptr2) are of the same type they can easily be setup to point to overlapping memory, therefore from the front-end point of view the temp is required. While they said one might argue the back-end should be able to figure out no overlap exists and remove the temp creation and use, they doubted such was likely.

The underlaying copy function can examine the array descriptors to check for (or how) memory overlaps and select to call memcpy or memmove (or there equivilent). There is no need to allocate a temporary (either on stack or from heap) to perform this type of copy. Only when the descriptors are sufficently convoluted where it becomes difficult to make overlap determination possible would it then be required to use a temporary. In the original case posted, all the subscript declarations had no specified stride (stride=1) an the first two subscript declarations had the same indexing ranges. Therefore the runtime system, in this example,could easily determine non-overlapping memory and could select to call memcpy (or its equivilent).

Intel prides itself on optimizing compilers and I would think that this type of optimization would be expected. In your other optimization code (e.g. SSE) the code path determination ismore difficult of a process than something as trivial as examining array descriptors for overlaps.

Jim Dempsey

TimP · ‎10-08-2008

There are some loose ends in this business of optimizing possibly overlapping source and destination. In general, allowing for overlap requires memmove() type support rather than memcpy(). Vectorizing overlapping operations often requires unaligned loads, which haven't shown competitive performance on past Intel CPUs, so the compilers have avoided that entire area. ifort freqently allocates a temporary where it is concerned about overlap, rather than checking whether reversal may be needed. I'd like to see this change; loop reversal as an optimization technique has had recognized importance since f90 was invented, and prototype CPUs have good hardware support for it.

The C++ standard apparently leaves certain operations, e.g. copy(), undefined in case of overlap, yet the accepted STL implementations use memmove() for copy(), which is the worst of both worlds. Current g++ changed the implementation of STL and threw out Intel C++ optimization. In C++, we may frequently pay the penalty of allowing for overlap, without fully guaranteeing that it will work or get optimized. C++, rightly or wrongly, doesn't appear likely to adopt any standard equivalent of C restrict, so it continues to form a poor basis for optimizing Fortran run-time libraries.

jimdempseyatthecove · ‎10-08-2008

Tim,

As shown by the original post, the driving issue was not performance of the copy operation. Instead it was the Seg Fault introduced by the copy operation allocating stack space for the temporary. The primary goal of eliminating the temporary was to avoid the Seg Fault in the majority of cases. If you search this forum (and I am sure if you search the premier support section) for segfault (or seg fault) I am sure you will see a significant number of problem reports relating to temporaries. The point of the suggestion for more effort on eliminating the creation of the temporary is to reduce the number of problem report incidences for you as well as to reduce the number of users enduring the burden of resolving such problems.

Jim Dempsey

TimP · ‎10-08-2008

Jim,

I don't think we're at odds here. If a temporary allocation can be avoided in order to improve performance, it would also postpone (if lucky, indefinitely) accidental stack size overflows.

Jens_Henrik_Goebbert · ‎10-10-2008

so if I summariese: this is a compiler bug

I was expecting the fastest possible way (highly optimised) to copy data of ptr1 to ptr2. But because ptr1=ptr2 uses temp memory on the stack it is not save to use (segfault) and slow if ptr points to large arrays.

question: Is it faster to use do-loops to copy ptr-memory in cases where ptr1 and ptr2 are big arrays? Actually I wasn't expecting that.

do k=1, ...
do j=1, ...
do i=1, ...
ptr1(i,j,k) = ptr2(i,j,k)
enddo
enddo
enddo

Greetings
Jens

P.S: somehow my last posting got deleted - I could see it in the thread yesterday, but today it is gone ...

TimP · ‎10-10-2008

Quoting - Jens Henrik Goebbert

so if I summariese: this is a compiler bug

I was expecting the fastest possible way (highly optimised) to copy data of ptr1 to ptr2. But because ptr1=ptr2 uses temp memory on the stack it is not save to use (segfault) and slow if ptr points to large arrays.

question: Is it faster to use do-loops to copy ptr-memory in cases where ptr1 and ptr2 are big arrays? Actually I wasn't expecting that.

do k=1, ...
do j=1, ...
do i=1, ...
ptr1(i,j,k) = ptr2(i,j,k)
enddo
enddo
enddo

Greetings
Jens

P.S: somehow my last posting got deleted - I could see it in the thread yesterday, but today it is gone ...

Yes, the f77 code avoids the situations where ifort creates a temporary in order to perform a hidden double move. As others pointed out, there also are cases where ptr1 = ptr2 may be better optimized than ptr1(:,:,:) = ptr2(:,:,:), although ifort has improved in that respect.

Yes again, all posts in a certain period yesterday were deleted during a restore.

jimdempseyatthecove · ‎10-10-2008

Quoting - tim18

Yes, the f77 code avoids the situations where ifort creates a temporary in order to perform a hidden double move. As others pointed out, there also are cases where ptr1 = ptr2 may be better optimized than ptr1(:,:,:) = ptr2(:,:,:), although ifort has improved in that respect.

Yes again, all posts in a certain period yesterday were deleted during a restore.

If you can assert that the data elements of ptr1 and ptr2 are contiguous and non-overlapping then you can call the C function memcpy using the location of the 1st cell for each pointer and the size (number of bytes) to copy.

If you can assert that the data elements of ptr1 and ptr2 are contiguous but are uncertain if they overlap then you can call the C function memmove using the location of the 1st cell for each pointer and the size (number of bytes) to copy.

An alternate method using F90 would be to NOT use an interface and pass the reference of the first cell of each array to a subroutine that declares the dummy arguments as a single dimension array (and pass in the size). You may want one subrouting for non-overlapping and one for overlapping copy.

Jim Dempsey

Kevin_D_Intel · ‎10-10-2008

Idirected the test case and this discussion to our High-level optimization development team for their analysis and opinion on removing the stack temp creation and use. I will follow-up as I learn more. (Internal ref. CQ-50310)

Steven_L_Intel1 · ‎10-13-2008

Some comments.

First, if you use ALLOCATABLE rather than POINTER, the compiler will know there is no overlap and won't (usually) use a temp.

Second, the semantics of array assignment are not the same as the "F77 DO loop". Array assignment requires that the right side be completely evaluated before modifying the left side, which initially requires a temp. If the compiler can determine that there is no overlap, then the temp can be avoided. When using POINTER, it is more complicated than just checking overlap and copying in one direction or another as pointers can be non-contiguous. The compiler would have to generate run-time code to determine whether a temp is needed or not and then have code to do it both ways.

My advice is to use ALLOCATABLE unless you absolutely need the additional semantics of POINTER.

jimdempseyatthecove · ‎10-13-2008

Although ALLOCATABLE will fix the overlap and temporarry issue it will not give the user the flexability of a pointer. Two suggestions follow

allocate(mem3d(msize(1),msize(2),msize(3),2), stat=ierr)
! associating two 3d-pointer references with parts of allocated memory:
ptr1 = 1
ptr2 = 2
! fill ptr1 and ptr2 with some data
mem3d(:,:,:,ptr1) = 1
mem3d(:,:,:,ptr2) = 2
!loop over i,j,k
...
mem3d(i,j,k,ptr1) = mem3d(i,j,k,ptr2)
...
! use intrinsic function to copy data
mem3d(:,:,:,ptr1) = mem3d(:,:,:,ptr2)

--------------- Or soemthing like -----------------------

type Type_mem3d
real, allocatable, target :: mem3d(:,:,:)
end Type_mem3d

type(Type_mem3d), allocatable, target :: mem3ds(:)
type(Type_mem3d), pointer :: ptr1, ptr2
...
allocate(mem3ds(2), stat=ierr)
ptr1=>mem3ds(1)
ptr2=>mem3ds(2)
allocate(ptr1%mem3d(msize(1),msize(2),msize(3)), stat=ierr)
allocate(ptr2%mem3d(msize(1),msize(2),msize(3)), stat=ierr)
...
! fill ptr1 and ptr2 with some data
ptr1%mem3d(:,:,:) = 1
ptr2%mem3d(:,:,:) = 2
!loop over i,j,k
...
ptr1%mem3d(i,j,k) = ptr2%mem3d(i,j,k)
...
! use intrinsic function to copy data
ptr1%mem3d(:,:,:) = ptr2%mem3d(:,:,:)

And if a large amount of code is affecte, you may be able to use

#define ptr1 PTR1%mem3d
--------------- Or soemthing like -----------------------

type Type_mem3d
real, allocatable, target :: mem3d(:,:,:)
end Type_mem3d

type(Type_mem3d), allocatable, target :: mem3ds(:)
type(Type_mem3d), pointer :: PTR1, PTR2
...
allocate(mem3ds(2), stat=ierr)
PTR1=>mem3ds(1)
PTR2=>mem3ds(2)
allocate(ptr1(msize(1),msize(2),msize(3)), stat=ierr)
allocate(ptr2(msize(1),msize(2),msize(3)), stat=ierr)
...
! fill ptr1 and ptr2 with some data
ptr1(:,:,:) = 1
ptr2(:,:,:) = 2
!loop over i,j,k
...
ptr1(i,j,k) = ptr2(i,j,k)
...
! use intrinsic function to copy data
ptr1(:,:,:) = ptr2(:,:,:)

Jim Dempsey

Jens_Henrik_Goebbert · ‎10-13-2008

thanks for your help. My example was only written to show the bug - the real program is much longer and more complicated (and it works fine with xlf).

I have to think about your sample code tomorrow (in Germany its midnight) and see if it is easy to implement. So you think I can avoid the problem, if I somehow remove the one big allocation of memory, but stick to the pointer idea (?)

Some infos of the "real big" program
Dependent on the input-file the program has to allocate a different number of working-arrays. These working arrays are used in the rest of the program to solve different equations (using fast fourier transformations).

First using pointer makes it possible to deside on runtime how many working arrays have to be allocated. Secound it is much easier (and therefor less mistakes), if I can "rename" the arrays and make the (very long) mathematical code more readable when reusing one array for something totally different in some other routine.

I have never programed any compiler, but I thought it could be possible to set at runtime some non-contiguous/contiguous-flag to a pointer and check for overlap befor copying. It would be nice to know, how much faster a pure memcpy would be. That would speed up a lot of F90 programs :)

jimdempseyatthecove · ‎10-14-2008

Jens,

An alternate technique not listed in my prior post might look something like

type Type_Array3D
real, allocatable :: a(:,:,:)
end Type_Array3D

type(Type_Array3D), allocatable :: ptr(:)
...
allocate(ptr(NumberOfArrays))
do i=1,NumberOfArrays
allocate(ptr(i)%a(dim(1), dim(2), dim(3))
end do
...
ptr1 = 1
ptr2 = 2
ptr(ptr1)%a = ptr(ptr2)%a

--------------- OR something like --------

type Type_Array3D
real, allocatable, target:: a(:,:,:)
end Type_Array3D

type pType_Array3D
type(Type_Array3D), pointer :: p
end type pType_Array3D

type(pType_Array3D), allocatable :: ptr(:)
...
allocate(ptr(NumberOfArrays))
do i=1,NumberOfArrays
allocate(ptr(i)%p)
allocate(ptr(i)%p%a(dim(1), dim(2), dim(3))
end do
...
ptr1 = 1
ptr2 = 2
ptr(ptr1)%p%a = ptr(ptr2)%p%a

--------------------------

Yes the syntax looks uggly, butconsider

call Foo(ptr(ptr1)%p%a) ! orFoo(ptr(ptr1)%a)
...
subroutine Foo(a)
real :: a(:,:,:)
...

all the decorations go away.

When using the #define technique to hide the "ptr(ptr1)%p%a" you then have to accept that the debugger will not know how to perform the macro expansion so examinig variables/arrays will require a little more work (but not when passed as dummy argument).

If you make some effort to convert your code for parallel programming then you will likely fragment a large loop into multiple pieces that can be distributed to multiple threads. If you write to use calls to the fragment (as opposed to !$OMP PARALLEL SECTION) then the "ptr(ptr1)%p%a" is expressed as the dummy argument "a" and your original code becomes easier to read (potentially the same code as was before the ptr1, ptr2 conversion).

Jim Dempsey

jimdempseyatthecove · ‎10-14-2008

Quoting - Steve Lionel (Intel)

Some comments.

First, if you use ALLOCATABLE rather than POINTER, the compiler will know there is no overlap and won't (usually) use a temp.

Second, the semantics of array assignment are not the same as the "F77 DO loop". Array assignment requires that the right side be completely evaluated before modifying the left side, which initially requires a temp. If the compiler can determine that there is no overlap, then the temp can be avoided. When using POINTER, it is more complicated than just checking overlap and copying in one direction or another as pointers can be non-contiguous. The compiler would have to generate run-time code to determine whether a temp is needed or not and then have code to do it both ways.

My advice is to use ALLOCATABLE unless you absolutely need the additional semantics of POINTER.

Steve,

When looking at my suggestions to Jens as to how to work around the POINTER vs. ALLOCATABLE issue (to avoid possibility of stack temporary) you see there is a considerable amount of effort required. The POINTER is elegant, the ALLOCATABLE is uggly (array of pointers to allocatable arrays).

I would like to suggest something for you to pass on to the standards committee (it may already be suggested), that there be a new class of pointer attribute that either

a) declares and enforces a pointer => assignment to a contiguous area of an array (static or allocatable)
b) declaresa pointer as a contiguous area of an array (static or allocatable)

As to how this might be done

real, allocatable, target:: BigArray(:,:,:,:)
real, pointer, allocatable :: ArraySection(:,:,:)
...
ArraySection(:,:,:) => BigArray(:,:,:,N)

This does not require a new keyword.

Jim Dempsey

Steven_L_Intel1 · ‎10-14-2008

I don't understand why you say ALLOCATABLE is "ugly". While one cannot have an array of pointers nor of allocatables, you can have an array of derived type with either pointer or allocatable components. Allocatable is better as the semantics are much nicer, especially on assignment. It's POINTER that is ugly in my view. It used to be that people used POINTER because there were so many limitations on where ALLOCATABLE could be used, but that is no longer an issue as of F03 (and most current implementations including ifort.)

Allocatable arrays are always contiguous unless a subscript explicitly makes a reference non-contiguous. With pointers, you can't tell except at run-time. We do in fact have a bit in our array descriptor that is set when the compiler determines that the pointed-to storage is contiguous, but this is used only when passing a pointer as an argument to a routine expecting a contiguous array.

I will also repeat my past recommendation to omit the (:,:,:) if you mean the whole array.

I do not understand what your proposed syntax would add.

If you want "dynamic arrays", use ALLOCATABLE. If you want to be able to have pointers that can point to parts of other arrays, use POINTER.

jimdempseyatthecove · ‎10-14-2008

Quoting - Steve Lionel (Intel)

I don't understand why you say ALLOCATABLE is "ugly". While one cannot have an array of pointers nor of allocatables, you can have an array of derived type with either pointer or allocatable components. Allocatable is better as the semantics are much nicer, especially on assignment. It's POINTER that is ugly in my view. It used to be that people used POINTER because there were so many limitations on where ALLOCATABLE could be used, but that is no longer an issue as of F03 (and most current implementations including ifort.)

Allocatable arrays are always contiguous unless a subscript explicitly makes a reference non-contiguous. With pointers, you can't tell except at run-time. We do in fact have a bit in our array descriptor that is set when the compiler determines that the pointed-to storage is contiguous, but this is used only when passing a pointer as an argument to a routine expecting a contiguous array.

I will also repeat my past recommendation to omit the (:,:,:) if you mean the whole array.

I do not understand what your proposed syntax would add.

If you want "dynamic arrays", use ALLOCATABLE. If you want to be able to have pointers that can point to parts of other arrays, use POINTER.

If you look at reply #17 you will see why an array of user defined types containing an allocatable array or an array of pointers to user defined type containing an allocatable array gets Uggly in the code expressions used to reference the data (not uggly in the sense of the code generated).

In your Uggly pointer case you refer to ptr1(i,j,k) = ptr2(i,j,k) as being uggly due to either the amount of work the runtime has to make for overlapped copy test or uggly due to unnecessary stack temporary being created. However, the source code looks are not uggly, it is nice and clean.

In the array of user defined types containing an allocatable array or an array of pointers to user defined type containing an allocatable array situation the user sorce code looks uggly

ptr(ptr1)%p%a(i,j,k) = ptr(ptr2)%p%a(i,j,k)

My suggestion was in two parts

a) Introduce some means where a pointer can have an attribute that declaresit can only point to (NULL or) a contiguous subset of an array (static or dynamic)
b) A suggested way of doing this without introducing a new keyword.

If a pointer is known to reference a contiguous section of memory then the runtime system has a relatively simple test for overlapping (or can simply call memmove if the compiler writer doesn't want to insert the overlap test). The runtime test at the time of the => should be relatively trivial (assuming user using variables as opposed to (:,:,:,N) or whatever). In Jens' sample code there would be no requirement for a runtime check as the compiler could verify => as being to contiguous memory.

What Jens is doing is not all that unusual. He should have an optimal way of achieving his goal with minimal programming effort.

Jim Dempsey