Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

Pointer assignment

Karthee_S_
Beginner

Hi,

I am profiling and optimising an application written in Fortran 2003, built with Intel ifort (IFORT) 16.0.0 20150815. I have two questions.

In my profile, the following function, which contains only a pointer assignment, takes more than 10 seconds (the whole application runs for 200 seconds). The time attributed to this function increases with the number of threads. The compiler seems to be doing more than just a pointer assignment.

My profile also prominently shows a function __intel_ssse3_rep_memcpy (or _wordcopy_fwd_aligned) when compiled with -O3 (or -O1). What is happening here?

function get_master_dofmap(self,cell) result(map)
  implicit none
  class(master_dofmap_type), target, intent(in) :: self
  integer,                           intent(in) :: cell
  integer, pointer                              :: map(:)

  map => self%dofmap(:,cell)
  return
end function get_master_dofmap

 

I also see that, for calls to procedures that take array arguments, a runtime warning reports that array temporaries are created.

forrtl: warning (406): fort: (1): In call to COORDINATE_JACOBIAN, an array temporary was created for argument #6

subroutine coordinate_jacobian(ndf, ngp_h, ngp_v, chi_1, chi_2, chi_3, diff_basis, jac, dj)
!-------------------------------------------------------------------------------
! Compute the Jacobian J^{i,j} = d chi_i / d \hat{chi_j} and the
! determinant det(J)
!-------------------------------------------------------------------------------

integer,          intent(in)  :: ndf, ngp_h, ngp_v
real(kind=r_def), intent(in)  :: chi_1(ndf), chi_2(ndf), chi_3(ndf)
real(kind=r_def), intent(in)  :: diff_basis(3,ndf,ngp_h,ngp_v)
real(kind=r_def), intent(out) :: jac(3,3,ngp_h,ngp_v)
real(kind=r_def), intent(out) :: dj(ngp_h,ngp_v)

call coordinate_jacobian( ndf, &
                              1,   &
                              1,   &
                              chi_cell(1,:), &
                              chi_cell(2,:), &
                              chi_cell(3,:), &
                              dgamma, &
                              jac, &
                              dj)


How can I avoid them?

8 Replies
Steven_L_Intel1
Employee

I very much doubt that it is the pointer assignment that is taking the time, but something else. It might be instructive to generate an assembly listing (-S) and look at the generated code for get_master_dofmap. Is the actual argument corresponding to "self" itself a class variable or is it a "type"? If the latter, the compiler has to construct a class descriptor and this is a lot of code.

In the case of the array temporary warning, usually the compiler generates code to test whether or not the actual argument is contiguous, and gives this warning only if it isn't. Since you didn't show us how it was called (at the point where the message was given), it's hard to speculate further. In some cases, pointer arrays can be replaced with allocatable arrays, if all you want is dynamic allocation, and this can help the compiler understand contiguity. Another option is to make the dummy arguments assumed-shape (:).
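To illustrate the assumed-shape suggestion, here is a minimal sketch. The module name jacobian_sketch_mod, the function sum_coords, and the choice of double precision for r_def are assumptions made for the example, not code from the application. Because the dummies are declared with (:) and the procedure lives in a module (so its interface is explicit), a strided slice such as chi_cell(1,:) can be passed by descriptor, and the compiler should not need a contiguous copy.

```fortran
module jacobian_sketch_mod
  implicit none
  ! Assumption: r_def is double precision; the thread does not say.
  integer, parameter :: r_def = kind(1.0d0)
contains
  ! Hypothetical routine with assumed-shape dummies. Each dummy carries
  ! its own stride, so a non-contiguous actual argument is acceptable.
  function sum_coords(chi_1, chi_2, chi_3) result(total)
    real(kind=r_def), intent(in) :: chi_1(:), chi_2(:), chi_3(:)
    real(kind=r_def)             :: total
    total = sum(chi_1) + sum(chi_2) + sum(chi_3)
  end function sum_coords
end module jacobian_sketch_mod

program assumed_shape_demo
  use jacobian_sketch_mod
  implicit none
  integer, parameter :: ndf = 4
  real(kind=r_def)   :: chi_cell(3, ndf)
  real(kind=r_def)   :: total
  integer            :: i

  ! Column i holds the value i, so every row sums to 1+2+3+4 = 10.
  do i = 1, ndf
    chi_cell(:, i) = real(i, r_def)
  end do

  ! Row slices are strided (stride 3); they are passed as descriptors.
  total = sum_coords(chi_cell(1,:), chi_cell(2,:), chi_cell(3,:))

  if (abs(total - 30.0_r_def) > 1.0e-12_r_def) error stop "unexpected sum"
  print *, "sum of coordinates:", total
end program assumed_shape_demo
```

With explicit-shape dummies such as chi_1(ndf), the same call would require the slices to be contiguous, which is where a temporary can come from.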

Karthee_S_
Beginner

Hi Steve,

Please find attached the assembly listing of the function. 

Just one more thing about __intel_ssse3_rep_memcpy (or _wordcopy_fwd_aligned): why does this function get called? Is it to do with unaligned arrays, or to prevent aliasing of pointers? I guess it has nothing to do with vectorisation, as it shows up in the profile even when I use -no-vec.

Thanks and Regards,

Karthee S

 

Steven_L_Intel1
Employee

There's no call to the memcpy or wordcopy routines in that assembly file. The get_master_dofmap function is simply a series of move instructions that set up the pointer descriptor - no calls at all.

The memcpy and wordcopy routines are copying memory. These are optimized versions that can take advantage of advanced instructions on Intel processors that support them. There's no connection I know of with unaligned accesses or alias prevention (which is entirely up to the programmer.)

At this point looking at small pieces of the code, out of context, is not going to be further enlightening.

TimP
Honored Contributor III

I don't think quoting an assembly listing of debug symbols sheds much light on this.

If a temporary array is created, in order to assure a contiguous copy, it may not be surprising if the compiler chooses a memcpy to make a copy of it before deallocation.  When the compiler chooses a memcpy, vectorization is inherent so it doesn't become eligible for auto-vectorization.  There are situations where !dir$ simd may be used to suppress allocation of a temporary with memcpy and thus bring auto-vectorization into play, but I don't see that you are showing such a case.

Karthee_S_
Beginner

Sorry for misleading you. I think these are two completely different issues. My guess was that the pointer assignment copies arrays, which may call __intel_ssse3_rep_memmove. I am using Intel VTune to get my results.

1. The get_master_dofmap calls don't scale with the number of threads. This function has only a pointer assignment and I am puzzled.

2. The full profile of my application shows calls to __intel_ssse3_rep_memmove that dominate the profile. These calls are always serialised (when I use more OpenMP threads). I cannot find this symbol in my assembly files, only _intel_fast_memmove.

Unfortunately I cannot share the full code, so I will try to set up simple examples and profile again.

TimP
Honored Contributor III

I suppose the fast_memmove may choose at run time to call ssse3_rep_memmove, depending on the platform etc.  I'm not surprised if it doesn't internally spawn multiple threads, if that's what you mean.  If the data movement involves going all the way to memory, a single thread with full width loads and stores may be able to deliver most of the performance on a single CPU, and spawning multiple threads might involve questions of memory and cache locality.

The difference between memmove and memcpy, as you would expect by the C library analogy, should be in whether it takes the time to check for an overlap between source and destination.  One would think that an implicitly generated temporary would not involve overlap.

If your function is allocating and deallocating memory, that may be enough to limit threaded scaling.

Steven_L_Intel1
Employee

You won't find __intel_ssse3_rep_memmove in your assembly - that is called by _intel_fast_memmove on Intel SSSE3-capable processors.

There's very little going on in get_master_dofmap. There is absolutely no data copying in that routine. In fact, it's so short that any references to it in profiling should be suspect.
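The claim that the pointer assignment copies nothing can be checked with a small self-contained sketch. The type definition below (a single allocatable dofmap component) and the module name dofmap_sketch_mod are assumptions, since the thread does not show master_dofmap_type. The program writes to the array after the pointer has been set and then reads the value back through the pointer, which would only work if the pointer aliases the original storage.

```fortran
module dofmap_sketch_mod
  implicit none
  ! Assumption: the real type has more components; only dofmap is needed here.
  type :: master_dofmap_type
    integer, allocatable :: dofmap(:,:)
  end type master_dofmap_type
contains
  ! Same shape as the function in the thread: one pointer assignment.
  function get_master_dofmap(self, cell) result(map)
    class(master_dofmap_type), target, intent(in) :: self
    integer,                           intent(in) :: cell
    integer, pointer                              :: map(:)
    map => self%dofmap(:, cell)
  end function get_master_dofmap
end module dofmap_sketch_mod

program pointer_alias_demo
  use dofmap_sketch_mod
  implicit none
  type(master_dofmap_type), target :: dm
  integer, pointer                 :: map(:)

  allocate(dm%dofmap(3, 2))
  dm%dofmap = 0

  ! Point at column 2; no array data is copied by this statement.
  map => get_master_dofmap(dm, 2)

  ! Change the underlying array after the pointer was set.
  dm%dofmap(1, 2) = 42

  ! The pointer sees the change, so it refers to the same storage.
  if (map(1) /= 42) error stop "pointer does not alias the array"
  if (.not. associated(map, dm%dofmap(:, 2))) error stop "not associated"
  print *, "map(1) =", map(1)
end program pointer_alias_demo
```

If a memmove does show up near such a call in a profile, it is more likely to come from what the caller does with the result, for example assigning it to a non-pointer array, than from the function itself.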

jimdempseyatthecove
Honored Contributor III

If the subroutine you call has an unknown interface, or if the interface is known .AND. the interface specifies (implicitly) the array slice is contiguous, then a copy of the array will be made with allocation (potentially) on heap which is serialized (allocate/deallocate have critical section). To avoid this, the called routine can specify the dummy as assumed shape (IOW with :'s) or as pointer with :'s, and this requires an interface to the subroutine/function. However, note, this will thwart vectorizing code using those arrays. (win some - lose some)
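As a rough way to see which slices fall into the copy case, here is a small sketch that assumes a compiler providing the Fortran 2008 is_contiguous intrinsic. The array chi_t with the transposed layout is hypothetical and is not something proposed in the thread; it is included only to contrast a contiguous slice with the strided one from the original call. The double-precision r_def is also an assumption.

```fortran
program contiguity_demo
  implicit none
  ! Assumption: r_def is double precision; the thread does not say.
  integer, parameter :: r_def = kind(1.0d0)
  integer, parameter :: ndf = 4
  real(kind=r_def)   :: chi_cell(3, ndf)  ! layout used in the thread
  real(kind=r_def)   :: chi_t(ndf, 3)     ! hypothetical transposed layout

  chi_cell = 0.0_r_def
  chi_t    = 0.0_r_def

  ! Row slice: consecutive elements are 3 reals apart in memory, so passing
  ! it to an explicit-shape dummy such as chi_1(ndf) needs a contiguous copy.
  print *, "chi_cell(1,:) contiguous? ", is_contiguous(chi_cell(1,:))

  ! Column slice: consecutive elements are adjacent in memory, so it can be
  ! passed to an explicit-shape dummy without a temporary.
  print *, "chi_t(:,1)    contiguous? ", is_contiguous(chi_t(:,1))
end program contiguity_demo
```

The first query should report false and the second true, which matches the warning being raised for the chi_cell(k,:) arguments.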

Jim Dempsey
