Hi,
I am working on profiling and optimising an application. The application is written in Fortran 2003 and I am building it with Intel ifort (IFORT) 16.0.0 20150815. I have two questions.
In my profile, the following function, which contains only a pointer assignment, takes more than 10 seconds (the whole application runs for 200 seconds). The time attributed to this function increases with the number of threads. The compiler seems to be doing more than just a pointer assignment.
My profile also prominently shows a function __intel_ssse3_rep_memcpy (or _wordcopy_fwd_aligned) when the code is compiled with the O3 (or O1) optimisation option. What is happening here?
function get_master_dofmap(self,cell) result(map)
  implicit none
  class(master_dofmap_type), target, intent(in) :: self
  integer, intent(in) :: cell
  integer, pointer :: map(:)
  map => self%dofmap(:,cell)
  return
end function get_master_dofmap
I also see that, for calls to functions that take pointers to arrays as arguments, a warning reports that temporary arrays are created.
forrtl: warning (406): fort: (1): In call to COORDINATE_JACOBIAN, an array temporary was created for argument #6
subroutine coordinate_jacobian(ndf, ngp_h, ngp_v, chi_1, chi_2, chi_3, diff_basis, jac, dj)
!-------------------------------------------------------------------------------
! Compute the Jacobian J^{i,j} = d chi_i / d \hat{chi_j} and the
! determinant det(J)
!-------------------------------------------------------------------------------
integer, intent(in) :: ndf, ngp_h, ngp_v
real(kind=r_def), intent(in) :: chi_1(ndf), chi_2(ndf), chi_3(ndf)
real(kind=r_def), intent(in) :: diff_basis(3,ndf,ngp_h,ngp_v)
real(kind=r_def), intent(out) :: jac(3,3,ngp_h,ngp_v)
real(kind=r_def), intent(out) :: dj(ngp_h,ngp_v)
call coordinate_jacobian( ndf, &
1, &
1, &
chi_cell(1,:), &
chi_cell(2,:), &
chi_cell(3,:), &
dgamma, &
jac, &
dj)
How can I avoid them?
I very much doubt that the pointer assignment is what is taking the time; it is more likely something else. It might be instructive to generate an assembly listing (-S) and look at the generated code for get_master_dofmap. Is the actual argument corresponding to "self" itself a class variable or is it a "type"? If the latter, the compiler has to construct a class descriptor and this is a lot of code.
In the case of the array temporary warning, usually the compiler generates code to test whether or not the actual argument is contiguous, and gives this warning only if it isn't. Since you didn't show us how it was called (at the point where the message was given), it's hard to speculate further. In some cases, pointer arrays can be replaced with allocatable arrays, if all you want is dynamic allocation, and this can help the compiler understand contiguity. Another option is to make the dummy arguments assumed-shape (:).
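The contiguity point can be sketched with a small self-contained program. This is only an illustration, not the original poster's code: the module `sum_mod`, the functions `sum_explicit` and `sum_assumed`, and the 3x4 size of `chi_cell` are made up for the example, and `is_contiguous` and `error stop` are Fortran 2008 features that I am assuming the compiler accepts.

```fortran
module sum_mod
  implicit none
contains
  ! Explicit-shape dummy: the callee assumes contiguous storage, so a strided
  ! actual argument may have to be copied into a temporary before the call.
  function sum_explicit(n, x) result(s)
    integer, intent(in) :: n
    real, intent(in) :: x(n)
    real :: s
    s = sum(x)
  end function sum_explicit

  ! Assumed-shape dummy: the stride travels in the array descriptor, so the
  ! callee can walk the original storage directly.
  function sum_assumed(x) result(s)
    real, intent(in) :: x(:)
    real :: s
    s = sum(x)
  end function sum_assumed
end module sum_mod

program contiguity_demo
  use sum_mod
  implicit none
  real :: chi_cell(3, 4)   ! hypothetical size, for illustration only
  integer :: i, j

  do j = 1, 4
    do i = 1, 3
      chi_cell(i, j) = real(10*i + j)
    end do
  end do

  ! Fortran arrays are column-major: a row slice is strided in memory,
  ! while a column slice is contiguous.
  print *, 'chi_cell(1,:) contiguous? ', is_contiguous(chi_cell(1, :))
  print *, 'chi_cell(:,1) contiguous? ', is_contiguous(chi_cell(:, 1))

  ! Both variants must give the same answer; only the argument passing differs.
  if (sum_explicit(4, chi_cell(1, :)) /= sum_assumed(chi_cell(1, :))) then
    error stop 'explicit-shape and assumed-shape results differ'
  end if
  print *, 'row sum = ', sum_assumed(chi_cell(1, :))
end program contiguity_demo
```

Putting the routine in a module gives the caller the explicit interface that an assumed-shape dummy requires. If the data layout can be changed, storing the coordinates so that the slice passed is a column, such as chi_cell(:,1), is another way to keep the actual argument contiguous.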
Hi Steve,
Please find attached the assembly listing of the function.
Just one more thing about __intel_ssse3_rep_memcpy (or _wordcopy_fwd_aligned): why does this function get called? Is it to do with unaligned arrays, or is it to prevent aliasing of pointers? I guess it has nothing to do with vectorisation, as it still shows up in the profile when I use -no-vec.
Thanks and Regards,
Karthee S
There's no call to the memcpy or wordcopy routines in that assembly file. The get_master_dofmap function is simply a series of move instructions that set up the pointer descriptor - no calls at all.
The memcpy and wordcopy routines are copying memory. These are optimized versions that can take advantage of advanced instructions on Intel processors that support them. There's no connection I know of with unaligned accesses or alias prevention (which is entirely up to the programmer.)
At this point looking at small pieces of the code, out of context, is not going to be further enlightening.
I don't think quoting an assembly listing of debug symbols sheds much light on this.
If a temporary array is created, in order to assure a contiguous copy, it may not be surprising if the compiler chooses a memcpy to make a copy of it before deallocation. When the compiler chooses a memcpy, vectorization is inherent so it doesn't become eligible for auto-vectorization. There are situations where !dir$ simd may be used to suppress allocation of a temporary with memcpy and thus bring auto-vectorization into play, but I don't see that you are showing such a case.
Sorry for misleading you. I think these are two completely different issues. My guess was that the pointer assignment copies arrays, which might call __intel_ssse3_rep_memmove. I am using Intel VTune to get my results.
1. The get_master_dofmap calls don't scale with the number of threads. This function contains only a pointer assignment, so I am puzzled.
2. The full profile of my application shows calls to __intel_ssse3_rep_memmove that dominate the profile. These calls are always serialised (when I use more OpenMP threads). I cannot find this symbol in my assembly files, only _intel_fast_memmove.
Unfortunately I cannot share the full code, so I will try to set up simple examples and profile them again.
I suppose the fast_memmove may choose at run time to call ssse3_rep_memmove, depending on the platform etc. I'm not surprised if it doesn't internally spawn multiple threads, if that's what you mean. If the data movement involves going all the way to memory, a single thread with full width loads and stores may be able to deliver most of the performance on a single CPU, and spawning multiple threads might involve questions of memory and cache locality.
The difference between memmove and memcpy, as you would expect by the C library analogy, should be in whether it takes the time to check for an overlap between source and destination. One would think that an implicitly generated temporary would not involve overlap.
If your function is allocating and deallocating memory, that may be enough to limit threaded scaling.
You won't find __intel_ssse3_rep_memmove in your assembly - that is called by _intel_fast_memmove on Intel SSSE3-capable processors.
There's very little going on in get_master_dofmap. There is absolutely no data copying in that routine. In fact, it's so short that any references to it in profiling should be suspect.
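To see that a pointer assignment of this kind only sets up a descriptor and does not copy data, here is a minimal sketch. The type definition is an assumption: the thread never shows master_dofmap_type, so the allocatable dofmap component, the type-bound procedure and the 4x3 size are invented for the example.

```fortran
module dofmap_mod
  implicit none
  ! Hypothetical stand-in for the poster's type; the real definition is not shown.
  type :: master_dofmap_type
    integer, allocatable :: dofmap(:, :)
  contains
    procedure :: get_master_dofmap
  end type master_dofmap_type
contains
  function get_master_dofmap(self, cell) result(map)
    class(master_dofmap_type), target, intent(in) :: self
    integer, intent(in) :: cell
    integer, pointer :: map(:)
    ! Pointer assignment: map becomes an alias of one column, nothing is copied.
    map => self%dofmap(:, cell)
  end function get_master_dofmap
end module dofmap_mod

program pointer_alias_demo
  use dofmap_mod
  implicit none
  type(master_dofmap_type), target :: m
  integer, pointer :: map(:)

  allocate(m%dofmap(4, 3))
  m%dofmap = 0

  map => m%get_master_dofmap(2)

  ! Change the underlying array after taking the pointer. If map were a copy,
  ! it would still hold the old value.
  m%dofmap(1, 2) = 42
  if (map(1) /= 42) error stop 'map is a copy, not an alias'
  if (.not. associated(map, m%dofmap(:, 2))) error stop 'map is not associated with column 2'
  print *, 'map aliases dofmap(:,2); no data was copied'
end program pointer_alias_demo
```

Note that the actual argument here is declared with type(...) rather than class(...), which is the case mentioned earlier in the thread where the compiler has to build a class descriptor at the call site.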
If the subroutine you call has an unknown (implicit) interface, or if the interface is known and it (implicitly) specifies that the array argument is contiguous, as an explicit-shape dummy does, then a copy of the array slice will be made. The temporary is (potentially) allocated on the heap, and heap allocation is serialized (allocate/deallocate have a critical section). To avoid this, the called routine can declare the dummy as assumed-shape (in other words, with :'s) or as a pointer with :'s; either requires an explicit interface to the subroutine/function. Note, however, that this will thwart vectorization of code using those arrays (win some, lose some).
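One way to check whether a copy is being made is to compare the address the callee sees with the address of the original data. The following is a rough sketch only: the module `addr_mod` and its two functions are hypothetical names, and whether a temporary is created is up to the compiler, so the printed values may differ between compilers and options.

```fortran
module addr_mod
  use, intrinsic :: iso_c_binding, only: c_loc, c_ptr, c_associated
  implicit none
contains
  ! Explicit-shape dummy: a strided actual argument is typically copied in.
  function first_addr_explicit(n, x) result(p)
    integer, intent(in) :: n
    real, intent(in), target :: x(n)
    type(c_ptr) :: p
    p = c_loc(x(1))
  end function first_addr_explicit

  ! Assumed-shape dummy: the descriptor carries the stride, so the callee
  ! normally works on the caller's storage.
  function first_addr_assumed(x) result(p)
    real, intent(in), target :: x(:)
    type(c_ptr) :: p
    p = c_loc(x(1))
  end function first_addr_assumed
end module addr_mod

program copy_in_demo
  use addr_mod   ! also makes c_loc, c_ptr and c_associated accessible
  implicit none
  real, target :: chi_cell(3, 4)   ! hypothetical size, for illustration only
  type(c_ptr) :: base
  logical :: same_explicit, same_assumed

  chi_cell = 0.0
  base = c_loc(chi_cell(1, 1))

  ! The addresses are only compared, never dereferenced, so it does not
  ! matter that a temporary may already have been freed.
  same_explicit = c_associated(base, first_addr_explicit(4, chi_cell(1, :)))
  same_assumed  = c_associated(base, first_addr_assumed(chi_cell(1, :)))

  print *, 'explicit-shape dummy sees the original storage: ', same_explicit
  print *, 'assumed-shape dummy sees the original storage:  ', same_assumed
end program copy_in_demo
```

If the explicit-shape call reports a different address, the argument was copied into a temporary, which is the situation that warning (406) reports at run time.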
Jim Dempsey