Glynn_H_ · ‎07-16-2015

I'm running Intel Fortran 14 on Linux.

In my code I have three allocatable 2D arrays (one integer and the other two logical) that are declared in a module header, then allocated before calling subroutine X. The arrays are then initialised (to zero and false) at the start of X. Profiling tools like gprof and callgrind indicate that the time to initialise the arrays is significantly longer than the actual runtime of the calculation in X, which seems like nonsense given that the arrays are small (50x50). I don't see any performance issues with array initialisation elsewhere in the code.

Could anyone offer any suggestions as to what may be happening?

Glynn_H_ · ‎07-16-2015

Should have added that the array initialisation is simple, e.g.

A=0

B=.false.

FortranFan · ‎07-16-2015

What happens if you combine allocation and initialization via the sourced allocation facility in Fortran 2003:

allocate( A(n,m), source=0, stat=istat, ..)
..
allocate( B(n,m), source=.false., stat=istat, ..)

john_e · ‎07-16-2015

Could you show us the code and compile options for this section?

The fact that you say performance is not a problem for other initialisations in your code indicates something else may also be occurring.

john_e · ‎07-16-2015

I should also add that, in my experience, doing this

allocate( A(n), source=0)

is slower than this

allocate( A(n) )
A = 0

At least it was in the quick test I just ran to confirm this. The compiler is 15.0.3.187 on Linux. If speed is an issue, perhaps parallelise?
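A minimal sketch of how such a comparison could be timed, not the test that was actually run. The program name, the array size n, and the use of system_clock are assumptions made for illustration. Results will depend on compiler, flags and platform.

```fortran
program compare_init
  use iso_fortran_env, only: int64, real64
  implicit none
  integer, parameter :: n = 10000000   ! assumed size, not taken from the thread
  integer, allocatable :: a(:), b(:)
  integer(int64) :: t0, t1, t2, rate

  call system_clock(count_rate=rate)

  call system_clock(t0)
  allocate(a(n), source=0)             ! sourced allocation (Fortran 2003)
  call system_clock(t1)
  allocate(b(n))                       ! plain allocation ...
  b = 0                                ! ... followed by whole-array assignment
  call system_clock(t2)

  print '(a,f10.6,a)', 'sourced allocation : ', &
        real(t1 - t0, real64) / real(rate, real64), ' s'
  print '(a,f10.6,a)', 'allocate + assign  : ', &
        real(t2 - t1, real64) / real(rate, real64), ' s'

  ! Use both arrays so the initialisation cannot simply be discarded.
  print *, 'checksum:', sum(a) + sum(b)
end program compare_init
```

Both timed regions include the allocation itself, so each one also pays for the first write to freshly allocated memory.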

Again, in my experience, once your vector dimension exceeds about 15,000 and you have 4 CPUs, an OpenMP loop can do this faster. No idea why those values either.

jimdempseyatthecove · ‎07-17-2015

The "problem" with array initialization has to do with how memory is mapped. At process start, the virtual address space of the process may contain all the addresses it will use, but these addresses are not mapped until "first touched". That is, the first time a region of addresses is allocated to the process (at page granularity), it is not mapped to physical RAM and/or the page file until you first access the memory (typically with a write). First touch takes a relatively long time: the access faults to the O/S because the memory is not mapped; the O/S locates an available page in RAM, possibly swapping out something else; the O/S may optionally wipe the page to circumvent inter-process snooping; page file remapping may be required as well; and then control returns to the user application.
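The first-touch cost described above can be observed with a rough sketch like the following, which times the first and second write to the same freshly allocated array. The program name and the array size n are assumptions. An optimising compiler may merge or remove the writes, so a low optimisation level such as -O0 is probably the safer choice for this demonstration.

```fortran
program first_touch
  use iso_fortran_env, only: int64, real64
  implicit none
  integer, parameter :: n = 50000000   ! assumed size (about 200 MB of default integers)
  integer, allocatable :: a(:)
  integer(int64) :: t0, t1, t2, t3, rate

  call system_clock(count_rate=rate)
  allocate(a(n))                       ! reserves addresses; pages may not be mapped yet

  call system_clock(t0)
  a = 0                                ! first write: may page-fault on each new page
  call system_clock(t1)

  print *, 'a(n/2) after first write:', a(n/2)   ! make the first write observable

  call system_clock(t2)
  a = 1                                ! second write: pages are already mapped
  call system_clock(t3)

  print '(a,f10.6,a)', 'first write  : ', &
        real(t1 - t0, real64) / real(rate, real64), ' s'
  print '(a,f10.6,a)', 'second write : ', &
        real(t3 - t2, real64) / real(rate, real64), ' s'
  print *, 'checksum:', sum(a)
end program first_touch
```

If first touch dominates, the first write should take noticeably longer than the second, although the size of the gap depends on the O/S and the page size in use.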

Take a look at the early part of the video (right side) on http://www.lotsofcores.com/. The initial combing effect of the display reflects the first touch overhead. On the left side the effect is there but not visually apparent (other than as a time lag).

Jim Dempsey

TimP · ‎07-17-2015

As the last two responses suggest, with a large array the OpenMP default schedule should show an advantage on a multi-CPU NUMA platform, particularly if subsequent use of the array is scheduled consistently with memory locality. Intel compilers may engage opt-streaming-stores auto if it appears appropriate.
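A minimal sketch of that idea, assuming an explicit schedule(static) rather than whatever the implementation's default schedule happens to be. The program name, the array size n and the trivial follow-up loop are illustrative only. It needs OpenMP enabled at compile time (for example -qopenmp with recent Intel compilers or -fopenmp with gfortran); without it the directives are treated as comments and the loops run serially.

```fortran
program omp_init
  use iso_fortran_env, only: int64
  implicit none
  integer, parameter :: n = 20000000   ! assumed size
  integer, allocatable :: a(:)
  integer :: i
  integer(int64) :: total

  allocate(a(n))

  ! Each thread first-touches its own contiguous chunk of the array.
  !$omp parallel do schedule(static)
  do i = 1, n
     a(i) = 0
  end do
  !$omp end parallel do

  ! Later work uses the same static schedule, so each thread
  ! revisits the elements it initialised.
  total = 0
  !$omp parallel do schedule(static) reduction(+:total)
  do i = 1, n
     a(i) = a(i) + 1
     total = total + a(i)
  end do
  !$omp end parallel do

  print *, 'total =', total
end program omp_init
```

For arrays as small as the 50x50 case in the original question, the cost of starting a parallel region is likely to outweigh any gain, so this only makes sense for large arrays.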

Slow Array Initialisation