Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

[OpenMP] SpeedUp for array initialization

Sebi_G_
Beginner
1,888 Views

Hi all,

I'm investigating a problem that I could boil down to basically the array initialization, which does not really seem to give me any speedup on Windows.

Consider the following piece of code:

integer, dimension(:), allocatable :: a
integer :: dim=8000000
integer i

allocate(a(1:dim))

!$OMP PARALLEL DO
do i=1,dim
   a(i)=0
enddo
!$OMP END PARALLEL DO

With this, I get virtually no speedup from the parallel region.

Is this expected behaviour, or am I doing something wrong?

My first thought was that NUMA behaves differently on Windows, making the first-touch policy known from Linux "less valid" there. However, I also observe the problem when limiting myself to a single socket.

8 Replies
jimdempseyatthecove
Honored Contributor III

Assuming you compiled with Generate Parallel Code, you can likely saturate the memory bandwidth with one thread.

a(i) = ExpressionWithSeveralOperationsHere

would likely not be memory bandwidth limited, and would thus benefit from parallelization.

I forgot to mention: on first allocation, the virtual memory may be backed by the page file but not necessarily mapped to RAM. On a system with a first-touch policy, each time you first touch a page you incur the overhead of either mapping virtual memory to physical RAM (assuming some is available) or paging something else out first and then mapping. In this scenario, parallel code may see some improvement.
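The first-touch mapping described above can be sketched as follows. This is a minimal illustration, not code from the thread: it assumes a first-touch OS policy, pinned threads, and that later compute loops reuse the same static schedule.

```fortran
! Sketch: parallel first touch so each thread maps the pages it
! will later work on (assumes first-touch policy and pinned threads).
program first_touch_sketch
   use omp_lib
   implicit none
   integer, parameter :: n = 8000000
   integer, allocatable :: a(:)
   integer :: i

   allocate(a(n))    ! virtual pages reserved, not yet mapped to RAM

   ! First touch: with SCHEDULE(STATIC), thread t maps the pages
   ! backing its contiguous chunk of a, on that thread's NUMA node.
   !$omp parallel do schedule(static)
   do i = 1, n
      a(i) = 0
   end do
   !$omp end parallel do

   ! Later compute loops should use the same schedule so each
   ! thread revisits the chunk of pages it mapped.
   !$omp parallel do schedule(static)
   do i = 1, n
      a(i) = a(i) + 1
   end do
   !$omp end parallel do

   print *, sum(a)
end program first_touch_sketch
```

The key point is that the initialization loop and the compute loop use identical scheduling, so first touch places each page on the NUMA node of the thread that will use it.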

Jim Dempsey

Sebi_G_
Beginner

Dear Jim,

thanks for your comments. In fact, you're right: the speedup on a system with a first-touch policy (i.e. Linux in my case) looks fine.

To be more precise, my speedups are like this:
Linux: Nearly perfect
Windows: A very small speedup on a single socket, absolutely no speedup as soon as two or more sockets are involved.

In some parts of the code, the initialization phase can thus become the most time-consuming part. I accept Amdahl's law here, but I don't really understand why I see this different behaviour. Am I overlooking some detail?

TimP
Honored Contributor III

Windows versions beginning with Windows 7 SP1 include sufficient affinity support for first touch.  Of course, all that is closed source, so we have to take their word for it, as well as observe the consequences.  The effect on either linux or Windows depends on setting affinities (at least OMP_PROC_BIND); Windows may require a larger setting of KMP_BLOCKTIME than linux to maintain affinity.

Effective first touch doesn't speed up initialization; it speeds up later access with consistent affinity.  If you don't set affinity (e.g. OMP_PROC_BIND=close), the chunks of the array will be scattered somewhat randomly during first touch, and of course not in the same way on Windows and linux.
 

jimdempseyatthecove
Honored Contributor III

Sebi,

Instead of looking at the scaling of speed-ups, which is faster?

IIF (if and only if) Linux does first touch page by page, with a small page size, then each touch has a very high latency that is not memory bandwidth bound. Therefore, this will "scale" well although it is far slower than performing the allocation (and mapping) all at once. Note that some OS page managers automatically wipe the associated (mapped) RAM pages before handing them out (to prevent you from snooping on stale data from a different process). Therefore, the initialization to zeros may be redundant, but your application cannot rely on this being so.

Next, IIF your system has a first-touch policy .AND. your application can take advantage of it .THEN. the initialization to zero must be written very carefully. The thread partitioning of the initialization of the arrays must align with the thread partitioning used when manipulating the arrays. You likely require static scheduling and the same number of threads (and the same nest level, if nested parallel regions are used). Further, depending on array size and number of threads, it may not be advisable to use !$OMP PARALLEL DO for your initialization. Instead, it may be more beneficial to use !$OMP PARALLEL and then have each thread determine which pages it benefits most from owning. IOW, PARALLEL DO will almost never partition the iteration space at page boundaries. You want the thread that "owns" the most data within a page to perform its first touch.
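A page-aligned variant of this idea can be sketched as below. This is an illustrative assumption-laden sketch, not code from the thread: it assumes 4 KiB pages, 4-byte integers, and pinned threads, and it rounds each thread's chunk to whole pages so no page is first-touched by two threads.

```fortran
! Sketch: manual, page-aligned partitioning inside !$OMP PARALLEL,
! so each thread first-touches only whole pages it will own.
! Page size and element size are assumptions for illustration.
program page_aligned_init
   use omp_lib
   implicit none
   integer, parameter :: n = 8000000
   integer, parameter :: page_bytes = 4096            ! assumed page size
   integer, parameter :: elems_per_page = page_bytes / 4  ! 4-byte integers
   integer, allocatable :: a(:)
   integer :: pages, nthreads, tid, lo, hi, i

   allocate(a(n))
   pages = (n + elems_per_page - 1) / elems_per_page

   !$omp parallel private(nthreads, tid, lo, hi, i)
   nthreads = omp_get_num_threads()
   tid      = omp_get_thread_num()
   ! Give each thread a contiguous run of whole pages.
   lo = (tid * pages / nthreads) * elems_per_page + 1
   hi = min(((tid + 1) * pages / nthreads) * elems_per_page, n)
   do i = lo, hi
      a(i) = 0
   end do
   !$omp end parallel

   print *, sum(a)
end program page_aligned_init
```

Because the chunk boundaries fall on page boundaries, first-touch ownership of each page is unambiguous, unlike a plain PARALLEL DO whose chunk edges usually cut through pages.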

Jim Dempsey

Sebi_G_
Beginner

Thanks for your answers, Tim and Jim,

you're certainly right that first touch on Windows also benefits the later computation. However, because the initialization itself shows no speedup, for some of my routines it suddenly becomes the bottleneck, since the computations scale quite well.

In the overall code that's of course only visible to a small extent, however, it still is visible compared to Linux.

 

To answer Jim's first point:
Sure, you're right that the parallel speedup is not all that matters. However, when comparing serial runs, Windows and Linux give me nearly the same performance. So the only difference is that Linux gives me a nearly linear speedup for the above initialization routine, and hence the initialization does not become my routine's hotspot, while Windows doesn't scale.

For instance, on 16 threads (two octa-core Sandy Bridge processors, either scatter or compact pinning), Linux is roughly 10 to 11 times faster at the initialization than Windows, while on one thread their runtimes are in the same range (Windows is about 10% slower in serial).
 

jimdempseyatthecove
Honored Contributor III

Sebi,

I think if you look into this further (in the event you are interested in doing this), the performance issue has to do with how the application interacts with the system as it manipulates the page file. An additional factor is the page size used by the virtual memory system: smaller page size == better granularity, at the expense of more mapping overhead (1000:1 ratio).

This is not necessarily an issue of Linux versus Windows, but rather of the programmer not being in the know.

See: Large Page Support

Also, if you are benchmarking the Debug builds, you are likely linking in the Debug C RTL heap manager which runs much slower as it is doing sanity checks and special signature checks and initializations.

Jim Dempsey

TimP
Honored Contributor III

Recent linux supports "transparent huge pages", which might be a performance enhancement in favor of linux, as it requires neither elevated privilege nor source code modification.  You could test whether this has accelerated your application by temporarily booting with it turned off via transparent_hugepage=never.

As the reference Jim provided indicates, on Windows it would require elevated privilege and specification in a VirtualAlloc call.

OpenMP speedup for the initialization itself depends on the work being distributed evenly across memory controllers, so may be expected to require suitable affinity for reproducible performance.  I suppose this may involve separate huge pages for NUMA memory locality, so there could be issues if that prevents dividing up the array evenly or if unable to allocate a huge page in one of the memory banks.  This would be one of the reasons for running benchmarks immediately after reboot.

These differences between Windows and linux are outside the control of the OpenMP library and may be associated with differing points of view held by the operating system designers (and maybe the larger pool of people tinkering with linux to find attractive modifications).

jimdempseyatthecove
Honored Contributor III

In time-critical applications, you typically want to perform the allocations, and the first touch, once at program start-up. Thereafter you reuse the array, reinitializing it as necessary.

subroutine foo
integer, dimension(:), allocatable, SAVE :: a
integer :: dim=8000000
integer i

if(.not. allocated(a)) allocate(a(1:dim))

!$OMP PARALLEL DO
do i=1,dim
   a(i)=0
enddo
!$OMP END PARALLEL DO
end subroutine foo

The second time you call foo, the initialization will be fast(er).
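The effect of the SAVE-and-reuse pattern can be measured with a small harness like the following. This is a sketch I am adding for illustration (the program name and timing layout are my own); it simply times two consecutive calls to foo.

```fortran
! Sketch: time two consecutive calls to foo. The second call skips
! the allocation and writes to pages that are already mapped.
program reuse_timing
   use omp_lib
   implicit none
   double precision :: t0, t1, t2

   t0 = omp_get_wtime()
   call foo()                 ! allocates + first touch
   t1 = omp_get_wtime()
   call foo()                 ! pages already mapped
   t2 = omp_get_wtime()

   print '(a,f8.4,a)', 'first  call: ', t1 - t0, ' s'
   print '(a,f8.4,a)', 'second call: ', t2 - t1, ' s'
contains
   subroutine foo()
      integer, dimension(:), allocatable, save :: a
      integer, parameter :: n = 8000000
      integer :: i
      if (.not. allocated(a)) allocate(a(n))
      !$omp parallel do schedule(static)
      do i = 1, n
         a(i) = 0
      end do
      !$omp end parallel do
   end subroutine foo
end program reuse_timing
```

On a system where first touch dominates, the second call should be noticeably cheaper than the first.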

Alternatively, initialize at program start:

 

module preallocated_data
integer, dimension(:), allocatable :: a
integer :: dim=8000000
...
contains
subroutine preallocated_data_init()
   allocate(a(1:dim))
   allocate(...)
end subroutine preallocated_data_init
end module  preallocated_data
...
program your_program
  use preallocated_data
  ...
  call preallocated_data_init
  ...
end program your_program
...
subroutine foo
  use preallocated_data
  integer i

!$OMP PARALLEL DO
  do i=1,dim
     a(i)=0
  enddo
!$OMP END PARALLEL DO
...

Jim Dempsey
