Hi all,
I'm investigating a problem that I could boil down to, essentially, array initialization, which does not seem to give me any speedup on Windows.
Consider the following piece of code:
integer, dimension(:), allocatable :: a
integer :: dim = 8000000
integer :: i

allocate(a(1:dim))

!$OMP PARALLEL DO
do i = 1, dim
  a(i) = 0
enddo
!$OMP END PARALLEL DO
I then get virtually no speedup out of the parallel region.
Is this intended, or am I doing something stupidly wrong?
I was thinking that NUMA behaves differently on Windows, making the first-touch policy known from Linux "less valid" there. However, I also observe the problem when limiting myself to a single socket.
Assuming you compiled with parallel code generation, you can likely hit the memory bandwidth limit with a single thread. Something like
a(i) = ExpressionWithSeveralOperationsHere
would likely not be memory-bandwidth limited, and would thus benefit from parallelization.
I forgot to mention: on first allocation, the virtual memory may be mapped to the page file but not necessarily to RAM. On a system with a first-touch policy, each time you first touch a page you incur the overhead of either mapping virtual memory to physical RAM (assuming some is available) or paging something else out and then mapping the virtual memory to physical RAM. In this scenario, parallel code may see some improvement.
Jim Dempsey
Dear Jim,
Thanks for your comments. In fact, you're right: the speedup on a system with a first-touch policy (i.e. Linux in my case) looks fine.
To be more precise, my speedups are like this:
Linux: Nearly perfect
Windows: A very small speedup on a single socket, absolutely no speedup as soon as two or more sockets are involved.
In some parts of the code, the initialization phase can thus become the most time-consuming part. Although I accept Amdahl here, I do not really understand why I see this different behaviour. Am I overlooking some detail?
Windows versions beginning with Windows 7 SP1 included sufficient affinity support for first touch. Of course, all of that is closed source, so we have to take their word for it, as well as observing the consequences. The effect on either Linux or Windows depends on setting affinities (at least OMP_PROC_BIND); Windows may require a larger setting of KMP_BLOCKTIME than Linux to maintain affinity.
Effective first touch doesn't speed up initialization; it speeds up later access with consistent affinity. If you don't set affinity, e.g. OMP_PROC_BIND=close, the chunks of the array will be scattered somewhat randomly during first touch, and of course not in the same way on Windows and Linux.
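For example, with the Intel OpenMP runtime, affinity could be pinned down with environment variables along these lines (the values here are illustrative, and KMP_BLOCKTIME is specific to the Intel runtime):

```shell
# Pin threads near the data they first touched, so later accesses stay local
export OMP_PROC_BIND=close   # keep threads close to the master's place
export OMP_PLACES=cores      # one place per physical core
export KMP_BLOCKTIME=200     # ms a thread spin-waits before sleeping (Intel runtime)
```

With these set, the thread that initializes a chunk of the array is more likely to be the one that later computes on it.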
Sebi,
Instead of looking at the scaling of speed-ups, which is faster?
IIF (if, and only if) Linux performs first touch page by page with a small page size, then each touch has a very high latency that is not memory-bandwidth bound. This will therefore "scale" well, although it is far slower than performing the allocation (and mapping) all at once. Note that some O/S page-file managers automatically wipe the (mapped) RAM backing a page-file page (to prevent you from snooping on stale data from a different process). The initialization to zeros may therefore be redundant, but your application cannot rely on this being so.
Next, IIF your system has a first-touch policy .AND. your application can take advantage of it .THEN. the initialization to 0 must be written very carefully. The thread partitioning of the initialization of the arrays must align with the thread partitioning of the later manipulation of the arrays. You likely require static scheduling and the same number of threads (and the same nest level, if nested parallel regions are used). Further, depending on array size and number of threads, it might not be advisable to use !$OMP PARALLEL DO for your initialization. Instead, it may be more beneficial to use !$OMP PARALLEL and have each thread determine which pages it benefits most from owning. IOW, PARALLEL DO will almost never partition the iteration space at page boundaries. You want the thread that "owns" most of the data within a page to perform the first touch.
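A minimal sketch of this idea, assuming 4 KiB pages and 4-byte integers (the names first_touch_init, page_ints, my_lo, and my_hi are illustrative, not from the original post). Each thread zeros a contiguous, page-aligned slice instead of letting PARALLEL DO split pages between threads:

```fortran
! Sketch: page-aligned first touch; assumes 4 KiB pages and default integers
subroutine first_touch_init(a, n)
  use omp_lib
  implicit none
  integer, intent(in)  :: n
  integer, intent(out) :: a(n)
  integer, parameter :: page_ints = 4096 / 4   ! array elements per 4 KiB page
  integer :: nthreads, tid, npages, lo_page, hi_page, my_lo, my_hi

  !$OMP PARALLEL PRIVATE(nthreads, tid, npages, lo_page, hi_page, my_lo, my_hi)
  nthreads = omp_get_num_threads()
  tid      = omp_get_thread_num()
  npages   = (n + page_ints - 1) / page_ints
  ! give each thread a contiguous run of whole pages
  lo_page  = (npages *  tid)      / nthreads
  hi_page  = (npages * (tid + 1)) / nthreads
  my_lo    = lo_page * page_ints + 1
  my_hi    = min(hi_page * page_ints, n)
  if (my_lo <= my_hi) a(my_lo:my_hi) = 0
  !$OMP END PARALLEL
end subroutine first_touch_init
```

The same page-aligned partitioning would then have to be reused in the compute loops so each thread works on the pages it first touched.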
Jim Dempsey
Thanks for your answers, Steve and Jim,
You're certainly right that first touch on Windows also benefits the later computation. However, because the initialization gets no speedup, it suddenly becomes the bottleneck in some of my routines, since the computations themselves speed up quite well.
In the overall code this is of course only visible to a small extent, but it is still noticeable compared to Linux.
To answer Jim's first point:
Sure, you're right that the parallel speedup is not all that matters. However, when comparing serial runs, Windows and Linux give me nearly the same performance. So the only difference is that Linux gives me a nearly linear speedup for the above initialization routine (hence the initialization does not become my routine's hotspot), while Windows doesn't scale.
For instance on 16 threads (two octacore Sandy Bridge processors, either scatter or compact pinning), Linux is roughly 10 to 11 times faster with the initialization than Windows, while on one thread their runtime is in the same range (Windows is about 10% slower in serial).
Sebi,
I think if you look into this further (in the event you are interested in doing this), the performance issue has to do with how the application interacts with the system as it manipulates the page file. An additional factor is the page size used by the virtual memory system: a smaller page size gives better granularity at the expense of more mapping overhead (roughly a 1000:1 ratio).
This is not necessarily an issue of Linux versus Windows, but rather of the programmer not being in the know.
See: Large Page Support
Also, if you are benchmarking the Debug builds, you are likely linking in the Debug C RTL heap manager which runs much slower as it is doing sanity checks and special signature checks and initializations.
Jim Dempsey
Recent Linux kernels support "transparent huge pages", which might be a performance enhancement in favor of Linux, as it requires neither elevated privilege nor source code modification. You could test whether this has accelerated your application by temporarily booting with it turned off via transparent_hugepage=never.
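On a running Linux system, the current transparent-huge-page mode can be inspected without rebooting (the sysfs path below is standard on recent kernels; the bracketed entry is the active mode):

```shell
# Show the current THP mode, e.g. "[always] madvise never"
cat /sys/kernel/mm/transparent_hugepage/enabled
```

Comparing initialization timings with the mode set to never versus always would show how much of the Linux advantage comes from THP.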
As the reference Jim provided indicates, on Windows it would require elevated privilege and specification in a VirtualAlloc call.
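A sketch of what that looks like on the Windows side, using the documented VirtualAlloc / GetLargePageMinimum API (this is Windows-only code and requires the "Lock pages in memory" right, SeLockMemoryPrivilege, to be granted to the user; the 8000000-element size mirrors the example array above):

```c
/* Sketch: allocating the array's backing store with large pages on Windows. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    SIZE_T large = GetLargePageMinimum();   /* 0 if large pages unsupported */
    if (large == 0) {
        printf("large pages not supported on this system\n");
        return 1;
    }

    SIZE_T bytes   = (SIZE_T)8000000 * sizeof(int);
    SIZE_T rounded = (bytes + large - 1) / large * large;  /* round up to page multiple */

    void *p = VirtualAlloc(NULL, rounded,
                           MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES,
                           PAGE_READWRITE);
    if (p == NULL) {
        printf("VirtualAlloc failed, error %lu "
               "(is the Lock Pages in Memory privilege granted?)\n",
               (unsigned long)GetLastError());
        return 1;
    }
    /* ... hand p to the Fortran code, e.g. via C_F_POINTER ... */
    VirtualFree(p, 0, MEM_RELEASE);
    return 0;
}
```

Note that MEM_LARGE_PAGES commits physical memory immediately, so the first-touch page-in cost discussed above disappears, at the price of losing demand paging for that region.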
OpenMP speedup for the initialization itself depends on the work being distributed evenly across memory controllers, so may be expected to require suitable affinity for reproducible performance. I suppose this may involve separate huge pages for NUMA memory locality, so there could be issues if that prevents dividing up the array evenly or if unable to allocate a huge page in one of the memory banks. This would be one of the reasons for running benchmarks immediately after reboot.
These differences between Windows and linux are outside the control of the OpenMP library and may be associated with differing points of view held by the operating system designers (and maybe the larger pool of people tinkering with linux to find attractive modifications).
On time critical applications, you typically want to perform the allocations, and first touch, at program start-up, and do it once. Thereafter you reuse the array, reinitializing it as necessary.
subroutine foo
  integer, dimension(:), allocatable, save :: a
  integer :: dim = 8000000
  integer :: i
  if (.not. allocated(a)) allocate(a(1:dim))
  !$OMP PARALLEL DO
  do i = 1, dim
    a(i) = 0
  enddo
  !$OMP END PARALLEL DO
end subroutine foo
The second time you call foo, the initialization will be fast(er).
Alternatively, initialize at program start:
module preallocated_data
  integer, dimension(:), allocatable :: a
  integer :: dim = 8000000
  ...
contains
  subroutine preallocated_data_init()
    allocate(a(1:dim))
    allocate(...)
  end subroutine preallocated_data_init
end module preallocated_data
...
program your_program
  use preallocated_data
  ...
  call preallocated_data_init
  ...
end program your_program
...
subroutine foo
  use preallocated_data
  integer :: i
  !$OMP PARALLEL DO
  do i = 1, dim
    a(i) = 0
  enddo
  !$OMP END PARALLEL DO
  ...
Jim Dempsey