Intel® Fortran Compiler

Execution speed with OpenMP and allocatable arrays

gib
New Contributor II
I've been trying to understand why a program doesn't give me the speed increase I expect with multiple processors using OpenMP, and I've created a test program that demonstrates what seems to me to be a surprising consequence of using allocatable arrays. Note that this test code can easily be improved, but it does serve as an example.

Build with: ifort /Qopenmp speedtest.f90

I've run 4 cases, setting the number of processors used, Mnodes, equal to 1 or 3, and using static or allocatable arrays for c() and csum() in subroutine par_tester(). (Change to allocatable by uncommenting a few lines and commenting one line).
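Something along these lines (a minimal sketch only: the sizes, loop counts, and the work done on the arrays are made-up placeholders, not my actual code):

```fortran
! Minimal sketch of the test pattern (illustrative sizes and loop counts).
! For the static case, swap the commented declaration lines and also
! comment out the allocate/deallocate statements.
program speedtest
    implicit none
    call par_tester(3)                       ! Mnodes = 3; use 1 for the serial case
contains
    subroutine par_tester(Mnodes)
        use omp_lib
        implicit none
        integer, intent(in) :: Mnodes
        integer, parameter :: n = 100000
        integer :: i, k
        real :: total
        real, allocatable :: c(:), csum(:)   ! allocatable case
!       real :: c(n), csum(n)                ! static case: use this line instead

        call omp_set_num_threads(Mnodes)
        total = 0.0
!$omp parallel do private(c, csum, i) reduction(+:total)
        do k = 1, 20000
            allocate(c(n), csum(n))          ! repeated allocation inside the loop
            do i = 1, n
                c(i) = real(mod(k + i, 97))
            end do
            csum = c*c                       ! arbitrary work on the arrays
            total = total + sum(csum)
            deallocate(c, csum)
        end do
!$omp end parallel do
        print *, 'total =', total
    end subroutine par_tester
end program speedtest
```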

The time results (secs) I get on my Intel quad-core are as follows:

            Mnodes=1   Mnodes=3
static      8.1        3.7
allocatable 8.6        7.6 - 10.9

The allocatable/Mnodes=3 times are the min and max of 10 trials, the other times were pretty repeatable.
Two surprising things here: (1) with allocatable arrays the program can actually run more slowly with 3 processors, even though my work-sharing clearly works with static arrays, (2) the execution time varies widely when allocatable arrays are used with 3 processors.
TimP
Honored Contributor III
If you are testing to see whether you get super-linear speedup on allocate and deallocate with threading, I'm not surprised that you don't. Apparently you perform a similar operation in each thread, so it is effectively not parallelized, and Amdahl's law catches up with you. So does the overhead of forking and joining when there is little work per thread beyond allocate/deallocate, even if the overhead is as little as 10% for one thread. I didn't observe so much variation, and your threaded allocatable case always ran under 6 seconds on my Core 2 laptop running on Linux. You didn't say anything about Thread Checker or openmp_profile results.
jimdempseyatthecove
Honored Contributor III
Quoting - gib
I've been trying to understand why a program doesn't give me the speed increase I expect with multiple processors using OpenMP, and I've created a test program that demonstrates what seems to me to be a surprising consequence of using allocatable arrays. Note that this test code can easily be improved, but it does serve as an example.

Build with: ifort /Qopenmp speedtest.f90

I've run 4 cases, setting the number of processors used, Mnodes, equal to 1 or 3, and using static or allocatable arrays for c() and csum() in subroutine par_tester(). (Change to allocatable by uncommenting a few lines and commenting one line).

The time results (secs) I get on my Intel quad-core are as follows:

            Mnodes=1   Mnodes=3
static      8.1        3.7
allocatable 8.6        7.6 - 10.9

The allocatable/Mnodes=3 times are the min and max of 10 trials, the other times were pretty repeatable.
Two surprising things here: (1) with allocatable arrays the program can actually run more slowly with 3 processors, even though my work-sharing clearly works with static arrays, (2) the execution time varies widely when allocatable arrays are used with 3 processors.

The standard allocation routine, when called in a multi-threaded application, uses a critical section and thus serializes access to the allocation routine. The allocation portion of a multi-threaded application therefore consumes the wall-clock time of the allocation portion of the serial application PLUS the entry and exit of the critical section PLUS any cache line evictions. Recommendations:

Code to avoid allocations as much as possible.
If you have objects of a common type or size that get allocated and deleted many times, consider managing a thread-safe linked list of previously allocated objects. With a little extra effort you can add code to maintain a threadprivate free list of these previously allocated objects as well as a global list. The threadprivate list can be managed without Interlocked function calls, but the global list will require them. Place defensive code in the threadprivate section such that if a thread's private pool exceeds a threshold, a chunk of the private pool is returned to the global pool (this can be done in one Interlocked operation). A sketch of the threadprivate part follows below.
Be careful of the ABA problem.
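
A minimal sketch of just the threadprivate part (all names here are invented, and the global pool with its Interlocked transfers is omitted):

```fortran
! Each thread recycles buffers from its own pool, so the common path
! never touches the heap or any lock.
module buffer_pool
    implicit none
    integer, parameter :: POOL_MAX = 8
    type buf_t
        real, allocatable :: a(:)
    end type buf_t
    type(buf_t), save :: pool(POOL_MAX)
    integer, save :: npool = 0
!$omp threadprivate(pool, npool)
contains
    subroutine get_buf(a, n)                    ! replaces allocate(a(n))
        real, allocatable, intent(out) :: a(:)
        integer, intent(in) :: n
        if (npool > 0) then
            call move_alloc(pool(npool)%a, a)   ! reuse without a heap call
            npool = npool - 1
            if (size(a) /= n) then              ! size mismatch: fall back
                deallocate(a)
                allocate(a(n))
            end if
        else
            allocate(a(n))                      ! pool empty: real allocation
        end if
    end subroutine get_buf

    subroutine put_buf(a)                       ! replaces deallocate(a)
        real, allocatable, intent(inout) :: a(:)
        if (npool < POOL_MAX) then
            npool = npool + 1
            call move_alloc(a, pool(npool)%a)   ! recycle into the private pool
        else
            deallocate(a)                       ! pool full: really free it
        end if
    end subroutine put_buf
end module buffer_pool
```

Inside a parallel loop you would then call get_buf(c, n) and put_buf(c) in place of allocate and deallocate.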

Jim Dempsey

gib
New Contributor II

Quoting - jimdempseyatthecove
The standard allocation routine, when called in a multi-threaded application, uses a critical section and thus serializes access to the allocation routine. The allocation portion of a multi-threaded application therefore consumes the wall-clock time of the allocation portion of the serial application PLUS the entry and exit of the critical section PLUS any cache line evictions. Recommendations:

Code to avoid allocations as much as possible.
If you have objects of a common type or size that get allocated and deleted many times, consider managing a thread-safe linked list of previously allocated objects. With a little extra effort you can add code to maintain a threadprivate free list of these previously allocated objects as well as a global list. The threadprivate list can be managed without Interlocked function calls, but the global list will require them. Place defensive code in the threadprivate section such that if a thread's private pool exceeds a threshold, a chunk of the private pool is returned to the global pool (this can be done in one Interlocked operation).
Be careful of the ABA problem.

Jim Dempsey

Very interesting, and a trap for young players. There are ways I can avoid allocations, with only a small cost in program complexity - now that I know about the issue.

Thanks again.

Gib
gib
New Contributor II
Quoting - tim18
If you are testing to see whether you get super-linear speedup on allocate and deallocate with threading, I'm not surprised that you don't. Apparently you perform a similar operation in each thread, so it is effectively not parallelized, and Amdahl's law catches up with you. So does the overhead of forking and joining when there is little work per thread beyond allocate/deallocate, even if the overhead is as little as 10% for one thread. I didn't observe so much variation, and your threaded allocatable case always ran under 6 seconds on my Core 2 laptop running on Linux. You didn't say anything about Thread Checker or openmp_profile results.
Hmm. I'm using Windows XP. I wasn't hoping for super-linear, or even linear, speedup with the allocate/deallocate, far from it. I was just wondering what was causing my code to slow down when I used more than one thread, and I found that it was caused by the allocatables. If I increase the array sizes, changing the 1 to 5, the slowdown resulting from allocation increases to a factor of 5 (with Mnodes = 3). Now I know better than to do repetitive allocate/deallocate in parallel sections; the obvious fix is sketched below.
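
Hoisting the allocation out of the work-shared loop, so each thread allocates only once, would look roughly like this (same made-up sizes as the sketch in my first post):

```fortran
! Sketch: one allocation per thread, outside the work-shared loop,
! instead of one allocation per iteration (illustrative names and sizes).
subroutine par_tester_hoisted(Mnodes)
    use omp_lib
    implicit none
    integer, intent(in) :: Mnodes
    integer, parameter :: n = 100000
    integer :: i, k
    real :: total
    real, allocatable :: c(:), csum(:)

    call omp_set_num_threads(Mnodes)
    total = 0.0
!$omp parallel private(c, csum, i)
    allocate(c(n), csum(n))              ! one allocation per thread
!$omp do reduction(+:total)
    do k = 1, 20000
        do i = 1, n
            c(i) = real(mod(k + i, 97))
        end do
        csum = c*c
        total = total + sum(csum)
    end do
!$omp end do
    deallocate(c, csum)
!$omp end parallel
    print *, 'total =', total
end subroutine par_tester_hoisted
```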