I've been trying to understand why an OpenMP program doesn't give me the speed-up I expect with multiple processors, and I've written a test program that demonstrates what seems to me to be a surprising consequence of using allocatable arrays. Note that this test code could easily be improved, but it serves as an example.
Build with: ifort /Qopenmp speedtest.f90
I've run 4 cases, setting the number of processors used, Mnodes, to 1 or 3, and using either static or allocatable arrays for c() and csum() in subroutine par_tester(). (Switch to allocatable by uncommenting a few lines and commenting out one line.)
The time results (secs) I get on my Intel quad-core are as follows:
              Mnodes=1    Mnodes=3
static        8.1         3.7
allocatable   8.6         7.6 - 10.9
The allocatable/Mnodes=3 times are the min and max of 10 trials, the other times were pretty repeatable.
Two surprising things here: (1) with allocatable arrays the program can actually run more slowly with 3 processors, even though my work-sharing clearly works with static arrays, and (2) the execution time varies widely when allocatable arrays are used with 3 processors.
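(The original speedtest.f90 is not attached to the post; the following is a minimal sketch of the pattern described above. The array sizes, loop counts, and the work done inside the loop are illustrative assumptions, not the original code.)

! Illustrative sketch only -- sizes, counts, and the work inside the
! loop are assumptions; the real speedtest.f90 is not shown in the thread.
subroutine par_tester(Mnodes)
    use omp_lib
    implicit none
    integer, intent(in) :: Mnodes
    integer :: i, j
    real :: total
    real, allocatable :: c(:), csum(:)   ! allocatable variant
    !real :: c(1000), csum(1000)         ! static variant: use this line
                                         ! instead, and drop the
                                         ! allocate/deallocate below
    call omp_set_num_threads(Mnodes)
    total = 0.0
    !$omp parallel do private(j, c, csum) reduction(+:total)
    do i = 1, 100000
        allocate (c(1000), csum(1000))   ! heap traffic on every iteration
        do j = 1, 1000
            c(j) = real(i + j)
        end do
        csum(1) = c(1)
        do j = 2, 1000
            csum(j) = csum(j-1) + c(j)   ! running sum over c()
        end do
        total = total + csum(1000)       ! keep the work from being optimized away
        deallocate (c, csum)
    end do
    !$omp end parallel do
    print *, total
end subroutine par_tester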
4 Replies
If you are testing to see whether you get super-linear speedup on allocate and deallocate with threading, I'm not surprised that you don't. Apparently, you perform a similar operation in each thread, so it is effectively not parallelized, and Amdahl's law catches up with you. So does the overhead of forking and joining when there is little work per thread beyond allocate/deallocate, even if that overhead is as little as 10% for one thread. I didn't observe as much variation, and your threaded allocatable case always ran under 6 seconds on my Core 2 laptop, running on Linux. You didn't say anything about Thread Checker or openmp_profile results.
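(For scale, a back-of-envelope using Amdahl's law, with illustrative numbers rather than anything measured from this program: speedup S = 1 / ((1 - p) + p/N), where p is the parallelizable fraction and N the thread count. If serialized allocation were half the runtime, then p = 0.5 and N = 3 give S = 1 / (0.5 + 0.5/3) = 1.5, so three threads could never beat 1.5x, and fork/join overhead eats into even that.)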
Quoting - gib
I've been trying to understand why an OpenMP program doesn't give me the speed-up I expect with multiple processors, and I've written a test program that demonstrates what seems to me to be a surprising consequence of using allocatable arrays. Note that this test code could easily be improved, but it serves as an example.
Build with: ifort /Qopenmp speedtest.f90
I've run 4 cases, setting the number of processors used, Mnodes, to 1 or 3, and using either static or allocatable arrays for c() and csum() in subroutine par_tester(). (Switch to allocatable by uncommenting a few lines and commenting out one line.)
The time results (secs) I get on my Intel quad-core are as follows:
              Mnodes=1    Mnodes=3
static        8.1         3.7
allocatable   8.6         7.6 - 10.9
The allocatable/Mnodes=3 times are the min and max of 10 trials, the other times were pretty repeatable.
Two surprising things here: (1) with allocatable arrays the program can actually run more slowly with 3 processors, even though my work-sharing clearly works with static arrays, and (2) the execution time varies widely when allocatable arrays are used with 3 processors.
The standard allocation routine, when called in a multi-threaded application, uses a critical section and thus serializes access to the allocator. The allocation portion of a multi-threaded application therefore consumes the wall-clock time of the allocation portion of the serial application PLUS the entry and exit of the critical section PLUS any cache line evictions. Recommendations:
- Code to avoid allocations as much as possible.
- If you have objects of a common type or size that get allocated and deallocated many times, consider managing a thread-safe linked list of previously allocated objects. With a little extra effort you can add code to maintain a thread-private free list of these objects as well as a global list. The threadprivate list can be managed without Interlocked function calls, but the global list will require them. Place defensive code in the threadprivate section so that if the private pool exceeds a threshold, a chunk of it is returned to the global pool (this can be done in one Interlocked operation); see the sketch after this list.
- Be careful of the ABA problem.
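(An illustrative sketch of the thread-private half of this idea, simplified to a single reusable threadprivate scratch buffer per thread rather than a full free list with a global pool; the module and routine names are made up for the example.)

! Illustrative only: one reusable scratch array per thread, allocated on
! first use and grown on demand, so the heap lock is taken rarely instead
! of on every call.  A full implementation would add the free list and
! global pool described above.
module thread_pool_mod
    implicit none
    real, allocatable, save :: scratch(:)
    !$omp threadprivate(scratch)
contains
    subroutine ensure_scratch(n)
        integer, intent(in) :: n
        if (allocated(scratch)) then
            if (size(scratch) >= n) return   ! big enough: reuse it
            deallocate (scratch)             ! too small: grow below
        end if
        allocate (scratch(n))
    end subroutine ensure_scratch
end module thread_pool_mod

Each thread calls ensure_scratch(n) at the start of its work and then uses scratch(1:n) in place of a freshly allocated array, so ALLOCATE is reached only the first time a thread needs a larger buffer.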
Jim Dempsey
Quoting - jimdempseyatthecove
The standard allocation routine, when called in a multi-threaded application, uses a critical section and thus serializes access to the allocator. The allocation portion of a multi-threaded application therefore consumes the wall-clock time of the allocation portion of the serial application PLUS the entry and exit of the critical section PLUS any cache line evictions. Recommendations:
- Code to avoid allocations as much as possible.
- If you have objects of a common type or size that get allocated and deallocated many times, consider managing a thread-safe linked list of previously allocated objects. With a little extra effort you can add code to maintain a thread-private free list of these objects as well as a global list. The threadprivate list can be managed without Interlocked function calls, but the global list will require them. Place defensive code in the threadprivate section so that if the private pool exceeds a threshold, a chunk of it is returned to the global pool (this can be done in one Interlocked operation).
- Be careful of the ABA problem.
Jim Dempsey
Thanks again.
Gib
Quoting - tim18
If you are testing to see whether you get super-linear speedup on allocate and deallocate with threading, I'm not surprised that you don't. Apparently, you perform a similar operation in each thread, so it is effectively not parallelized, and Amdahl's law catches up with you. So does the overhead of forking and joining when there is little work per thread beyond allocate/deallocate, even if that overhead is as little as 10% for one thread. I didn't observe as much variation, and your threaded allocatable case always ran under 6 seconds on my Core 2 laptop, running on Linux. You didn't say anything about Thread Checker or openmp_profile results.
