Colleagues,
I'm attempting to upgrade a radiative transfer engineering application to take advantage of Coarray Fortran. (I'm using Steve's recently posted code to determine the number of physical cores, since hyperthreading slows this application down.) The code processes several thousand surfaces, and performance improves considerably if I estimate the workload of each surface and produce separate subsets of surfaces, one subset for each image to process. But predicting workload is very difficult, and the ratio of the greatest to least work between images remains as high as 2:1. A better distribution of work would reduce the overall execution time.
Ideally, I would like the images to run completely asynchronously, checking a constantly updated shared logical vector that records which surfaces have been processed. If each image begins at an appropriate offset into the list and flags each surface it takes as done, then workload would NOT have to be predicted in advance. The images would simply work their way through the list, each surface taking whatever time it requires, and each image updating (asynchronously) the logical vector. Something like the following:
AlreadyDone(1:10000)[1] = .false.
sync all
do i = NumCurrentImage,10000
   if( AlreadyDone(i)[1] ) cycle
   AlreadyDone(i)[1] = .true.
   .
   .
   (computational work done here)
   .
   .
end do
But the 1st image runs away with the process. The 2nd and subsequent images work on their own first surface, but when they check the vector all surfaces have been flagged as done. I take it that communication between images is not fast enough to be used in this way. The difficulty can be demonstrated with a bit of code that generates a log file of the work done by each image (flux is just a throw-away to give the loop something to do):
AlreadyDone(1:10000)[1] = .false.
open(888,file = 'LogOfSurfaceWork['//char(48+NumCurrentImage)//'].dat' )
sync all
do i = this_image(),10000
   write(888,*) i, AlreadyDone(i)[1]
   if( AlreadyDone(i)[1] ) cycle
   AlreadyDone(i)[1] = .true.
   flux(i) = log(float(NumCurrentImage*i*j))*NumCurrentImage
end do
close(888)
If, instead of always reading and writing only the 1st image's instance of AlreadyDone, I have each image check its local instance and update the instances on all the other images, then work is shared between images as I would expect. But the communication time completely swamps any advantage of having multiple images. Have I missed something? (Or done something stupid?) Is there a better way to do what I'm attempting?
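For reference, the test of AlreadyDone(i)[1] and the assignment that follows it are two separate remote operations, so nothing stops two images from both reading .false. for the same surface before either writes .true.. One way the "claim the next surface" idea might be expressed without that gap is a single shared counter advanced atomically. The following is only a minimal sketch, not something I have run in this application: it assumes a compiler that supports the Fortran 2018 atomic_fetch_add intrinsic (which may not be the case for the version listed below), and the names next_surface, claimed, n_surfaces and n_done are made up for illustration.

```fortran
program claim_surfaces
   use, intrinsic :: iso_fortran_env, only: atomic_int_kind
   implicit none
   integer, parameter :: n_surfaces = 10000       ! list length used in the post
   integer(atomic_int_kind) :: next_surface[*]    ! hypothetical shared counter; only image 1's copy is used
   integer(atomic_int_kind) :: claimed            ! index handed to this image by the counter
   integer :: i, n_done

   ! Image 1 sets its counter to the first surface; no image claims work before this is done.
   if (this_image() == 1) call atomic_define(next_surface, 1)
   sync all

   n_done = 0
   do
      ! Read image 1's counter and advance it by one in a single atomic step,
      ! so no two images can be handed the same index.
      call atomic_fetch_add(next_surface[1], 1, claimed)
      if (claimed > n_surfaces) exit
      i = int(claimed)
      ! (computational work for surface i goes here)
      n_done = n_done + 1
   end do

   write(*,*) 'image', this_image(), 'processed', n_done, 'surfaces'
end program claim_surfaces
```

Each image makes one remote atomic call per surface, so whether this pays off depends on how the cost of that call compares with the work per surface; that trade-off is untested here.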
I am running Win7, Studio 2010, with Intel® Inspector XE 2011 Update 10, (build 233717)
P.S. Formatting in the new Forum is a nightmare.