I have a single-thread simulation code that spends about 15% of its time writing output data. Many speedup tricks have been implemented (auto-parallelization, vectorization, etc., but not (yet!) OpenMP, BLAS, or similar), and my latest brilliant idea was to fork a child process to do the writing. Given that I have idle processors, it offered a theoretical speedup of 15% and an actual speedup of 13%. Not bad!
However, execution goes like this:
parallelizable stuff (60%); non-parallelizable stuff (25%); write output (15%); repeat.
So writing the data now happens in parallel with tasks that are themselves parallelizable, rather than during the part that is more or less non-parallelizable. Is there an easy way to signal my child process to wait until I say "go" to write its data to a file? After all, I have its process ID. That way I could parallelize the parallelizable stuff without losing a processor to the writer, and put the idle processors to good use during the non-parallelizable part. I would like the solution to work with ifort 11.x on both Linux and Mac. Also, is my now-idle child process going to get paged out to disk? That would be sad. Or is this kind of thinking just not the right way to go when converting single-thread code to scalable parallel code?
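For concreteness, here is the kind of gating I have in mind: give the pair a pipe before forking, and let the child block in read() until the parent sends a byte. This sketch is purely illustrative, not code from my program; it calls libc through ISO_C_BINDING and assumes ssize_t maps to c_long, which holds on LP64 Linux/Mac.

program fork_gate_sketch
   use iso_c_binding, only: c_int, c_long, c_size_t, c_char
   implicit none

   interface
      function c_fork() bind(c, name="fork")
         import :: c_int
         integer(c_int) :: c_fork
      end function c_fork
      function c_pipe(fds) bind(c, name="pipe")
         import :: c_int
         integer(c_int), intent(out) :: fds(2)
         integer(c_int) :: c_pipe
      end function c_pipe
      function c_read(fd, buf, n) bind(c, name="read")
         import :: c_int, c_long, c_size_t, c_char
         integer(c_int), value :: fd
         character(kind=c_char), intent(out) :: buf(*)
         integer(c_size_t), value :: n
         integer(c_long) :: c_read      ! ssize_t, assumed c_long here
      end function c_read
      function c_write(fd, buf, n) bind(c, name="write")
         import :: c_int, c_long, c_size_t, c_char
         integer(c_int), value :: fd
         character(kind=c_char), intent(in) :: buf(*)
         integer(c_size_t), value :: n
         integer(c_long) :: c_write     ! ssize_t, assumed c_long here
      end function c_write
   end interface

   integer(c_int) :: fds(2), pid
   character(kind=c_char) :: token(1)
   integer(c_long) :: nbytes

   if (c_pipe(fds) /= 0) stop 'pipe failed'
   pid = c_fork()

   if (pid == 0) then
      ! child: sleeps in the kernel (not spinning) until the parent says go
      nbytes = c_read(fds(1), token, 1_c_size_t)
      ! ... write the output file here ...
   else
      ! parent: do the non-parallelizable stuff first, then release the child
      ! ... non-parallelizable work here ...
      token(1) = c_char_'g'
      nbytes = c_write(fds(2), token, 1_c_size_t)
      ! ... parallelizable work continues; wait for the child before re-forking ...
   end if
end program fork_gate_sketch

One POSIX caveat I'm aware of: the parent should eventually wait() on each child, or finished writers accumulate as zombies.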
I realize there are other ways to accomplish speedup (ignore the problem; rewrite the "non-parallelizable stuff" so it is parallelizable; don't overwrite data from the previous timestep, and fork when the non-parallelizable tasks begin), but those require large rewrites I'm not ready to handle.
One other option I haven't tried is to write the output to a big variable in memory, then fork and let the child write it to a file when the non-parallelizable tasks start. How much speedup do you think that would get? I suspect a lot less than what I got. The output is one 4+ MB ASCII text file of 12-digit-precision real numbers per iteration. (Yeah, writing a binary file would be a lot better, wouldn't it? Not gonna happen for a couple of months.)
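Concretely, I'm imagining something like this sketch (the subroutine and the names BUF_LEN, buf, and pos are illustrative, not from my actual code; it assumes Unix-style text, where an embedded achar(10) acts as a line break):

subroutine buffer_then_write(gmv_file, x_lin, max_nodes)
   implicit none
   integer, intent(in) :: gmv_file, max_nodes
   real(8), intent(in) :: x_lin(max_nodes, 3)
   integer, parameter :: BUF_LEN = 8*1024*1024   ! comfortably > 4 MB of text
   character(len=BUF_LEN), save :: buf           ! static, keeps it off the stack
   integer :: pos, i, k

   pos = 1
   do k = 1, 3                                   ! the x line, y line, z line
      do i = 1, max_nodes
         write(buf(pos:pos+19), '(f20.12)') x_lin(i, k)   ! internal write: memory only
         pos = pos + 20
      end do
      buf(pos:pos) = achar(10)                   ! newline ends the line
      pos = pos + 1
   end do
   ! one big external write (this part would run in the forked child);
   ! the final newline comes from the write's own record terminator
   write(gmv_file, '(a)') buf(1:pos-2)
end subroutine buffer_then_write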
Any help would be appreciated!
- Fred
Asynchronous I/O? Whoa, I didn't know that existed! I'll give it a try! (We won't know how well it worked until I manage to max out all the processors.)
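For reference, the standard Fortran 2003 form looks roughly like this (a sketch, assuming ifort 11.x accepts the standard ASYNCHRONOUS syntax; the unit number, file name, and variable names are made up):

program async_write_sketch
   implicit none
   integer :: req_id, i
   real(8), asynchronous :: vals(10000)   ! may be "in flight" during the overlap

   vals = (/ (dble(i), i = 1, 10000) /)
   open(unit=20, file='out.txt', form='formatted', asynchronous='yes')

   ! one large request; id= lets us wait for this specific transfer later
   write(20, '(10000f20.12)', asynchronous='yes', id=req_id) vals

   ! ... computation can overlap the write here, as long as vals isn't touched ...

   wait(20, id=req_id)   ! block until that transfer has completed
   close(20)
end program async_write_sketch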
I have one follow-up question, though, which came up just now when I investigated asynchronous I/O: what defines a "small" write (which asynchronous I/O is bad at), versus a "large" write (which it's good at)?
My code looks like this (lots of "small" writes; see below). How can I convert it to "large" writes? I have complete control over the shape and arrangement of the data within the arrays that are written. (That is to say, if I knew what "large" writes looked like, I could probably rewrite the code to use them; one possible rewrite follows the code below.)
- Fred
! this writes x, y, and z values of data points in 3-space.
! all the x's come first, on one line.
! then all the y's, on a second line.
! then all the z's, on a third line.
! (note: the width-less '(f)' edit descriptor is an Intel extension,
! not standard Fortran; an explicit width like '(f20.12)' is portable.)
do i = 1, MAX_NODES ! roughly 100 to 10,000
   write(gmv_file,'(f)',advance="NO") x_lin(i,1)
enddo
write(gmv_file,*)
do i = 1, MAX_NODES
   write(gmv_file,'(f)',advance="NO") x_lin(i,2)
enddo
write(gmv_file,*)
do i = 1, MAX_NODES
   write(gmv_file,'(f)',advance="NO") x_lin(i,3)
enddo
write(gmv_file,*)
write(gmv_file,*)

! now write integers representing node numbers which
! form hexahedral elements (ie, are shaped like deformed cubes).
do i = 1, MAX_ELEMENTS ! roughly 100 to 10,000
   do j = 1, 8
      write(gmv_file,'(i8)', advance="NO") IEN_LIN(i,j)
   enddo
   write(gmv_file,*)
enddo
write(gmv_file,*)
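For comparison, here is one possible "large write" rewrite of the loops above (a sketch; the f20.12 width and the fmt variable are illustrative). Each write statement hands the runtime a whole array section in a single request instead of one number at a time:

! build a run-time format wide enough for a whole line of reals
character(len=32) :: fmt

write(fmt, '(a,i0,a)') '(', MAX_NODES, 'f20.12)'

write(gmv_file, fmt) x_lin(1:MAX_NODES, 1)   ! all the x's: one record
write(gmv_file, fmt) x_lin(1:MAX_NODES, 2)   ! all the y's: one record
write(gmv_file, fmt) x_lin(1:MAX_NODES, 3)   ! all the z's: one record
write(gmv_file, *)

! format reversion: when the 8i8 is exhausted, a new record begins,
! so each element's 8 node numbers land on their own line
write(gmv_file, '(8i8)') ((IEN_LIN(i, j), j = 1, 8), i = 1, MAX_ELEMENTS)
write(gmv_file, *)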
