Currently I am working on the parallelization of a Monte Carlo code for particle transport, written in FORTRAN 77, using OpenMP and the Intel Fortran Compiler (version 2017). I have successfully implemented the parallelization, and in several tests I generally obtain better performance (up to 30% faster) than the original parallelization scheme of the code, which relies on running several processes in parallel using a batch queuing system.
However, my problem is that when the particle information (energy, position, direction) is read from a file (each line of this file contains the information of one particle, so many read operations are done during the simulation), the performance of the OpenMP code drops severely and becomes similar to the performance obtained with the original parallel implementation. Therefore I am essentially losing the original advantage of the OpenMP implementation when a file is used as an input for the simulation.
I really do not know if I am using the best approach and therefore I would like to ask you for advice. My approach is that each thread opens the file with its own unit number, as shown in the following snippet:
C$OMP0PARALLEL DEFAULT(SHARED)
      omp_iam = OMP_GET_THREAD_NUM()
      omp_size = OMP_GET_NUM_THREADS()
      nnphsp = INT(omp_iam*nshist/omp_size)+1
      UNIT_PHSP = 44 + omp_iam
      OPEN(UNIT=UNIT_PHSP,FILE=NAME_PHSP,FORM='UNFORMATTED',
     *     ACCESS='DIRECT',RECL=PHSP_RECL,STATUS='OLD',
     *     IOSTAT=IERR_PHSP)
C$OMP0END PARALLEL
The file containing the particle information is divided among the threads through the nnphsp variable. Therefore no two threads read the same line during the simulation. Then, in order to read a line, the following code is used:
      READ(UNIT_PHSP,REC=nnphsp,IOSTAT=IERR_PHSP) latchi,ESHORT,
     *     X_PHSP_SHORT,Y_PHSP_SHORT,U_PHSP_SHORT,V_PHSP_SHORT,
     *     WT_PHSP_SHORT
I would like to know if there is a better approach than the one described above. A distinctive characteristic of this code is that, due to the stochastic nature of the Monte Carlo simulation, the threads do not access the file at the same time or in an ordered way. Each time a new particle is going to be simulated, its information is read from the file. Thanks for your help!
In general, individual small reads/writes tend to be slower than the same data grouped into fewer, larger reads/writes, so if you can read in a bunch of particles at once (contiguous records) instead of one-at-a-time, you should see some speed-up.
How large is the file and do you have enough memory that one copy can fit in core at a time? If it fits, you could read the whole file in. (One copy, for all threads to share.) The threads could access it as needed.
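A minimal sketch of the "read it once, share it" idea, in free-form Fortran for readability. Everything here is made up for illustration: the file name demo_phsp.bin, the record count nrec, and a record of three reals (nval) stand in for the real phase-space layout. It also assumes that an unformatted direct-access file has no record markers or padding, so it can be reopened with ACCESS='STREAM' and pulled in with a single READ; that holds for ifort and gfortran as far as I know, but the standard does not guarantee it.

```fortran
program read_whole_file
  implicit none
  integer, parameter :: nrec = 1000      ! hypothetical number of particles
  integer, parameter :: nval = 3         ! hypothetical values per record
  integer, parameter :: iu = 44
  real    :: phsp(nval, nrec)            ! one shared copy for all threads
  real    :: rec(nval), esum
  integer :: i, reclen, ierr

  ! Record length in whatever units this compiler uses for RECL
  inquire(iolength=reclen) rec

  ! Build a small demo file (stand-in for the real phase-space file)
  open(unit=iu, file='demo_phsp.bin', form='unformatted', access='direct', &
       recl=reclen, status='replace', iostat=ierr)
  if (ierr /= 0) stop 'cannot create demo file'
  do i = 1, nrec
     rec = (/ real(i), 0.5*real(i), -real(i) /)
     write(iu, rec=i) rec
  end do
  close(iu)

  ! One large read, done serially before any parallel region starts
  ! (assumes the direct-access file carries no record markers)
  open(unit=iu, file='demo_phsp.bin', form='unformatted', access='stream', &
       status='old', iostat=ierr)
  if (ierr /= 0) stop 'cannot open demo file'
  read(iu, iostat=ierr) phsp
  if (ierr /= 0) stop 'read error'
  close(iu, status='delete')

  ! Threads only read from the shared array; no file I/O in here
  esum = 0.0
  !$omp parallel do default(shared) private(i) reduction(+:esum)
  do i = 1, nrec
     esum = esum + phsp(1, i)            ! stand-in for simulating particle i
  end do
  !$omp end parallel do

  print *, 'sum of first field over all records =', esum
end program read_whole_file
```

For a file of around 1 GB the array would have to be ALLOCATABLE rather than a fixed-size local, and a record mixing integers and reals would need a matching derived type or separate arrays.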
How many threads are you running? In the extreme case, more files open => slower performance, but I don't know at what point this becomes apparent for different types of systems.
The read into one common buffer is a good choice when the read data is of reasonable size (TBD).
An alternate choice is to consider using OpenMP tasking to partition the work into a pipeline. One thread reads the file and passes the buffer on to a splitter task. The splitter task distributes (via task directives) portions of the buffer as different tasks (each would be the subset of the file dataset to be processed by an individual thread), then returns the buffer to the buffer pool.
When the input file is too large to conveniently fit in memory (or takes too long to have threads waiting for the file read), then the pipeline approach is more efficient.
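A rough sketch of the tasking idea, with a simplification that should be stated up front: instead of a buffer pool and a splitter task, each task gets its own copy of the chunk through FIRSTPRIVATE, so the reading thread can refill buf right away. That costs a copy per chunk and does not bound the number of outstanding buffers. The file name, the sizes nrec and chunk, and the routine process_chunk are all hypothetical.

```fortran
program task_pipeline
  implicit none
  integer, parameter :: nrec = 1000, nval = 3, chunk = 100, iu = 44
  real    :: buf(nval, chunk)            ! refilled by the reading thread
  real    :: rec(nval), esum
  integer :: i, first, n, reclen, ierr

  inquire(iolength=reclen) rec

  ! Build a small demo file (stand-in for the real phase-space file)
  open(unit=iu, file='demo_phsp.bin', form='unformatted', access='direct', &
       recl=reclen, status='replace', iostat=ierr)
  if (ierr /= 0) stop 'cannot create demo file'
  do i = 1, nrec
     rec = (/ real(i), 0.5*real(i), -real(i) /)
     write(iu, rec=i) rec
  end do
  close(iu)

  open(unit=iu, file='demo_phsp.bin', form='unformatted', access='direct', &
       recl=reclen, status='old', iostat=ierr)
  if (ierr /= 0) stop 'cannot open demo file'

  esum = 0.0
  !$omp parallel default(shared)
  !$omp single
  ! Only this one thread touches the file, in record order
  do first = 1, nrec, chunk
     n = min(chunk, nrec - first + 1)
     do i = 1, n
        read(iu, rec=first+i-1) buf(:, i)
     end do
     ! The task captures a private copy of buf and n when it is created
     !$omp task firstprivate(buf, n) shared(esum)
     call process_chunk(buf, n, esum)
     !$omp end task
  end do
  !$omp end single
  !$omp end parallel
  ! All tasks have finished at the barrier that ends the single construct

  close(iu, status='delete')
  print *, 'sum of first field over all records =', esum

contains

  subroutine process_chunk(b, nb, total)
    real,    intent(in)    :: b(:, :)
    integer, intent(in)    :: nb
    real,    intent(inout) :: total
    real    :: local
    integer :: j
    local = 0.0
    do j = 1, nb
       local = local + b(1, j)           ! stand-in for simulating particle j
    end do
    !$omp atomic
    total = total + local
  end subroutine process_chunk

end program task_pipeline
```

The other threads in the team pick up the tasks while the single thread keeps reading, which is where the overlap of I/O and processing comes from.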
From your code snip 2 in post #1, is it reasonable to assume that immediately following the READ (or at the end of the loop, or in DO loop control) you have nnphsp = nnphsp +1? IOW each thread is reading nshist/omp_size number of records starting at an offset.
If so, consider changing your code to use a READ of multiple records by the master thread, then divvy up those records by thread. As you currently have it written, you have multiple serialized reads to different areas of the file (excessive seek activity).
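Something along these lines, as a sketch only (free-form Fortran; the file name, nrec, chunk and the three-real record are made-up stand-ins). The master thread reads a block of contiguous records into a shared buffer, then the whole team works through that block, and the cycle repeats. Unlike a pipeline, the workers sit idle during each read.

```fortran
program chunked_master_read
  implicit none
  integer, parameter :: nrec = 1000, nval = 3, chunk = 128, iu = 44
  real    :: buf(nval, chunk)            ! shared buffer, refilled per block
  real    :: rec(nval), esum
  integer :: i, first, n, reclen, ierr

  inquire(iolength=reclen) rec

  ! Build a small demo file (stand-in for the real phase-space file)
  open(unit=iu, file='demo_phsp.bin', form='unformatted', access='direct', &
       recl=reclen, status='replace', iostat=ierr)
  if (ierr /= 0) stop 'cannot create demo file'
  do i = 1, nrec
     rec = (/ real(i), 0.5*real(i), -real(i) /)
     write(iu, rec=i) rec
  end do
  close(iu)

  open(unit=iu, file='demo_phsp.bin', form='unformatted', access='direct', &
       recl=reclen, status='old', iostat=ierr)
  if (ierr /= 0) stop 'cannot open demo file'

  esum = 0.0
  !$omp parallel default(shared) private(i, first, n)
  do first = 1, nrec, chunk              ! every thread runs this outer loop
     n = min(chunk, nrec - first + 1)

     !$omp master
     do i = 1, n                         ! contiguous records, one reader
        read(iu, rec=first+i-1) buf(:, i)
     end do
     !$omp end master
     !$omp barrier                       ! buf is complete past this point

     !$omp do reduction(+:esum)
     do i = 1, n
        esum = esum + buf(1, i)          ! stand-in for simulating a particle
     end do
     !$omp end do
     ! implicit barrier: nobody still uses buf when the master refills it
  end do
  !$omp end parallel

  close(iu, status='delete')
  print *, 'sum of first field over all records =', esum
end program chunked_master_read
```

The block size would need tuning by experiment; a SCHEDULE(DYNAMIC) clause on the inner loop may help if the time per particle varies a lot.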
Thanks to all for your opinions. The size of the file is around 1GB, so it would be possible to read the entire file into memory and then distribute the particles among the threads. The program is designed to use the entire file in the simulation.
Jim, as you said, immediately following the READ nnphsp is increased, and each thread reads nshist/omp_size records starting at an offset. I will try to have the master thread read the file and then divide the records among the threads.
Thanks for your help!
What is unknown to me is the ratio of the READ time to the processing time. If (when) the read time is a significant portion of the overall runtime, consider reading in manageable pieces and distributing work to the worker threads for each piece, then repeat for each piece. Make the interrelationship between the READing thread and the worker threads minimal. IOW when using one large buffer, incorporate a read progress indicator (worker threads wait for progress to indicate something to do, or quit).
When memory is a concern, consider using a double buffer technique. The reader thread reads into empty/finished buffers and indicates that the buffer has something to do. When the last thread finishes with a buffer, it flags the buffer as done. The double buffer technique can be extended to more buffers when read latency varies.
A formal parallel pipeline would have at least one more buffer than the number of worker threads. The system has a queue of empty/finished buffers. When the queue has something, the reader thread reads into the buffer, or indicates end of file in the buffer, then enqueues (or flags) the buffer as ready to be processed. Any idle worker thread can claim the buffer for processing. After the worker thread finishes processing the buffer, it enqueues (or flags) the buffer as free.
Depending on the amount of processing time per buffer (looks like small in your case), the total runtime estimate would be the file read time plus the processing of the last buffer.
Doing some tests, I have found the following. I ran my program in three cases: the default parallel program (reading + processing), the parallel program with just reading, and a serial version (OpenMP disabled) where the file is just read. The execution times are the following:
DEFAULT CASE : 2846.8 s
PARALLEL READING : 150.9 s
SERIAL READING : 25.9 s
So it seems that OpenMP bloats the I/O of the program, but I am not sure if the additional time finally affects the entire parallelization (it is around 6% of the total execution time in the parallel case). Should I try to improve the I/O of my program, or must I look for other causes of the program's overhead? Thanks for your help!
Please show or sketch your parallel reading.
For some problems, one (i.e. you) might construct the parallel reading by taking the entire file scope (record 1 : record n) and partitioning it in a linear fashion. IOW thread 0 reads 1:n/nThreads, thread 1 has the next chunk of n/nThreads records, etc... thus causing your "parallel" implementation to perform disk seeks between the threads' reads of portions of their chunks. This is an inefficient way to read the file.
What I suggested doing is entirely different. Instead you sequentially read the file by one thread in smaller chunks than n/nThreads. As each chunk is read, it is passed on to an available worker thread.
A list of buffers is allocated, where the buffer count is at least one greater than the number of worker threads. The size of each buffer is usually determined through testing to optimize performance. One of the threads, say the master thread, sequentially reads the input file into available buffers (and waits when none are available). As each buffer is read, the buffer is enqueued to the processing threads. When a processing thread completes, then depending on what you need to do next, the post-processing might:
a) simply mark the buffer as free (reader thread continues reading into the now available buffer)
b) enqueue the buffer to the writer thread's queue
c) enqueue the buffer into a collating queue, which sequences buffers in original input order to be written by the writer thread.
Note, the read thread and write thread typically will block during I/O by the O/S. You may wish to experiment with oversubscription.
At issue with creating such a system is how to construct the wait when the reader thread runs out of buffers and must wait. There are many ways of doing this. You could spinwait using SLEEPQQ with 0ms or 1ms, waiting for a "buffer done" flag; you could use OpenMP locks, or other means of synchronization.
I suggest that you experiment using OpenMP tasking to enqueue the buffers to the worker threads. This is relatively easy to construct. The "wait until buffer available" part is a little bit trickier to program, but should be within the scope of your skill level. You just need motivation to do it. If you can recover an additional 2 minutes per run, this might be worth it if you anticipate making 100's, 1000's of runs. Note, do not use OpenMP tasking to enqueue the finished buffers, because you have a specific thread performing the reads (similar situation with a writer thread should you have one).