I have a high-iteration-count outer loop that contains, at its very beginning, a read from a file that must be read sequentially. Inside this outer loop is another, low-iteration-count loop that performs calculations on the data read in. The read in question accesses a row of a sparse matrix; each row contains an unknown number of non-zero elements.
do i = 1, BigNumber
   read(1) NumInRecord, Where(1:NumInRecord), Array(Where(1:NumInRecord))
   !$omp parallel do
   do j = 1, SmallNumber
      ! ... lot of work on input data
   end do
   !$omp end parallel do
end do
I'm currently threading the inner loop, which produces some increase in efficiency, but VTune shows that the threading overhead is very large (since data has to be assigned anew for each iteration of the outer loop) and the work imbalance is also large (and probably unavoidable). The net result is relatively poor overall threading performance. I'm wondering if there is a way to thread the outer loop while still maintaining the necessary sequential access of the input file. I could make the file-read code an OpenMP critical section, but that would only prevent simultaneous access by multiple threads; it would not (I believe) force the records to be read in order.
Ideally, thread 1 would start, read the 1st record, and begin its work. Immediately after thread 1 has finished reading the file, thread 2 would start, read the next record, and begin its work, and similarly for the rest of the threads: a kind of cascading of thread work. Since the read is at the start of the outer loop, the "thread holdup" only has to last long enough for the data input; after that the threads can work on their input data.
BTW: BigNumber is usually so large that there is not enough memory to establish Array(BigNumber,BigNumber), which would, of course, allow the outer loop to be threaded.
Is there an algorithm or OpenMP feature that would provide what I'm looking for?
One solution might be to make the file a direct-access file. I have no idea whether this will help though - your program would jump all around the file and that probably has its own costs.
An alternative is to read large chunks of the file: let the outer loop be sequential and read chunks, as many records as practical, at once, and let the inner loop process the records in parallel.
Assuming the record data can be processed independent of other records
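A rough sketch of that chunked approach might look like the following. ChunkSize, NumRead, Num(:) and Wh(:,:) are illustrative names I've made up, not from the original code, and it assumes (as discussed) that the records in a chunk touch disjoint parts of Array:

! Hypothetical sketch of the chunked alternative: read a batch of
! records sequentially, then process the whole batch in parallel.
do iChunk = 1, (BigNumber + ChunkSize - 1) / ChunkSize
   NumRead = 0
   do k = 1, ChunkSize                       ! sequential reads
      if ((iChunk - 1) * ChunkSize + k > BigNumber) exit
      read(1) Num(k), Wh(1:Num(k), k), Array(Wh(1:Num(k), k))
      NumRead = NumRead + 1
   end do
   !$omp parallel do private(j)
   do k = 1, NumRead                         ! parallel processing
      do j = 1, SmallNumber
         ! ... lot of work, exclusive to Array(Wh(1:Num(k), k))
      end do
   end do
   !$omp end parallel do
end do

The trade-off is the extra storage for Wh(:,:), whose first extent must be sized for the largest expected NumInRecord.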
...
!$omp parallel private(NumInRecord, Where, j, ... any other privates) shared(Array)
!$omp master
do i = 1, BigNumber
   !$omp task
   !$omp critical(ReadRecord)
   read(1) NumInRecord, Where(1:NumInRecord), Array(Where(1:NumInRecord))
   !$omp end critical(ReadRecord)
   do j = 1, SmallNumber
      ! ... lot of work on input data exclusive to Array(Where(1:NumInRecord))
   end do
   !$omp end task
end do
!$omp end master
!$omp end parallel
Note, *** the above requires that each thread's read of data accesses .ONLY. the data just read. IOW there is absolutely no accessing of Array to prior record's data.
The conditions you've mentioned are satisfied: record data can be processed independent of other records and there is no accessing of Array other than for the data just obtained from the file.
BUT (sorry to be woolly-headed) I don't understand how you're using the OpenMP constructs. It's my understanding that a MASTER section means only the master thread does the work in that section. In your modification of the code, that encompasses both loops. So, does the TASK section mean that that code (and only that section of code) can be passed to other threads, even though it's within the MASTER section? In this arrangement, is the CRITICAL section performed only by the master thread? That is, which thread is reading the record from the file?
As you can probably tell, I'm confused. Thanks for any clarification you can offer.
The master thread in the above loop is enqueuing OpenMP tasks to the thread pool of the parallel region. These tasks can be executed by any, all, or even the master thread. While I used !$omp master for the task pump, you can also use !$omp single.
Think of the region of code between the !$omp task and !$omp end task as an unnamed contained procedure (with extension of private shared and other OpenMP clauses).
Generally, but not always, one of the threads of a thread pool issues the tasks. Think of OpenMP tasks as a different way of issuing a C++ thread on a lambda function (with captures). I think Intel lifted some code from their TBB to implement tasks (seeing that introduction of OpenMP tasking also brought in TBB's parallel allocator).
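A toy, stand-alone illustration of the task pump (not the original code) may help: the SINGLE construct ensures one thread creates the tasks, but any thread in the pool may pick one up and execute it, as the printed thread numbers show.

program task_demo
   use omp_lib
   implicit none
   integer :: i
   !$omp parallel
   !$omp single
   do i = 1, 8
      !$omp task firstprivate(i)
      write(*,'(a,i0,a,i0)') 'task ', i, ' run by thread ', omp_get_thread_num()
      !$omp end task
   end do
   !$omp end single
   !$omp end parallel
end program task_demo

Note the firstprivate(i): each task captures the value of the loop counter at the moment the task is created, much like a lambda capture.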
I prefer Arjen Markus' approach of dividing the large matrix into blocks that fit into the available memory pool you allocate.
This should be a simpler application of !$OMP, but requires a more complex "Where" structure and some idea of the maximum size of "NumInRecord" when determining how many rows to read for each block.
It should not be too much to expect the process that created the file to provide max(NumInRecord), or even the full list of NumInRecord values, to assist in generating a block structure, that is, the number of rows in each block.
Understanding Jim's approach would be useful for other applications of !$OMP.
You did not indicate if rows need to be processed sequentially, which could be a further restriction for the suggested solutions.
( I would recommend that you use a different file unit number, rather than "1". Anything > 10 would be preferable )
It would be difficult to use direct access due to the records effectively being variable length.
An alternative to consider
...
!$omp parallel private(NumInRecord, j, ... any other privates) shared(Array)
!$omp master
do i = 1, BigNumber
   !$omp task
   !$omp critical(criticalReadRecord)
   read(10) NumInRecord
   call ReadRecord(NumInRecord)
   !$omp end critical(criticalReadRecord)
   do j = 1, SmallNumber
      ! ... lot of work on input data exclusive to Array(Where(1:NumInRecord))
   end do
   !$omp end task
end do
!$omp end master
!$omp end parallel
...
contains

subroutine ReadRecord(NumInRecord)
   integer :: NumInRecord, Where(1:NumInRecord)   ! automatic array, on the stack
   read(10) Where(1:NumInRecord), Array(Where(1:NumInRecord))
end subroutine ReadRecord

or, using an allocatable (heap) array:

subroutine ReadRecord(NumInRecord)
   integer :: NumInRecord
   integer, allocatable :: Where(:)
   allocate(Where(NumInRecord))
   read(10) Where(1:NumInRecord), Array(Where(1:NumInRecord))
   ! Where is deallocated automatically on return
end subroutine ReadRecord
You could name each read record differently (ReadRecordSmall - stack, ReadRecordLarge - heap) and conditionally call either based on size of NumInRecord.
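A minimal sketch of that dispatch (StackLimit is an illustrative threshold I've invented; the test and call must stay inside the critical section so the two reads of the record remain paired):

!$omp critical(criticalReadRecord)
read(10) NumInRecord
if (NumInRecord <= StackLimit) then
   call ReadRecordSmall(NumInRecord)   ! automatic array on the stack
else
   call ReadRecordLarge(NumInRecord)   ! allocatable array on the heap
end if
!$omp end critical(criticalReadRecord)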