I'm new to OpenMP and am compiling Fortran code with the Intel Fortran Compiler. The code below takes the same computing time as the serial calculation. Why? Note: /Qopenmp is set under Properties > Fortran > Language.
!$OMP PARALLEL SHARED(EMatrix, Gmatrix, NTHREADS,CHUNK) PRIVATE(i,TID,J,K)
  TID = OMP_GET_THREAD_NUM()
  ! PRINT *, TID
  IF (TID .EQ. 0) THEN
    NTHREADS = OMP_GET_NUM_THREADS()
    ! PRINT *, 'Starting matrix example with', 'NTHREADS','threads'
    ! PRINT *, 'Initializing matrices'
  END IF
  call OMP_SET_NUM_THREADS(8)
  !---------Initialize matrices
!$OMP DO SCHEDULE(Static, CHUNK)
  do 40 i = 1,40
    do 20 J = 1,100
      do 20 K=1,100
        ! PRINT *, 'Thread', TID, 'did row', J
        EMatrix(J,K)=1
        PRINT *, 'EMatrix ', (J,K), EMatrix(J,K)
20 Continue
40 Continue
  !----------End of parallel region
!$OMP END PARALLEL
The way you have set this up, with all threads storing to every cache line (hence false sharing), parallelization is counterproductive even if the compiler agrees to implement it (e.g., what does Qopt-report say?).
Also, the print presumably forces serialization, but without the print, the serial compilation might automatically correct your loop nesting. In any event, if a single thread can saturate a memory controller, you should not expect a gain until you spread a much larger case over multiple controllers (multiple CPUs).
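To illustrate the cache-line point, here is a minimal sketch (not the original poster's code) in which the parallel loop runs over the right index K, so each thread fills contiguous blocks of whole columns and cache-line sharing is limited to the boundaries between blocks. The program name, the integer array type, and the size n = 100 are assumptions made for illustration. At this size the work is probably too small to show any speedup.

```fortran
program column_blocks_sketch
  implicit none
  integer, parameter :: n = 100      ! assumed size, matching the 100x100 loops in the question
  integer :: ematrix(n, n)           ! hypothetical stand-in for EMatrix
  integer :: j, k

  ematrix = 0

  ! Distribute the column loop (right index K) across threads.
  ! With schedule(static) and no chunk size, each thread receives one
  ! contiguous block of columns, which are contiguous in memory in Fortran.
  !$omp parallel do schedule(static) private(j)
  do k = 1, n
    do j = 1, n                      ! inner loop walks down a column (left index)
      ematrix(j, k) = 1
    end do
  end do
  !$omp end parallel do

  ! Every element is written exactly once, so the sum must equal n*n.
  if (sum(ematrix) /= n * n) error stop 'unexpected checksum'
  print *, 'elements written:', sum(ematrix)
end program column_blocks_sketch
```

The sketch compiles with or without OpenMP enabled, because it uses only directives and no OpenMP library calls; without OpenMP the directives are treated as comments and the loop runs serially.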
This might be more in line with your purpose:
call OMP_SET_NUM_THREADS(8) ! effective on the next parallel region
!$OMP PARALLEL SHARED(EMatrix, Gmatrix, NTHREADS,CHUNK) PRIVATE(i,TID,J,K)
  TID = OMP_GET_THREAD_NUM()
  IF (TID .EQ. 0) THEN
    NTHREADS = OMP_GET_NUM_THREADS()
    tStart = omp_get_wtime() ! REAL(8)
  END IF
  !---------Initialize matrices
  do 40 i = 1,40 ! all threads 40x
!$OMP DO SCHEDULE(Static, CHUNK), COLLAPSE(2)
    do 20 K = 1,100 ! Outer loop - right index
      do 20 J=1,100 ! Inner loop - left index
        ! For timing, use a value that the compiler cannot determine and optimize out
        EMatrix(J,K) = TID + i
20 Continue
40 Continue
  IF (TID .EQ. 0) tEnd = omp_get_wtime() ! REAL(8)
!$OMP END PARALLEL
!----------End of parallel region
print *,'Runtime = ', tEnd - tStart
! assure EMatrix looks like it is used
! else, compiler optimization may remove the entire loops
if(sum(EMatrix) == 0) PRINT *,"Won't print"
Note that compiler optimization is very aggressive. If a loop produces results that are never used, either inside the loop or after it, the compiler will remove the statements that produce them. If the loop, after such removal, collapses to a null loop, the loop itself will be removed.
When the compiler can predetermine the results, it may substitute the results for the loop. In the case above, had you stored a constant into EMatrix, the DO I, J, and K loops could have been removed by the compiler and the results calculated at compile time. In the code you originally posted, the results were not used, so the array EMatrix may have been eliminated.
Jim Dempsey
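As a standalone illustration of the optimization caveats discussed in this thread, the following sketch times a parallel fill whose values depend on something known only at run time, then consumes the array so the loops cannot be discarded as dead code. It is a sketch under assumptions, not code from any poster: the program name, the array size n = 2000, the dp kind parameter, and the use of command_argument_count() as a run-time value are all chosen for illustration. It has to be built with OpenMP enabled (for example /Qopenmp, as in the question), because it uses omp_lib.

```fortran
program timing_sketch
  use omp_lib
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 2000            ! assumed size, large enough to time
  real(dp), allocatable :: a(:, :)
  real(dp) :: t_start, t_end, fill
  integer :: j, k

  allocate(a(n, n))

  ! A value the compiler cannot know at compile time (assumption: any
  ! run-time input would serve), so the fill cannot be precomputed.
  fill = real(command_argument_count() + 1, dp)

  t_start = omp_get_wtime()
  !$omp parallel do schedule(static) private(j)
  do k = 1, n                               ! outer loop: right index
    do j = 1, n                             ! inner loop: left index
      a(j, k) = fill + real(k, dp)
    end do
  end do
  !$omp end parallel do
  t_end = omp_get_wtime()

  print *, 'max threads =', omp_get_max_threads()
  print *, 'runtime (s) =', t_end - t_start
  ! Using the array after the loop keeps the stores from being eliminated.
  print *, 'checksum    =', sum(a)
end program timing_sketch
```

Running it with different thread counts (for instance by setting the OMP_NUM_THREADS environment variable) gives a rough view of scaling; a fill like this is memory-bound, so the gain may well be small.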