Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

Serial processing time is the same as parallel

f_y_
Beginner

I'm new to OpenMP and am compiling Fortran code with the Intel Fortran Compiler. The code below takes the same computing time as the serial calculation. Why? /Qopenmp is set under Properties > Fortran > Language.

!$OMP PARALLEL SHARED(EMatrix, Gmatrix, NTHREADS, CHUNK) PRIVATE(i, TID, J, K)
      TID = OMP_GET_THREAD_NUM()
!      PRINT *, TID
      IF (TID .EQ. 0) THEN
        NTHREADS = OMP_GET_NUM_THREADS()
!        PRINT *, 'Starting matrix example with', NTHREADS, 'threads'
!        PRINT *, 'Initializing matrices'
      END IF
      call OMP_SET_NUM_THREADS(8)
!---------Initialize matrices
!$OMP DO SCHEDULE(STATIC, CHUNK)
      do 40 i = 1, 40
        do 20 J = 1, 100
          do 20 K = 1, 100
!            PRINT *, 'Thread', TID, 'did row', J
            EMatrix(J,K) = 1
            PRINT *, 'EMatrix ', J, K, EMatrix(J,K)
20      Continue
40    Continue
!----------End of parallel region
!$OMP END PARALLEL

 

2 Replies
TimP
Honored Contributor III

The way you have set this up, with all threads storing to the same cache lines (thus false sharing), parallelization is counter-productive even if the compiler agrees to implement it (e.g., what does /Qopt-report say?).

Also, the PRINT presumably serializes the threads; without it, the serial compilation might automatically correct your loop nesting. In any event, if a single thread can already saturate a memory controller, you should not expect a gain until you spread a much larger case over multiple controllers (multiple CPUs).
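To illustrate the false-sharing point, here is a minimal sketch (not from the original thread; array size and values are made up): the parallel loop runs over the outer column index, so each thread writes contiguous, disjoint columns of the column-major array rather than interleaving stores into the same cache lines.

```fortran
! Hypothetical sketch: parallelize over the column index K so each thread
! owns whole columns. In Fortran's column-major layout, a column is
! contiguous in memory, so threads do not share cache lines.
program fill_no_false_sharing
  use omp_lib
  implicit none
  integer, parameter :: N = 100
  real :: EMatrix(N, N)
  integer :: j, k

!$OMP PARALLEL DO PRIVATE(j) SCHEDULE(static)
  do k = 1, N            ! outer loop over columns (right index)
    do j = 1, N          ! inner loop walks down one column (left index)
      EMatrix(j, k) = real(j + k)  ! value the compiler cannot fold away
    end do
  end do
!$OMP END PARALLEL DO

  print *, 'checksum = ', sum(EMatrix)  ! use the result so the loops survive
end program fill_no_false_sharing
```

With static scheduling over K, each thread's writes land in its own block of columns; no PRINT inside the loop, so nothing serializes the threads.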

jimdempseyatthecove
Honored Contributor III

This might be more in line with your purpose:

    call OMP_SET_NUM_THREADS(8) ! effective on the next parallel region
!$OMP PARALLEL SHARED(EMatrix, Gmatrix, NTHREADS,CHUNK) PRIVATE(i,TID,J,K)
      TID = OMP_GET_THREAD_NUM()
      IF (TID .EQ. 0) THEN
        NTHREADS = OMP_GET_NUM_THREADS()
        tStart = omp_get_wtime() ! REAL(8)
      END IF
!---------Initialize matrices
do 40 i = 1,40 ! all threads 40x
!$OMP DO SCHEDULE(Static, CHUNK), COLLAPSE(2)
    do 20 K = 1,100 ! Outer loop - right index
      do 20 J=1,100 ! Inner loop - left index
        EMatrix(J,K) = TID + i ! For timing, use a value that the compiler cannot determine and optimize out   
 20   Continue
40  Continue 
   IF (TID .EQ. 0) tEnd = omp_get_wtime() ! REAL(8)
   !$OMP END PARALLEL
!----------End of parallel region
   print *,'Runtime = ', tEnd - tStart
   ! assure EMatrix looks like it is used
   ! else, compiler optimization may remove the entire loops
   if(sum(EMatrix) == 0) PRINT *,"Won't print"

Note, compiler optimization is very smart. If you produce results in a loop that are used neither later in the loop nor after it, the compiler will remove those statements. If the loop, after such removal, collapses to a null loop, the loop itself will be removed.

When the compiler can predetermine the results, it may substitute the results for the loop. In the above case, had you inserted a constant into EMatrix, the DO I, J, and K loops could have been removed and the results computed at compile time. In the code you posted, the results were never used, so the array EMatrix may have been eliminated entirely.
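Jim's dead-code caveat can be shown with a small timing sketch (hypothetical, not from the thread): the stored values depend on the loop indices, and the final `sum(A)` consumes the array. Remove that last PRINT and an optimizing compiler is free to delete the fill loops and report a near-zero time.

```fortran
! Hypothetical sketch: keep timed work alive under optimization.
program keep_work_alive
  use omp_lib
  implicit none
  integer, parameter :: N = 1000
  real(8) :: A(N, N), t0, t1
  integer :: j, k

  t0 = omp_get_wtime()
  do k = 1, N
    do j = 1, N
      A(j, k) = dble(j) * dble(k)   ! result depends on the loop indices
    end do
  end do
  t1 = omp_get_wtime()

  print *, 'fill time = ', t1 - t0
  print *, 'checksum  = ', sum(A)   ! consuming A prevents the compiler
                                    ! from removing the loops above
end program keep_work_alive
```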

Jim Dempsey
