Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Speedup problem using OpenMP in Intel Fortran

bohluly
Beginner

Dear all,

I have developed a program, and unfortunately it does not get the speedup I expected from OpenMP. The program is too big to post, so I wrote a small sample similar to it. Fortunately, this sample has the same problem.

I would appreciate hearing about your experience and any help you can offer.

Thanks,

I am using VS2010 and Intel Fortran XE 2011.

Program:

      IMPLICIT NONE
      TYPE var
         REAL(8), POINTER :: A, B, C
      END TYPE var
      REAL(8), POINTER :: A(:), B(:), C(:)
      TYPE(var), POINTER :: vars(:)
      TYPE(var), POINTER :: varOMP

      REAL(8) ai, bi, ci, di, ei, fi
      INTEGER(8) c1, c2, rate          ! 64-bit counters for SYSTEM_CLOCK
      INTEGER N, CHUNKSIZE, I, Itration
      PARAMETER (N=200)
      PARAMETER (CHUNKSIZE=10)

      ALLOCATE (A(N), B(N), C(N), vars(N))

!     initializations: vars(I) must point at element I of each array.
!     (Indexing with N made every pointer alias the last element, so
!     all threads read and wrote the same memory locations.)
      DO I = 1, N
         A(I)      =   I * 1.0D0
         B(I)      =   A(I)
         vars(I)%A =>  A(I)
         vars(I)%B =>  B(I)
         vars(I)%C =>  C(I)
         vars(I)%A = 0.51D0          ! D0: double-precision constants
         vars(I)%B = 0.45D0
      ENDDO

      CALL SYSTEM_CLOCK(c1, rate)
      DO Itration = 1, 1000000
!$OMP PARALLEL PRIVATE(I, varOMP, ai, bi, ci, di, ei, fi)
!$OMP DO SCHEDULE(STATIC, CHUNKSIZE)
      DO I = 1, N
         varOMP => vars(I)
         ai = varOMP%A
         bi = varOMP%B
         di = ai*2.2D0 + bi*2.0D0 + ai*2.1D0
         ei = ai*2.1D0 + bi*2.3D0 + ai*2.15D0
         fi = di*(ai + bi)*2.1D0 + ei*(di + bi)*2.1D0
         ci = bi*2.1D0 + ai*2.0D0 + di*2.0D0 + ei*2.0D0 + fi*2.0D0
         varOMP%C = ci
      ENDDO
!$OMP END DO
!$OMP END PARALLEL
      ENDDO

      CALL SYSTEM_CLOCK(c2)
      WRITE(*,*) 'Elapsed seconds:', DBLE(c2 - c1) / DBLE(rate)

      END
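Another source of overhead in this sample is that the PARALLEL region sits inside the 1,000,000-iteration loop, so the program enters and leaves a parallel region on every repetition and pays the fork/join synchronization each time (OpenMP runtimes usually reuse their threads, but the synchronization remains). A common alternative is to open the region once, outside the repetition loop, and keep only the worksharing DO inside it. The module below is a minimal sketch of that structure using plain arrays instead of per-element pointers; the module and routine names are hypothetical. Each repetition still ends with the implicit barrier of END DO, so with only 200 short iterations the threads may still spend much of their time synchronizing.

```fortran
module hoisted_region_sketch
  ! Hypothetical module: one parallel region for all repetitions,
  ! with the worksharing loop (and its barrier) inside it.
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  subroutine repeat_kernel(a, b, c, nrep)
    real(dp), intent(in)  :: a(:), b(:)
    real(dp), intent(out) :: c(:)
    integer, intent(in)   :: nrep
    real(dp) :: di, ei, fi
    integer :: i, rep
    ! Every thread runs the repetition loop; each pass shares out i.
    !$omp parallel private(rep, i, di, ei, fi)
    do rep = 1, nrep
      !$omp do schedule(static)
      do i = 1, size(a)
        di = a(i)*2.2_dp + b(i)*2.0_dp + a(i)*2.1_dp
        ei = a(i)*2.1_dp + b(i)*2.3_dp + a(i)*2.15_dp
        fi = di*(a(i) + b(i))*2.1_dp + ei*(di + b(i))*2.1_dp
        c(i) = b(i)*2.1_dp + a(i)*2.0_dp + di*2.0_dp + ei*2.0_dp + fi*2.0_dp
      end do
      ! Implicit barrier here: all threads finish this repetition
      ! before any thread starts the next one.
      !$omp end do
    end do
    !$omp end parallel
  end subroutine repeat_kernel
end module hoisted_region_sketch
```

With a(:) = 0.51 and b(:) = 0.45, every element of c comes out as about 74.6820975, whichever way the iterations are split among threads.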

bohluly
Beginner

Sorry, for my last compile I had changed !$OMP to !z$OMP to test the program without OpenMP. To run the sample with OpenMP, the directives must read !$OMP (without the z).
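The z trick works because of OpenMP's conditional-compilation rules. In free-form source, a line starting with the sentinel !$OMP is a directive, and a line starting with !$ followed by a blank is compiled as ordinary Fortran, but only when OpenMP is enabled (for example /Qopenmp in Intel Fortran on Windows, or -fopenmp in gfortran); otherwise both are comments. A line such as !z$OMP matches neither sentinel and is always a comment. The small module below (module and function names are hypothetical) uses both forms so that the same source builds with or without OpenMP.

```fortran
module sentinel_sketch
  ! The next line is compiled only when OpenMP is enabled.
  !$ use omp_lib
  implicit none
contains
  ! Number of threads in a parallel team; 1 when built without OpenMP.
  function team_size() result(n)
    integer :: n
    n = 1                      ! value kept when OpenMP is disabled
    !$omp parallel
    !$omp single
    !$ n = omp_get_num_threads()
    !$omp end single
    !$omp end parallel
  end function team_size
end module sentinel_sketch
```

Printing team_size() from a program built both ways is a quick check of whether the directives are actually active.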

TimP
Honored Contributor III

To see an advantage from threading, you will need an outer parallel loop with a trip count of around 1000, enclosing an inner vectorizable loop at least as long as the one shown. SYSTEM_CLOCK works better with 64-bit arguments.
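A minimal sketch of that structure, assuming the data can be stored as plain 2-D arrays rather than reached through per-element pointers: columns (for example 1000 of them) form the outer parallel loop, and rows (for example 200) form the inner loop, which has unit stride in Fortran's column-major layout and is therefore a vectorization candidate. The timing routine passes 64-bit integers to SYSTEM_CLOCK; compilers typically report a finer COUNT_RATE and a much larger COUNT_MAX for 64-bit arguments. The module and routine names are hypothetical.

```fortran
module outer_inner_sketch
  ! Hypothetical module: parallel outer loop over columns, vectorizable
  ! inner loop over rows, and 64-bit SYSTEM_CLOCK timing.
  use iso_fortran_env, only: int64
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  subroutine compute(a, b, c)
    real(dp), intent(in)  :: a(:, :), b(:, :)
    real(dp), intent(out) :: c(:, :)
    real(dp) :: di, ei, fi
    integer :: i, j
    ! Threads split the column loop; each thread writes whole,
    ! contiguous columns of c.
    !$omp parallel do private(i, di, ei, fi) schedule(static)
    do j = 1, size(a, 2)
      ! Unit stride in column-major storage: a SIMD candidate.
      do i = 1, size(a, 1)
        di = a(i, j)*2.2_dp + b(i, j)*2.0_dp + a(i, j)*2.1_dp
        ei = a(i, j)*2.1_dp + b(i, j)*2.3_dp + a(i, j)*2.15_dp
        fi = di*(a(i, j) + b(i, j))*2.1_dp + ei*(di + b(i, j))*2.1_dp
        c(i, j) = b(i, j)*2.1_dp + a(i, j)*2.0_dp + di*2.0_dp &
                  + ei*2.0_dp + fi*2.0_dp
      end do
    end do
    !$omp end parallel do
  end subroutine compute

  ! Wall-clock seconds for nrep calls of compute.
  subroutine time_compute(a, b, c, nrep, seconds)
    real(dp), intent(in)  :: a(:, :), b(:, :)
    real(dp), intent(out) :: c(:, :)
    integer, intent(in)   :: nrep
    real(dp), intent(out) :: seconds
    integer(int64) :: t0, t1, rate   ! 64-bit SYSTEM_CLOCK arguments
    integer :: rep
    call system_clock(t0, rate)
    do rep = 1, nrep
      call compute(a, b, c)
    end do
    call system_clock(t1)
    seconds = 0.0_dp
    ! A zero rate means no clock is available.
    if (rate > 0) seconds = real(t1 - t0, dp) / real(rate, dp)
  end subroutine time_compute
end module outer_inner_sketch
```

To compare, build once with OpenMP enabled and once without, and call time_compute on the same 200 x 1000 arrays in both builds.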

 

Threading can't compensate for inefficient data access. If multiple threads write to the same cache line (false sharing), the program will perform poorly.
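One common case of false sharing is per-thread accumulators stored next to each other in a shared array. The module below is a sketch that keeps each thread's partial sum in its own padded column, assuming 64-byte cache lines (typical of current x86 processors) and therefore 8 REAL(8) values per line; the module and function names are hypothetical.

```fortran
module padded_slots_sketch
  ! Hypothetical module: per-thread partial sums kept one assumed
  ! 64-byte cache line apart so no two threads write the same line.
  !$ use omp_lib
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  ! Assumption: 64-byte cache lines / 8-byte REAL(dp) = 8 values.
  integer, parameter :: pad = 8
contains
  function padded_sum(x) result(total)
    real(dp), intent(in) :: x(:)
    real(dp) :: total
    real(dp), allocatable :: partial(:, :)
    integer :: nthreads, tid, i
    nthreads = 1
    !$ nthreads = omp_get_max_threads()
    ! Thread tid accumulates into partial(1, tid); consecutive columns
    ! start pad*8 = 64 bytes apart, so two threads' slots can never
    ! lie in the same 64-byte line.
    allocate(partial(pad, 0:nthreads - 1))
    partial = 0.0_dp
    !$omp parallel private(tid, i)
    tid = 0
    !$ tid = omp_get_thread_num()
    !$omp do schedule(static)
    do i = 1, size(x)
      partial(1, tid) = partial(1, tid) + x(i)
    end do
    !$omp end do
    !$omp end parallel
    total = sum(partial(1, :))
  end function padded_sum
end module padded_slots_sketch
```

In practice, accumulating into a private scalar, or using a REDUCTION(+:...) clause, avoids the shared writes altogether and is usually simpler; padding mainly helps when per-thread results must live in a shared array.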