I am trying to use OpenMP to speed up my program on a multi-core PC, but I see no effect.
I cannot explain why; it differs from what I have read in books and documentation.
Below is my demo code. On my laptop (Intel i5-3210M, 2 cores, 4 threads), 2 threads is always slower than 1 thread.
Compiled with /Qopenmp and default release options.
program Console1
   use omp_lib
   implicit none
   common /C1/ a, b
   integer, parameter :: N = 5000000
   logical, parameter :: iopenmp = .true.
   real :: a(N), b(N)
   real :: cc
   integer :: i, j, m
   integer :: nThreads_default = 4, nThreads = 2
   real :: t1, t2, t3

   m = OMP_GET_NUM_PROCS()
   print *, 'cpu number: ', m
   m = OMP_GET_MAX_THREADS()
   print *, 'Local max threads by openmp: ', m
   print *, 'Parallel threads: ', nThreads
   call omp_set_num_threads( nThreads )

10 call CPU_TIME(t1)
   !$omp parallel do private(cc) if (iopenmp)
   do i = 1, N
      if (i == 1) print *, 'parallel threads number:', OMP_GET_NUM_THREADS()
      a(i) = sin(cos(cos(sin(float(i)))))
      if (a(i) > 0.5D0) then
         b(i) = log(abs(a(i)))
         if (b(i) > 0D0) then
            cc = b(i) ** i
         else
            cc = abs(b(i)) ** (1/float(i))
         end if
      else
         b(i) = log(abs(1-a(i)))
         if (b(i) > 0D0) then
            cc = b(i) ** i
         else
            cc = abs(b(i)) ** (1/float(i))
         end if
      end if
      a(i) = abs(cc) ** ( a(i) )
   end do
   !$omp end parallel do
   call CPU_TIME(t2)
   print *, 'total time: ', t2-t1

   read *, m
   if (m <= 0) goto 10
   print *, a(m), b(m)
end program Console1
Try moving the 'print' inside the parallel loop (the thread-count report) to outside of the loop.
Having that external call (to the 'print' support routine) inside the loop prevents many optimizations.
Also, for future reference, the /Qopt-report option writes detailed optimization information to a file and will help you with future tuning.
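For example, a minimal sketch of that change (the loop body here is a placeholder, not your original computation): report the thread count once, from a single thread, so no external call remains inside the timed loop.

   program hoisted_print
      use omp_lib
      implicit none
      integer, parameter :: N = 5000000
      real :: a(N)
      integer :: i

      ! Report the thread count once, from one thread, before the timed loop.
      !$omp parallel
      !$omp master
      print *, 'parallel threads number:', omp_get_num_threads()
      !$omp end master
      !$omp end parallel

      !$omp parallel do
      do i = 1, N
         a(i) = sin(float(i))     ! placeholder work; no I/O in the loop body
      end do
      !$omp end parallel do

      print *, a(N)               ! consume a result so the loop is not dead code
   end program hoisted_print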
--Lorri
The CPU_TIME routine gives the total processor time summed across all CPUs. Try using SECOND or DSECND, which give elapsed (wall-clock) time.
Calvin D R. wrote:
The CPU_TIME routine gives the total processor time summed across all CPUs. Try using SECOND or DSECND, which give elapsed (wall-clock) time.
That's a valid point, but omp_get_wtime() is usually preferred.
cpu_time could be divided by wall clock time to show the average number of active threads.
In connection with the first response, ifort performs more dead code elimination when /Qopenmp is not set, so it's important to use benchmarks which actually perform work, at least to the extent that the compiler can't skip operations.
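A minimal sketch of both points, with a stand-in kernel: the printed checksum exists only so the compiler cannot eliminate the loop as dead code, and the cpu_time/wall-time ratio approximates the average number of active threads.

   program timing_demo
      use omp_lib
      implicit none
      integer, parameter :: N = 5000000
      real(8) :: a(N)
      real(8) :: w0, w1
      real :: c0, c1
      integer :: i

      call cpu_time(c0)
      w0 = omp_get_wtime()
      !$omp parallel do
      do i = 1, N
         a(i) = sin(cos(dble(i)))   ! stand-in workload, not the original kernel
      end do
      !$omp end parallel do
      w1 = omp_get_wtime()
      call cpu_time(c1)

      print *, 'wall time:              ', w1 - w0
      print *, 'cpu time:               ', c1 - c0
      print *, 'average active threads: ', (c1 - c0) / (w1 - w0)
      print *, 'checksum (defeats dead-code elimination):', sum(a)
   end program timing_demo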
Ohoo, thanks Lorri, Calvin and Tim!
I will use omp_get_wtime() instead.
I spent a lot of time today writing test code to show the performance of OpenMP, which was the wrong direction~
After implementing the advice, did you see any improvement in speed?
If you saw a speed-up, would this also be the case for larger N?
In my experience, large arrays impede OpenMP anyway, because some processors cannot pipeline arrays above a certain length.
johannes
I may not have said it well, but my point is that ifort is more likely to optimize away dead code when not compiling with OpenMP. So naive benchmarks may run longer under OpenMP because the compiler takes literally a request to perform redundant operations.
program Console1
   use omp_lib
   implicit none
   common /C1/ a, b
   integer, parameter :: N = 5000000
   logical, parameter :: iopenmp = .true.
   real(8) :: a(N), b(N)
   real(8) :: cc
   integer :: i, j
   integer :: nThreads, maxThreads, maxProcs, nRepeat, iRepeat
   real(8) :: t1, t2, t3

   maxProcs = OMP_GET_NUM_PROCS()
   print *, 'number of procs: ', maxProcs
   maxThreads = OMP_GET_MAX_THREADS()
   print *, 'Local max threads by openmp: ', maxThreads
   print *
   print *, 'Threads     average time        sums'
   nRepeat = 3
   do nThreads = 1, maxThreads
      t3 = 0.0D0
      call omp_set_num_threads( nThreads )
      do iRepeat = 1, nRepeat
         t1 = omp_get_wtime()
         !$omp parallel do private(cc)
         do i = 1, N
            a(i) = sin(cos(cos(sin(float(i)))))
            if (a(i) > 0.5D0) then
               b(i) = log(abs(a(i)))
               if (b(i) > 0D0) then
                  cc = b(i) ** i
               else
                  cc = abs(b(i)) ** (1/float(i))
               end if
            else
               b(i) = log(abs(1-a(i)))
               if (b(i) > 0D0) then
                  cc = b(i) ** i
               else
                  cc = abs(b(i)) ** (1/float(i))
               end if
            end if
            a(i) = abs(cc) ** ( a(i) )
         end do
         !$omp end parallel do
         t2 = omp_get_wtime()
         t3 = t3 + t2 - t1
      end do ! iRepeat
      write(*,'(i4,6X,3(G19.12))') nThreads, t3 / nRepeat, sum(a), sum(b)
   end do ! nThreads
end program Console1

---------------

number of procs: 8
Local max threads by openmp: 8

Threads     average time        sums
   1      0.605881328695     4999990.75081    -2206052.80457
   2      0.308599452022     4999990.75081    -2206052.80457
   3      0.201850473105     4999990.75081    -2206052.80457
   4      0.148289515482     4999990.75081    -2206052.80457
   5      0.151592309431     4999990.75081    -2206052.80457
   6      0.122735014030     4999990.75081    -2206052.80457
   7      0.105541613419     4999990.75081    -2206052.80457
   8      0.107511962454     4999990.75081    -2206052.80457
Core i7 2600K, 4 cores, 8 threads, KMP_AFFINITY=scatter
number of procs: 8
Local max threads by openmp: 8

Threads     average time        sums
   1      0.636691162363     4999990.75081    -2206052.80457
   2      0.518888275139     4999990.75081    -2206052.80457
   3      0.263102406946     4999990.75081    -2206052.80457
   4      0.187013466532     4999990.75081    -2206052.80457
   5      0.144245651861     4999990.75081    -2206052.80457
   6      0.137363188279     4999990.75081    -2206052.80457
   7      0.158516899683     4999990.75081    -2206052.80457
   8      0.140115333100     4999990.75081    -2206052.80457
KMP_AFFINITY=compact
Jim Dempsey
Post #8 was run without specifying the machine architecture (meaning multi-architecture code was generated with architecture dispatch).
The following is KMP_AFFINITY=scatter and compiled with /QxHost (Core i7 2600K with AVX).
number of procs: 8
Local max threads by openmp: 8

Threads     average time        sums
   1      0.320277455884     4999990.75081    -2206052.80457
   2      0.258740041095     4999990.75081    -2206052.80457
   3      0.185027207248     4999990.75081    -2206052.80457
   4      0.132966400745     4999990.75081    -2206052.80457
   5      0.132015697969     4999990.75081    -2206052.80457
   6      0.109969620748     4999990.75081    -2206052.80457
   7      0.946516091935E-01 4999990.75081    -2206052.80457
   8      0.957180853002E-01 4999990.75081    -2206052.80457
The runtime is too short to produce meaningful results.
Increasing N from 5,000,000 to 50,000,000 yields a better representation of runtime vs. threads (with scatter):
number of procs: 8
Local max threads by openmp: 8

Threads     average time        sums
   1      5.52842842353      49999989.3970    -22060529.8062
   2      2.79540038404      49999989.3970    -22060529.8062
   3      1.82513873807      49999989.3970    -22060529.8062
   4      1.36323879333      49999989.3970    -22060529.8062
   5      1.58161374306      49999989.3970    -22060529.8062
   6      1.10633571694      49999989.3970    -22060529.8062
   7      0.950655315537     49999989.3970    -22060529.8062
   8      0.842224370223     49999989.3970    -22060529.8062
Jim Dempsey
Jim, thank you for the code. I have to admit that I didn't know the KMP_AFFINITY variable. The results with the loops in your code are satisfying, although with 4 threads my results differ from yours. Compiling with /QxHost did not change the results. I will now open a new topic where I am really desperate with OpenMP; I would appreciate it if you would have a look.
These are my (N = 50 million) results on an i5-2450M CPU @ 2.50 GHz:
KMP_AFFINITY=scatter:

number of procs: 4
Local max threads by openmp: 4

Threads     average time        sums
   1      0.867108628464     4999990.75081    -2206052.74775
   2      0.474677234888     4999990.75081    -2206052.74775
   3      0.381072681785     4999990.75081    -2206052.74775
   4      0.511814453173     4999990.75081    -2206052.74775

and KMP_AFFINITY=compact:

number of procs: 4
Local max threads by openmp: 4

Threads     average time        sums
   1      0.862537666612     4999990.75081    -2206052.74775
   2      0.582822411632     4999990.75081    -2206052.74775
   3      0.399528809435     4999990.75081    -2206052.74775
   4      0.509297441070     4999990.75081    -2206052.74775
johannes
KMP_AFFINITY=scatter is equivalent to the (more recent) standard OMP_PROC_BIND=spread. These have the advantage of using all cores before placing 2 threads on a core.
KMP_AFFINITY=compact is equivalent to OMP_PROC_BIND=close. There is a potential advantage in cache locality when you have 2 threads per core.
I haven't found OMP_PROC_BIND or OMP_PLACES working on typical non-Intel Windows OpenMP implementations, so I'm not surprised to see people recommending the Intel-specific KMP_AFFINITY. On linux, however, you should find the standard environment variables working on a variety of implementations.
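For what it's worth, an OpenMP 4.0 runtime also lets the program itself report the binding policy in effect via omp_get_proc_bind(); a minimal sketch, assuming omp_lib provides the OpenMP 4.0 kind and constants:

   program show_binding
      use omp_lib
      implicit none
      integer (kind=omp_proc_bind_kind) :: policy

      ! Query the binding policy the runtime will apply to the next
      ! parallel region (as set via OMP_PROC_BIND / KMP_AFFINITY).
      policy = omp_get_proc_bind()
      select case (policy)
      case (omp_proc_bind_false)
         print *, 'proc bind: false (no binding)'
      case (omp_proc_bind_true)
         print *, 'proc bind: true'
      case (omp_proc_bind_master)
         print *, 'proc bind: master'
      case (omp_proc_bind_close)
         print *, 'proc bind: close (compact)'
      case (omp_proc_bind_spread)
         print *, 'proc bind: spread (scatter)'
      end select
   end program show_binding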
So now you are seeing improvement from 1 to 2 and from 2 to 3 threads, but not from 3 to 4 threads. You see a similar issue on my system going from 7 to 8 threads. While this may be due to memory bandwidth limits, I suspect it may also be related to the OpenMP monitor thread, which now appears to consume noticeable computational resources. This did not seem to be the case in earlier versions of OpenMP (before task support).
Jim Dempsey
Such a large drop in performance from 3 to 4 hyperthreads seems likely to be associated with turbo mode (turbo boost depending on the number and duration of active hyperthreads).
I used to see persistent down-clocking while running 1 thread following a 4-thread parallel region on a dual core. The effect seemed to be moderated by a BIOS update, but the net loss from 3 to 4 threads persists (as does most parallel vectorized apps running better with Intel OpenMP at 2 threads, OMP_PLACES=cores).
I don't know whether there is a recent, widely accepted ifort Windows method to display the current CPU clock rate before entering your parallel region (and after it has been running a while). I prefer to use emulation of /proc/cpuinfo (the usual Linux feature), so as to have portability between Linux and Windows.
Thanks for pointing out the turbo issue (I overlooked that). You could test whether this is turbo boost (or, more appropriately, core throttling) by entering a parallel region, executing a barrier, reading RDTSC on each thread, running a loop that performs an equal amount of work, reading RDTSC again on each thread, and exiting the parallel region, then comparing the clock ticks each thread took to execute the same amount of work. If the difference in runtime correlates with thread pairs on the same core, it is likely due to switching out of turbo boost. If, however, one thread within a core is delayed, this may be attributable to other work being done on the system (either by an extra thread in the application or by a different process/interrupt service). A sketch of such a harness follows below.
Note that in the test code (#8), the parallel region is timed when all threads complete the region. Thus, when other activity occurs on the system during the test run, the preemption delay of one thread delays the formal exit of the parallel region by that amount of time. In other words, within the thread-scalability calculation, the delay time is effectively multiplied by the number of threads. On my system, assuming 1% background activity, that could potentially appear in the statistics as 8% longer (in my case it did not, as I have less than 1% background activity when running tests).
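A minimal sketch of that per-thread measurement, with one substitution: omp_get_wtime() stands in for RDTSC, since standard Fortran has no intrinsic for reading the time-stamp counter (an RDTSC wrapper could be dropped in instead):

   program per_thread_timing
      use omp_lib
      implicit none
      integer, parameter :: nwork = 50000000
      integer, parameter :: maxth = 64
      real(8) :: t0(0:maxth-1), t1(0:maxth-1), s
      integer :: tid, i, nt

      nt = 0
      !$omp parallel private(tid, i, s)
      tid = omp_get_thread_num()
      !$omp single
      nt = omp_get_num_threads()
      !$omp end single
      !$omp barrier                      ! start all threads together
      t0(tid) = omp_get_wtime()
      s = 0.0d0
      do i = 1, nwork                    ! identical work on every thread
         s = s + sin(dble(i))
      end do
      t1(tid) = omp_get_wtime()
      if (s == huge(s)) print *, s       ! defeat dead-code elimination
      !$omp end parallel

      ! Unequal per-thread times on sibling hyperthreads suggest turbo/throttling;
      ! a single slow thread suggests preemption by other system activity.
      do tid = 0, nt - 1
         write(*,'(a,i3,f10.4,a)') ' thread', tid, t1(tid) - t0(tid), ' s'
      end do
   end program per_thread_timing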
Jim Dempsey
