I am trying to use OpenMP to speed up my program on a multi-core PC, but I see no effect.
I cannot explain why; it differs from what I have read in books and documentation.
Below is my demo code. On my laptop (Intel i5-3210M, 2 cores, 4 threads), 2 threads is always slower than 1 thread.
Compiled with /Qopenmp and default release options.
program Console1
   use omp_lib
   implicit none
   common /C1/ a, b
   integer, parameter :: N = 5000000
   logical, parameter :: iopenmp = .true.
   real :: a(N), b(N)
   real :: cc
   integer :: i, j, m
   integer :: nThreads_default = 4, nThreads = 2
   real :: t1, t2, t3

   m = OMP_GET_NUM_PROCS()
   print *, 'cpu number: ', m
   m = OMP_GET_MAX_THREADS()
   print *, 'Local max threads by openmp: ', m
   print *, 'Parallel threads: ', nThreads
   call omp_set_num_threads( nThreads )

10 call CPU_TIME(t1)
   !$omp parallel do private(cc) if (iopenmp)
   do i = 1, N
      if (i == 1) print *, 'parallel threads number:', OMP_GET_NUM_THREADS()
      a(i) = sin(cos(cos(sin(float(i)))))
      if (a(i) > 0.5D0) then
         b(i) = log(abs(a(i)))
         if (b(i) > 0D0) then
            cc = b(i) ** i
         else
            cc = abs(b(i)) ** (1/float(i))
         end if
      else
         b(i) = log(abs(1-a(i)))
         if (b(i) > 0D0) then
            cc = b(i) ** i
         else
            cc = abs(b(i)) ** (1/float(i))
         end if
      end if
      a(i) = abs(cc) ** ( a(i) )
   end do
   !$omp end parallel do
   call CPU_TIME(t2)
   print *, 'total time: ', t2-t1

   read *, m
   if (m <= 0) goto 10
   print *, a(m), b(m)
end program Console1
Try moving the 'print' inside the parallel loop (the thread-count report) to outside of the loop.
Having that external call (to the 'print' support routine) inside the loop prevents many optimizations.
Also, for future reference, the /Qopt-report option writes detailed optimization information to a file and will help you with future tuning.
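For example, a minimal sketch of that change (the loop body here is a placeholder, not your original computation): report the thread count once, from a single thread, so no external call remains inside the timed loop.

   program hoisted_print
      use omp_lib
      implicit none
      integer, parameter :: N = 5000000
      real :: a(N)
      integer :: i

      ! Report the thread count once, from one thread, before the timed loop.
      !$omp parallel
      !$omp master
      print *, 'parallel threads number:', omp_get_num_threads()
      !$omp end master
      !$omp end parallel

      !$omp parallel do
      do i = 1, N
         a(i) = sin(float(i))     ! placeholder work; no I/O in the loop body
      end do
      !$omp end parallel do

      print *, a(N)               ! consume a result so the loop is not dead code
   end program hoisted_print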
--Lorri
The CPU_TIME routine gives the total processor time summed across all CPUs. Try using SECOND or DSECND, which give elapsed (wall-clock) time.
Calvin D R. wrote:
The CPU_TIME routine gives the total processor time summed across all CPUs. Try using SECOND or DSECND, which give elapsed (wall-clock) time.
That's a valid point, but omp_get_wtime() is usually preferred.
cpu_time could be divided by wall clock time to show the average number of active threads.
In connection with the first response, ifort performs more dead code elimination when /Qopenmp is not set, so it's important to use benchmarks which actually perform work, at least to the extent that the compiler can't skip operations.
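A minimal sketch of both points, with a stand-in kernel: the printed checksum exists only so the compiler cannot eliminate the loop as dead code, and the cpu_time/wall-time ratio approximates the average number of active threads.

   program timing_demo
      use omp_lib
      implicit none
      integer, parameter :: N = 5000000
      real(8) :: a(N)
      real(8) :: w0, w1
      real :: c0, c1
      integer :: i

      call cpu_time(c0)
      w0 = omp_get_wtime()
      !$omp parallel do
      do i = 1, N
         a(i) = sin(cos(dble(i)))   ! stand-in workload, not the original kernel
      end do
      !$omp end parallel do
      w1 = omp_get_wtime()
      call cpu_time(c1)

      print *, 'wall time:              ', w1 - w0
      print *, 'cpu time:               ', c1 - c0
      print *, 'average active threads: ', (c1 - c0) / (w1 - w0)
      print *, 'checksum (defeats dead-code elimination):', sum(a)
   end program timing_demo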
Ohoo, thanks Lorri, Calvin and Tim!
I will use omp_get_wtime() instead.
I spent a lot of time today writing test code to show the performance of OpenMP, which was the wrong direction~
After implementing the advice, did you see any improvement in speed?
If you saw a speed-up, would this also be the case for larger N?
In my experience, large arrays impede OpenMP anyway, because some processors cannot pipeline arrays above a certain length.
johannes
I may not have said it well, but my point is that ifort is more likely to optimize away dead code when not compiling with OpenMP. So naive benchmarks may run longer under OpenMP because the compiler takes literally a request to perform redundant operations.
program Console1
   use omp_lib
   implicit none
   common /C1/ a, b
   integer, parameter :: N = 5000000
   logical, parameter :: iopenmp = .true.
   real(8) :: a(N), b(N)
   real(8) :: cc
   integer :: i, j
   integer :: nThreads, maxThreads, maxProcs, nRepeat, iRepeat
   real(8) :: t1, t2, t3

   maxProcs = OMP_GET_NUM_PROCS()
   print *, 'number of procs: ', maxProcs
   maxThreads = OMP_GET_MAX_THREADS()
   print *, 'Local max threads by openmp: ', maxThreads
   print *
   print *, 'Threads     average time        sums'
   nRepeat = 3
   do nThreads = 1, maxThreads
      t3 = 0.0D0
      call omp_set_num_threads( nThreads )
      do iRepeat = 1, nRepeat
         t1 = omp_get_wtime()
         !$omp parallel do private(cc)
         do i = 1, N
            a(i) = sin(cos(cos(sin(float(i)))))
            if (a(i) > 0.5D0) then
               b(i) = log(abs(a(i)))
               if (b(i) > 0D0) then
                  cc = b(i) ** i
               else
                  cc = abs(b(i)) ** (1/float(i))
               end if
            else
               b(i) = log(abs(1-a(i)))
               if (b(i) > 0D0) then
                  cc = b(i) ** i
               else
                  cc = abs(b(i)) ** (1/float(i))
               end if
            end if
            a(i) = abs(cc) ** ( a(i) )
         end do
         !$omp end parallel do
         t2 = omp_get_wtime()
         t3 = t3 + t2 - t1
      end do ! iRepeat
      write(*,'(i4,6X,3(G19.12))') nThreads, t3 / nRepeat, sum(a), sum(b)
   end do ! nThreads
end program Console1

---------------

number of procs: 8
Local max threads by openmp: 8

Threads     average time        sums
   1      0.605881328695     4999990.75081    -2206052.80457
   2      0.308599452022     4999990.75081    -2206052.80457
   3      0.201850473105     4999990.75081    -2206052.80457
   4      0.148289515482     4999990.75081    -2206052.80457
   5      0.151592309431     4999990.75081    -2206052.80457
   6      0.122735014030     4999990.75081    -2206052.80457
   7      0.105541613419     4999990.75081    -2206052.80457
   8      0.107511962454     4999990.75081    -2206052.80457
Core i7 2600K, 4 cores, 8 threads, KMP_AFFINITY=scatter
number of procs: 8
Local max threads by openmp: 8

Threads     average time        sums
   1      0.636691162363     4999990.75081    -2206052.80457
   2      0.518888275139     4999990.75081    -2206052.80457
   3      0.263102406946     4999990.75081    -2206052.80457
   4      0.187013466532     4999990.75081    -2206052.80457
   5      0.144245651861     4999990.75081    -2206052.80457
   6      0.137363188279     4999990.75081    -2206052.80457
   7      0.158516899683     4999990.75081    -2206052.80457
   8      0.140115333100     4999990.75081    -2206052.80457
KMP_AFFINITY=compact
Jim Dempsey
Post #8 was run without specifying the machine architecture (meaning multi-architecture code was generated with architecture dispatch).
The following is KMP_AFFINITY=scatter and compiled with /QxHost (Core i7 2600K with AVX).
number of procs: 8
Local max threads by openmp: 8

Threads     average time        sums
   1      0.320277455884     4999990.75081    -2206052.80457
   2      0.258740041095     4999990.75081    -2206052.80457
   3      0.185027207248     4999990.75081    -2206052.80457
   4      0.132966400745     4999990.75081    -2206052.80457
   5      0.132015697969     4999990.75081    -2206052.80457
   6      0.109969620748     4999990.75081    -2206052.80457
   7      0.946516091935E-01 4999990.75081    -2206052.80457
   8      0.957180853002E-01 4999990.75081    -2206052.80457
The runtime is too short to produce meaningful results.
Increasing N from 5,000,000 to 50,000,000 yields a better representation of runtime vs. threads (with scatter):
number of procs: 8
Local max threads by openmp: 8

Threads     average time        sums
   1      5.52842842353      49999989.3970    -22060529.8062
   2      2.79540038404      49999989.3970    -22060529.8062
   3      1.82513873807      49999989.3970    -22060529.8062
   4      1.36323879333      49999989.3970    -22060529.8062
   5      1.58161374306      49999989.3970    -22060529.8062
   6      1.10633571694      49999989.3970    -22060529.8062
   7      0.950655315537     49999989.3970    -22060529.8062
   8      0.842224370223     49999989.3970    -22060529.8062
Jim Dempsey
Jim, thank you for the code. I have to admit that I didn't know the KMP_AFFINITY variable. The results with the loops in your code are satisfying, although with 4 threads my results differ from yours. Compiling with /QxHost did not change the results. I will now open a new topic where I am really desperate with OpenMP; I would appreciate it if you would have a look.
These are my (N = 50 million) results on an i5-2450M CPU @ 2.50 GHz:
KMP_AFFINITY=scatter:

number of procs: 4
Local max threads by openmp: 4

Threads     average time        sums
   1      0.867108628464     4999990.75081    -2206052.74775
   2      0.474677234888     4999990.75081    -2206052.74775
   3      0.381072681785     4999990.75081    -2206052.74775
   4      0.511814453173     4999990.75081    -2206052.74775

and KMP_AFFINITY=compact:

number of procs: 4
Local max threads by openmp: 4

Threads     average time        sums
   1      0.862537666612     4999990.75081    -2206052.74775
   2      0.582822411632     4999990.75081    -2206052.74775
   3      0.399528809435     4999990.75081    -2206052.74775
   4      0.509297441070     4999990.75081    -2206052.74775
johannes
KMP_AFFINITY=scatter is equivalent to the (more recent) standard OMP_PROC_BIND=spread. These have the advantage of using all cores before placing 2 threads on a core.
KMP_AFFINITY=compact is equivalent to OMP_PROC_BIND=close. There is a potential advantage in cache locality when you have 2 threads per core.
I haven't found OMP_PROC_BIND or OMP_PLACES working on typical non-Intel Windows OpenMP implementations, so I'm not surprised to see people recommending the Intel-specific KMP_AFFINITY. On linux, however, you should find the standard environment variables working on a variety of implementations.
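For what it's worth, an OpenMP 4.0 runtime also lets the program itself report the binding policy in effect via omp_get_proc_bind(); a minimal sketch, assuming omp_lib provides the OpenMP 4.0 kind and constants:

   program show_binding
      use omp_lib
      implicit none
      integer (kind=omp_proc_bind_kind) :: policy

      ! Query the binding policy the runtime will apply to the next
      ! parallel region (as set via OMP_PROC_BIND / KMP_AFFINITY).
      policy = omp_get_proc_bind()
      select case (policy)
      case (omp_proc_bind_false)
         print *, 'proc bind: false (no binding)'
      case (omp_proc_bind_true)
         print *, 'proc bind: true'
      case (omp_proc_bind_master)
         print *, 'proc bind: master'
      case (omp_proc_bind_close)
         print *, 'proc bind: close (compact)'
      case (omp_proc_bind_spread)
         print *, 'proc bind: spread (scatter)'
      end select
   end program show_binding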
So now you are seeing improvement from 1 to 2 and from 2 to 3 threads, but not from 3 to 4 threads. You see a similar issue on my system going from 7 to 8 threads. While this may be due to memory bandwidth limits, I suspect it may also be related to the OpenMP monitor thread, which now appears to consume noticeable computational resources. This did not seem to be the case in earlier versions of OpenMP (before task support).
Jim Dempsey
Such a large drop in performance from 3 to 4 hyperthreads seems likely to be associated with turbo mode (turbo boost depending on the number and duration of active hyperthreads).
I used to see persistent down-clocking while running 1 thread following a 4-thread parallel region on a dual core. The effect seemed to be moderated by a BIOS update, but the net loss from 3 to 4 threads persists (as does most parallel vectorized apps running better with Intel OpenMP at 2 threads, OMP_PLACES=cores).
I don't know whether there is a recent, widely accepted ifort Windows method to display the current CPU clock rate before entering your parallel region (and after it has been running a while). I prefer to use emulation of /proc/cpuinfo (the usual Linux feature), so as to have portability between Linux and Windows.
Thanks for pointing out the turbo issue (I overlooked that). You could test whether this is turbo boost (or, more appropriately, core throttling) by entering a parallel region, executing a barrier, reading RDTSC on each thread, running a loop that performs an equal amount of work, reading RDTSC again on each thread, and exiting the parallel region, then comparing the clock ticks each thread took to execute the same amount of work. If the difference in runtime correlates with thread pairs on the same core, it is likely due to switching out of turbo boost. If, however, one thread within a core is delayed, this may be attributable to other work being done on the system (either by an extra thread in the application or by a different process/interrupt service). A sketch of such a harness follows below.
Note that in the test code (#8), the parallel region is timed when all threads complete the region. Thus, when other activity occurs on the system during the test run, the preemption delay of one thread delays the formal exit of the parallel region by that amount of time. In other words, within the thread-scalability calculation, the delay time is effectively multiplied by the number of threads. On my system, assuming 1% background activity, that could potentially appear in the statistics as 8% longer (in my case it did not, as I have less than 1% background activity when running tests).
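A minimal sketch of that per-thread measurement, with one substitution: omp_get_wtime() stands in for RDTSC, since standard Fortran has no intrinsic for reading the time-stamp counter (an RDTSC wrapper could be dropped in instead):

   program per_thread_timing
      use omp_lib
      implicit none
      integer, parameter :: nwork = 50000000
      integer, parameter :: maxth = 64
      real(8) :: t0(0:maxth-1), t1(0:maxth-1), s
      integer :: tid, i, nt

      nt = 0
      !$omp parallel private(tid, i, s)
      tid = omp_get_thread_num()
      !$omp single
      nt = omp_get_num_threads()
      !$omp end single
      !$omp barrier                      ! start all threads together
      t0(tid) = omp_get_wtime()
      s = 0.0d0
      do i = 1, nwork                    ! identical work on every thread
         s = s + sin(dble(i))
      end do
      t1(tid) = omp_get_wtime()
      if (s == huge(s)) print *, s       ! defeat dead-code elimination
      !$omp end parallel

      ! Unequal per-thread times on sibling hyperthreads suggest turbo/throttling;
      ! a single slow thread suggests preemption by other system activity.
      do tid = 0, nt - 1
         write(*,'(a,i3,f10.4,a)') ' thread', tid, t1(tid) - t0(tid), ' s'
      end do
   end program per_thread_timing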
Jim Dempsey
