Intel® Fortran Compiler

openmp setup

Edson_B_
Beginner
Dear colleagues,

I'm having some problems with an OpenMP test program. The running time remains the same for any number of threads. My test code is:

PROGRAM openmp
!$ USE OMP_LIB
      IMPLICIT NONE
      INTEGER*4 :: i, num_threads
      REAL*8 :: start_time
      REAL*8 :: end_time
      INTEGER*4, ALLOCATABLE, DIMENSION (:) :: m
      ALLOCATE ( m (1000000000) )
      m = 0
      i = 0
      DO num_threads = 1,8
            CALL OMP_SET_NUM_THREADS ( num_threads )
            start_time = OMP_GET_WTIME()
            !$OMP PARALLEL DO
            DO i = 1, 1000000000
                  m(i) = m(i) + 1
            ENDDO
            !$OMP END PARALLEL DO
            end_time = OMP_GET_WTIME()
            WRITE ( *, 50 ) num_threads, end_time - start_time
      ENDDO
50    FORMAT ( ' Threads=', I2, '. Time ', F12.8,' seconds.' )
END PROGRAM openmp

OUTPUT:

 Threads= 1. Time   0.44233354 seconds.
 Threads= 2. Time   0.41301712 seconds.
 Threads= 3. Time   0.38321273 seconds.
 Threads= 4. Time   0.40626812 seconds.
 Threads= 5. Time   0.40601481 seconds.
 Threads= 6. Time   0.39871999 seconds.
 Threads= 7. Time   0.70736344 seconds.
 Threads= 8. Time   0.40052118 seconds.

Compilation options:

- Release x64
- Fortran -> Preprocessor -> OpenMP Conditional Compilation -> Yes
- Fortran -> Language -> Process OpenMP Directives -> Generate Parallel Code (/Qopenmp)

Hardware: i7-3632QM
OS: Windows 10

I appreciate any help. Thanks in advance.
9 Replies
TimP
Honored Contributor III

As you have just 4 cores, you may see best performance at 4 threads when you SET OMP_PLACES=cores, at least if you make the number of iterations a multiple of 32 (assuming you compiled for AVX).

You may be getting enough extra turbo speedup that the single thread case nearly saturates your memory bandwidth, particularly if you don't have the fastest RAM for your CPU.

TimP
Honored Contributor III

Just in case, a reminder about -align:array32byte (VS property Data > Default Array Alignment). With OMP_PLACES and an aligned array, you should see at least 20% speedup from 1 to 2 threads (but it's not consistently repeatable with your test).

Edson_B_
Beginner

Hi Tim.

Thank you for your attention. I made these changes in my code:

PROGRAM openmp
USE IFPORT
!$ USE OMP_LIB
      IMPLICIT NONE
      INTEGER*4 :: i, num_threads
      REAL*8 :: start_time
      REAL*8 :: end_time
      INTEGER*4, ALLOCATABLE, DIMENSION (:) :: m
      LOGICAL(4) :: success
      success = SETENVQQ("OMP_PLACES=4")
      PRINT*, 'SET_ENV = ',success
      ALLOCATE ( m (1000000000) )
      m = 0
      i = 0
      DO num_threads = 1,8
            CALL OMP_SET_NUM_THREADS ( num_threads )
            start_time = OMP_GET_WTIME()
            !$OMP PARALLEL DO
            DO i = 1, 1000000000
                  m(i) = m(i) + 1
            ENDDO
            !$OMP END PARALLEL DO
            end_time = OMP_GET_WTIME()
            WRITE ( *, 50 ) num_threads, end_time - start_time
      ENDDO
50    FORMAT ( ' Threads=', I2, '. Time ', F12.8,' seconds.' )
END PROGRAM openmp

and get this output:

 SET_ENV =  T

 Threads= 1. Time   0.40303241 seconds.
 Threads= 2. Time   0.40521241 seconds.
 Threads= 3. Time   0.48451957 seconds.
 Threads= 4. Time   0.38711276 seconds.
 Threads= 5. Time   0.40265500 seconds.
 Threads= 6. Time   0.39864395 seconds.
 Threads= 7. Time   0.39979156 seconds.
 Threads= 8. Time   0.40455043 seconds.
 

I ran the program several times, and the results look the same to me. What do you think? Am I doing something wrong?

I set some extra variables in VS:

Code Generation -> Enable Enhanced Instruction Set -> Intel(R) Advanced Vector Extensions (/arch:AVX)

Data -> Default Array Alignment -> 32 Bytes (/align:array32byte)

TimP
Honored Contributor III

I don't see that OMP_PLACES=4 could be a useful setting.  For me, it causes your program to throw exceptions and hang.  The setting "cores" doesn't mean number of cores, it means pinning 1 thread per core.  If you wish to check what the numeric settings are, you might be able to find it by setting KMP_AFFINITY=verbose along with OMP_PLACES=cores.

For me, after all the cores are used up, the result of verbose setting seems to be saying that additional threads are pinned to the last core.  I'm not sure this makes sense.

Edson_B_
Beginner

Thanks again Tim.

Sorry, my mistake about OMP_PLACES. I modified my program again, but I keep getting the same performance. I did not see the messages that should appear with the use of the environment variable KMP_AFFINITY = verbose. Am I doing something wrong again?

New code:

PROGRAM openmp
USE IFPORT
!$ USE OMP_LIB
      IMPLICIT NONE
      INTEGER*4 :: i, num_threads
      REAL*8 :: start_time
      REAL*8 :: end_time
      INTEGER*4, ALLOCATABLE, DIMENSION (:) :: m
      LOGICAL(4) :: success
      success = SETENVQQ("OMP_PLACES=cores")
      PRINT*, 'SET_ENV = ',success
      success = SETENVQQ("KMP_AFFINITY=verbose")
      PRINT*, 'SET_ENV = ',success
      ALLOCATE ( m (1000000000) )
      m = 0
      i = 0
      DO num_threads = 1,8
            CALL OMP_SET_NUM_THREADS ( num_threads )
            start_time = OMP_GET_WTIME()
            !$OMP PARALLEL DO
            DO i = 1, 1000000000
                  m(i) = m(i) + 1
            ENDDO
            !$OMP END PARALLEL DO
            end_time = OMP_GET_WTIME()
            WRITE ( *, 50 ) num_threads, end_time - start_time
      ENDDO
50    FORMAT ( ' Threads=', I2, '. Time ', F12.8,' seconds.' )
END PROGRAM openmp

TimP
Honored Contributor III

KMP_AFFINITY=verbose messages should come out on the standard error stream, which you can redirect to a file or merge with your screen output using the usual shell redirection, e.g. your.exe > stdout.txt 2>&1

I'm used to running these diagnostics from cmd or bash so not sure about what happens if you run by double-clicking on your .exe.

jimdempseyatthecove
Honored Contributor III

PROGRAM openmp
USE IFPORT
!$ USE OMP_LIB
      IMPLICIT NONE
      INTEGER*4 :: i, num_threads
      REAL*8 :: start_time
      REAL*8 :: end_time
      INTEGER*4, ALLOCATABLE, DIMENSION (:) :: m
      LOGICAL(4) :: success
      success = SETENVQQ("OMP_PLACES=cores")
      PRINT*, 'SET_ENV = ',success
      success = SETENVQQ("KMP_AFFINITY=verbose")
      PRINT*, 'SET_ENV = ',success
      ! ******** ADD **********
      !$OMP PARALLEL
      PRINT *,OMP_GET_THREAD_NUM()
      !$OMP END PARALLEL
      ! ********* end ADD *********
      ALLOCATE ( m (1000000000) )
      m = 0
      i = 0
      DO num_threads = 1,8
            CALL OMP_SET_NUM_THREADS ( num_threads )
            start_time = OMP_GET_WTIME()
            !$OMP PARALLEL DO
            DO i = 1, 1000000000
                  m(i) = m(i) + 1
            ENDDO
            !$OMP END PARALLEL DO
            end_time = OMP_GET_WTIME()
            WRITE ( *, 50 ) num_threads, end_time - start_time
      ENDDO
50    FORMAT ( ' Threads=', I2, '. Time ', F12.8,' seconds.' )
END PROGRAM openmp

When you run the above program, how many thread numbers print out from the new code?

If you see one thread, then there is an environment variable issue. Or you are not generating threaded code.

The above program should show increasing performance up until you saturate the memory bus. Your i7-3632QM is rated at 25.6 GB/s memory bandwidth (max); in practice you will generally get less. Your array has 1G 4-byte elements, and you are performing a read/modify/write, which is 8 GB of memory traffic. Therefore the minimum time would be 8 GB / (something less than 25.6 GB/s), or a bit over 0.3 seconds. Your measured ~0.4 seconds is already close to that limit, so the loop is most likely memory-bound.
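
Spelling out that arithmetic, as a rough sketch assuming decimal gigabytes, exactly one read and one write per element, and the roughly 0.40 s timings reported earlier in the thread:

```latex
\begin{align*}
\text{traffic} &= 10^{9}\ \text{elements} \times 4\ \text{B} \times 2\ (\text{read} + \text{write}) = 8 \times 10^{9}\ \text{B} \\
t_{\min} &= \frac{8\ \text{GB}}{25.6\ \text{GB/s}} \approx 0.31\ \text{s} \\
\text{effective bandwidth} &= \frac{8\ \text{GB}}{0.40\ \text{s}} = 20\ \text{GB/s} \approx 78\%\ \text{of the rated peak}
\end{align*}
```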

Let's probe the memory bus bandwidth issue by making the "compute" statement a little more compute-intensive.

      m(i) = m(i) + int(sqrt(real(m(i)))) + 1

If that does not scale up to 4x (your cpu has 4 cores) then something is interfering with your threading.
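
For reference, the test with that statement dropped in might look like the sketch below. The program name, the smaller array size n, and the 1 to 4 thread range are assumptions chosen for illustration (the thread itself uses 1000000000 elements and 1 to 8 threads); the timing structure is otherwise unchanged. It needs to be built with OpenMP enabled (e.g. /Qopenmp).

```fortran
PROGRAM openmp_compute
      USE OMP_LIB
      IMPLICIT NONE
      ! Assumption: a smaller array than in the thread, so it fits in modest RAM
      INTEGER, PARAMETER :: n = 100000000
      INTEGER :: i, num_threads
      REAL(8) :: start_time, end_time
      INTEGER, ALLOCATABLE, DIMENSION (:) :: m
      ALLOCATE ( m (n) )
      m = 0
      DO num_threads = 1, 4
            CALL OMP_SET_NUM_THREADS ( num_threads )
            start_time = OMP_GET_WTIME()
            !$OMP PARALLEL DO
            DO i = 1, n
                  ! More arithmetic per element loaded than the plain m(i) + 1
                  m(i) = m(i) + INT(SQRT(REAL(m(i)))) + 1
            ENDDO
            !$OMP END PARALLEL DO
            end_time = OMP_GET_WTIME()
            WRITE ( *, 50 ) num_threads, end_time - start_time
      ENDDO
50    FORMAT ( ' Threads=', I2, '. Time ', F12.8,' seconds.' )
END PROGRAM openmp_compute
```

If the times still do not drop as threads are added, heavier per-element work may be needed before the loop stops being limited by memory bandwidth.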

Jim Dempsey

TimP
Honored Contributor III

As a matter of minor interest, gfortran runs this test consistently faster at 4 threads than ifort on my i5-4200U (even though there is no 32-byte alignment option nor any OMP_PLACES), consistently slower at 1 thread.  ifort may be seeing more turbo speedup at 1 thread, due to perhaps spending less time multi-threaded prior to running the single thread timing test. 

Non-repeatability of the first timing test under both ifort and gfortran might be helped by putting some kind of single-thread warm-up loop ahead of the test.
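
A minimal sketch of such a warm-up, assuming an untimed serial pass over the same array is enough; the program name, the array size n, and the 1 to 4 thread range are illustrative choices rather than values from the thread:

```fortran
PROGRAM openmp_warmup
      USE OMP_LIB
      IMPLICIT NONE
      ! Assumption: a smaller array than in the thread, so it fits in modest RAM
      INTEGER, PARAMETER :: n = 100000000
      INTEGER :: i, num_threads
      REAL(8) :: start_time, end_time
      INTEGER, ALLOCATABLE, DIMENSION (:) :: m
      ALLOCATE ( m (n) )
      m = 0
      ! Untimed single-thread warm-up: same loop body, no OpenMP directive
      DO i = 1, n
            m(i) = m(i) + 1
      ENDDO
      ! Timed runs, structured as in the original test
      DO num_threads = 1, 4
            CALL OMP_SET_NUM_THREADS ( num_threads )
            start_time = OMP_GET_WTIME()
            !$OMP PARALLEL DO
            DO i = 1, n
                  m(i) = m(i) + 1
            ENDDO
            !$OMP END PARALLEL DO
            end_time = OMP_GET_WTIME()
            WRITE ( *, 50 ) num_threads, end_time - start_time
      ENDDO
50    FORMAT ( ' Threads=', I2, '. Time ', F12.8,' seconds.' )
END PROGRAM openmp_warmup
```

Whether this actually stabilizes the first timing will depend on the machine and its turbo behavior.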

My Ultrabook has no options to disable hyperthreading or turbo, although reverting to Windows 8.1 from 10 restored the limited BIOS setup menu.
 

Edson_B_
Beginner

Thank you very much Jim Dempsey and Tim P.
You were right, Mr. Dempsey: the processor needed more work per element.
Sorry for the delay in answering.
Thanks again. 
