Intel® Fortran Compiler

OMP Program

JohnNichols
Valued Contributor III

Barbara:

You put up a simple OpenMP program. I tried it on the NUC and it did not complete; I thought it was throwing an error, but that was my mistake.

I amended the program so it gives you some idea of how it can perform. 

program matrix_multiply
    use omp_lib
    implicit none
    integer :: i, j, k, myid, m, n,i1
    integer alloc_err
    real(8), allocatable, dimension(:,:) :: a, b, c, c_serial
    REAL time_begin, time_end
    

    do 100 i1 = 1, 11
        if(i1 .eq. 1) then
            n = 4
        else if(i1 .eq. 2) then
            n = 10
        else if(i1 .eq. 3) then
            n = 40
        else if(i1 .eq. 4) then
            n = 100
        else if(i1 .eq. 5) then
            n = 200
        else if(i1 .eq. 6) then
            n = 400
        else if(i1 .eq. 7) then
            n = 700
        else if(i1 .eq. 8) then
            n = 1000
        else if(i1 .eq. 9) then
            n = 1500
        else if(i1 .eq. 10) then
            n = 2000
        else if(i1 .eq. 11) then
            n = 2600
        end if
        
    CALL CPU_TIME ( time_begin )
        myid = OMP_GET_THREAD_NUM()
        if (myid .eq. 0) then
            print *, 'matrix size ', n
            print *, 'Number of CPU procs is ', OMP_GET_NUM_THREADS()
        else
            print *, 'Number of OpenMP Device Available:', omp_get_num_devices()
            !$omp target
            if (OMP_IS_INITIAL_DEVICE()) then
                print *, ' Running on CPU'
            else
                print *, ' Running on GPU'
            endif
            !$omp end target
        endif
        
        allocate( a(n,n), b(n,n), c(n,n), c_serial(n,n))

        ! Initialize matrices
        do j=1,n
            do i=1,n
                a(i,j) = i + j - 1
                b(i,j) = i - j + 1
            enddo
        enddo
        c = 0.0
        c_serial = 0.0

        !$omp target teams map(to: a, b) map(tofrom: c)
        !$omp distribute parallel do SIMD private(j, i, k)
        ! parallel compute matrix multiplication.
        do j=1,n
            do i=1,n
                do k=1,n
                    c(i,j) = c(i,j) + a(i,k) * b(k,j)
                enddo
            enddo
        enddo
        !$omp end target teams

        ! serial compute matrix multiplication
        do j=1,n
            do i=1,n
                do k=1,n
                    c_serial(i,j) = c_serial(i,j) + a(i,k) * b(k,j)
                enddo
            enddo
        enddo

        ! verify result
        do j=1,n
            do i=1,n
                if (c_serial(i,j) .ne. c(i,j)) then
                    print *,'FAILED, i, j, c_serial(i,j), c(i,j) ', i, j, c_serial(i,j), c(i,j)
                    exit
                endif
            enddo
        enddo

        write(*,50)i1
50      Format('The test PASSED --- ok so this is the ',i3,'   wait for the tenth step!')
        CALL CPU_TIME ( time_end )
        WRITE (*,200)i1, time_end - time_begin
        if(i1 .eq. 4 .and. (time_end - time_begin .gt. 0.1)) then
            write(*,300)
300         Format(//"                  You crack me up, please it will not finish in your lifetime on this old computer, update human!"//"                 Can someone make a cup of tea please!",//)
200         Format('Loop :: ',i3,'  Time of operation was ',F10.3,' seconds')
        endif
        DEALLOCATE(A, B, C, C_Serial,STAT = ALLOC_ERR)
100 end do

    end program matrix_multiply

 

 

The program runs on a single processor in time cubic in n, so by the time you get to n = 2600 it takes 38 minutes to run.
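(For scale: each triple loop does n^3 multiply-adds, so n = 2600 is about 1.8 x 10^10 of them per method, against only 6.4 x 10^4 at n = 40.)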

I enclose a graph. I ran it three times: first as your program and then twice as the modified version.

Anyway, it was fun.  

 

[attached graph: run time versus matrix size for the three runs]

 

jimdempseyatthecove
Honored Contributor III

John,

Your CPU time interval includes:

a) Identification as to if CPU or GPU

b) allocation of arrays

c) initialization of arrays

d) the (potentially) parallel matrix multiplication

e) the serial matrix multiplication

f) the verify

 

As opposed to timing each of d) and e).

 

Why was this done?

 

Jim Dempsey

JohnNichols
Valued Contributor III

Jim:

I had tried the program on my old NUC, about six years old, and at matrix size 2600 it seemed never to return; after a few hundred seconds anyone would be hitting keys to see what is happening. It appeared to be locked at line 69, but I was wrong.

The program is intended as a sample, and what is the point of a sample that causes one to pull out one's remaining hair? So I amended the matrix size to 4 and it was fast. Then I tried a few bigger numbers; as old Physics 101 cut in and I remembered Squires, I thought I had better time it. At first I got a parabola, but it was a poor predictor, so I collected longer runs and moved to a cubic.

My interest was in the whole process, not just the OMP. It is meant to show someone who uses the sample both the sample itself and the growth in run time.

It was merely a suggestion for Barbara.  

She must be on holiday, as she is normally a fast responder.

I will run your suggestion today. 

John

I only had the NUC because I could not spare my Dell; it is doing real work that earns real money. Plus the NUC is about the worst machine anyone is likely to have at the moment.

It was just for fun in between writing an FE model generator and running said models.   

JohnNichols
Valued Contributor III

Jim:

I tried your suggestion; the program ran consistently slower.

Here is the amended timing figure 

[attached graph: amended timing figure]

The blue line is the lost time for your suggestion in centi-seconds, so I can plot it. Interestingly, n = 100 does not really fit the other data, and I ran it a few times to check. No idea why it is so much slower at 100.

Even if I manage to parallelize the program with OpenMP, n = 2600 on a 10-core machine is still going to take 2 to 3 minutes.

John

 

 

jimdempseyatthecove
Honored Contributor III
program matrix_multiply
    use omp_lib
    implicit none
    integer :: i, j, k, myid, m, n,i1
    integer alloc_err
    real(8), allocatable, dimension(:,:) :: a, b, c, c_serial
    REAL time_begin, time_end_1, time_end_2
    

    do 100 i1 = 1, 11
        if(i1 .eq. 1) then
            n = 4
        else if(i1 .eq. 2) then
            n = 10
        else if(i1 .eq. 3) then
            n = 40
        else if(i1 .eq. 4) then
            n = 100
        else if(i1 .eq. 5) then
            n = 200
        else if(i1 .eq. 6) then
            n = 400
        else if(i1 .eq. 7) then
            n = 700
        else if(i1 .eq. 8) then
            n = 1000
        else if(i1 .eq. 9) then
            n = 1500
        else if(i1 .eq. 10) then
            n = 2000
        else if(i1 .eq. 11) then
            n = 2600
        end if
        
! ---   CALL CPU_TIME ( time_begin ) ! do not time the initialization
        myid = OMP_GET_THREAD_NUM()
        if (myid .eq. 0) then
            print *, 'matrix size ', n
            print *, 'Number of CPU procs is ', OMP_GET_NUM_THREADS()
        else
            print *, 'Number of OpenMP Device Available:', omp_get_num_devices()
            !$omp target
            if (OMP_IS_INITIAL_DEVICE()) then
                print *, ' Running on CPU'
            else
                print *, ' Running on GPU'
            endif
            !$omp end target
        endif
        
        allocate( a(n,n), b(n,n), c(n,n), c_serial(n,n))

        ! Initialize matrices
        do j=1,n
            do i=1,n
                a(i,j) = i + j - 1
                b(i,j) = i - j + 1
            enddo
        enddo
        c = 0.0
        c_serial = 0.0
! +++++++ time just the matmul
        CALL CPU_TIME ( time_begin )
        !$omp target teams map(to: a, b) map(tofrom: c)
        !$omp distribute parallel do SIMD private(j, i, k)
        ! parallel compute matrix multiplication.
        do j=1,n
            do i=1,n
                do k=1,n
                    c(i,j) = c(i,j) + a(i,k) * b(k,j)
                enddo
            enddo
        enddo
        !$omp end target teams
! ++++
        CALL CPU_TIME ( time_end_1 ) ! aka time_begin_2
        ! serial compute matrix multiplication
        do j=1,n
            do i=1,n
                do k=1,n
                    c_serial(i,j) = c_serial(i,j) + a(i,k) * b(k,j)
                enddo
            enddo
        enddo
! ++++
        CALL CPU_TIME ( time_end_2 )
        ! verify result
        do j=1,n
            do i=1,n
                if (c_serial(i,j) .ne. c(i,j)) then
                    print *,'FAILED, i, j, c_serial(i,j), c(i,j) ', i, j, c_serial(i,j), c(i,j)
                    exit
                endif
            enddo
        enddo

        write(*,50)i1
50      Format('The test PASSED --- ok so this is the ',i3,'   wait for the tenth step!')
! ----  CALL CPU_TIME ( time_end )
        WRITE (*,200)i1, time_end_1 - time_begin, time_end_2 - time_end_1
        if(i1 .eq. 4 .and. (time_end_2 - time_begin .gt. 0.1)) then
            write(*,300)
300         Format(//"                  You crack me up, please it will not finish in your lifetime on this old computer, update human!"//"                 Can someone make a cup of tea please!",//)
200         Format('Loop :: ',i3,'  Time of operation was ',2F10.3,' seconds')
        endif
        DEALLOCATE(A, B, C, C_Serial,STAT = ALLOC_ERR)
100 end do

    end program matrix_multiply

Jim Dempsey

jimdempseyatthecove
Honored Contributor III

Also, keep in mind that the first call to an offload region may have additional overhead from injecting the code into the GPU, and potentially from performing the JIT (Just-In-Time) compilation of the pre-processed (intermediate) code.

As an experiment, enclose the DO 100 loop with a

DO 101 Pass = 1, 2
    print *, "Pass ", Pass
    ... ! the DO 100 loop from above goes here
101 END DO

 

At the small sizes, the serial version should benefit from L1 and L2 cache locality (at least for the CPU code). I am not sure how the cache relates to the GPU code on the integrated GPU of the NUC. In other words, are the NUC GPU "cores" part of the CPU cores, sharing the L1 and L2 caches, or independent cores with their own cache (or lack thereof)?

 

Jim Dempsey

JohnNichols
Valued Contributor III

Jim:

I will try your suggestion; it may take me a few hours, as the weather is hot and I am stuck in FEM.

The models in question are all linear. Pity.

John

Barbara_P_Intel
Employee

@JohnNichols, I'm glad you're testing the matmul and learning some things! Remove line 40 in your version. That was my error to have it there.

This is a sample code to demonstrate offloading. I haven't done any performance work with it on the CPU.

BTW... how are you compiling? Add /Qopenmp to have the !$OMP directives recognized and, at runtime, run on more than one thread. 
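For example, something like this on Windows (a sketch; exact options depend on your compiler version, and /Qopenmp-targets:spir64 is my assumption for letting the !$omp target regions offload to the GPU):

ifx /Qopenmp /Qopenmp-targets:spir64 matrix_multiply.f90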

 

JohnNichols
Valued Contributor III
 Number of CPU procs is            1
The test PASSED --- ok so this is the   1   wait for the tenth step!
Loop ::   1  Time of operation was      0.000     0.000 seconds
 matrix size           10
 Number of CPU procs is            1
The test PASSED --- ok so this is the   2   wait for the tenth step!
Loop ::   2  Time of operation was      0.000     0.000 seconds
 matrix size           40
 Number of CPU procs is            1
The test PASSED --- ok so this is the   3   wait for the tenth step!
Loop ::   3  Time of operation was      0.000     0.000 seconds
 matrix size          100
 Number of CPU procs is            1
The test PASSED --- ok so this is the   4   wait for the tenth step!
Loop ::   4  Time of operation was      0.078     0.125 seconds


                  You crack me up, please it will not finish in your lifetime on this old computer, update human!

                 Can someone make a cup of tea please!


 matrix size          200
 Number of CPU procs is            1
The test PASSED --- ok so this is the   5   wait for the tenth step!
Loop ::   5  Time of operation was      0.516     0.594 seconds
 matrix size          400
 Number of CPU procs is            1
The test PASSED --- ok so this is the   6   wait for the tenth step!
Loop ::   6  Time of operation was      4.188     3.156 seconds
 matrix size          700
 Number of CPU procs is            1
The test PASSED --- ok so this is the   7   wait for the tenth step!
Loop ::   7  Time of operation was     23.297    14.734 seconds
 matrix size         1000
 Number of CPU procs is            1
The test PASSED --- ok so this is the   8   wait for the tenth step!
Loop ::   8  Time of operation was     70.781    43.172 seconds
 matrix size         1500
 Number of CPU procs is            1
The test PASSED --- ok so this is the   9   wait for the tenth step!
Loop ::   9  Time of operation was    254.828   147.328 seconds
 matrix size         2000
 Number of CPU procs is            1
The test PASSED --- ok so this is the  10   wait for the tenth step!
Loop ::  10  Time of operation was    628.922   379.703 seconds
 matrix size         2600
 Number of CPU procs is            1
The test PASSED --- ok so this is the  11   wait for the tenth step!
Loop ::  11  Time of operation was   1429.422   864.047 seconds
Press any key to continue . . .

@jimdempseyatthecove, here are the results. Note there are 11 steps (I added one). How do you want to graph it?

 

@Barbara_P_Intel, thanks for the comment. I added the OpenMP option using the VS check box. Interestingly, if I move to a Release build it will not run.

 

jimdempseyatthecove
Honored Contributor III

You have the data to chart four lines: two passes of each method.

 

BTW, a formal test of matmul would be to add a host Fortran OpenMP version, as well as having a serial thread call the threaded MKL library to perform the matmul (highly optimized).

 

This would give one some guidance as to the method to use for the platform available. 
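Something along these lines (a sketch, assuming the same real(8) n x n arrays as in the sample and an MKL link line such as /Qmkl):

        ! host OpenMP version of the same triple loop
        !$omp parallel do private(i, j, k)
        do j=1,n
            do i=1,n
                do k=1,n
                    c(i,j) = c(i,j) + a(i,k) * b(k,j)
                enddo
            enddo
        enddo
        !$omp end parallel do

        ! MKL threaded version: computes c = 1.0*a*b + 0.0*c in one call
        call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)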

 

Jim Dempsey

JohnNichols
Valued Contributor III

Jim:

The first graph adds the data to my earlier graph; with an R-squared of 1, it is close to an exact cubic.

[attached graph: run time versus matrix size with cubic fit]
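(Roughly 1429 s / 2600^3 is about 8 x 10^-8 s per inner-loop update for the mapped version, and 864 s / 2600^3 about 5 x 10^-8 s for the serial one.)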

This next graph shows the relationship between the two sets: dead linear. The first set is slower; the second set runs in about 60% of the time.

[attached graph: first set versus second set run times, linear relationship]

 

I have some 3D data with some interesting properties. Have you seen a Fortran program to fit a least-squares surface to data?
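The kind of thing I mean, as a minimal sketch (the synthetic data, m, and the DGELS workspace size are placeholders; assumes LAPACK is available, e.g. through MKL):

program fit_surface
    implicit none
    integer, parameter :: m = 100            ! number of data points (placeholder)
    real(8) :: x(m), y(m), z(m)              ! the 3D data
    real(8) :: a(m,3), rhs(m), work(256)
    integer :: info, i

    ! synthetic data for demonstration: z = 2 + 0.5*x - 1.5*y
    do i = 1, m
        x(i) = mod(i, 10) * 1.0d0
        y(i) = (i / 10) * 1.0d0
        z(i) = 2.0d0 + 0.5d0*x(i) - 1.5d0*y(i)
    enddo

    ! design matrix for the plane z = c0 + c1*x + c2*y
    do i = 1, m
        a(i,1) = 1.0d0
        a(i,2) = x(i)
        a(i,3) = y(i)
    enddo
    rhs = z

    ! LAPACK least-squares solve; on return rhs(1:3) holds c0, c1, c2
    call dgels('N', m, 3, 1, a, m, rhs, m, work, size(work), info)
    if (info .ne. 0) print *, 'dgels failed, info = ', info
    print *, 'c0, c1, c2 = ', rhs(1:3)
end program fit_surface

Quadratic terms (x*x, x*y, y*y) would just be extra columns in the design matrix.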

 

 

 

jimdempseyatthecove
Honored Contributor III

>>first set is slow and second set is faster at 60% of the run time

Which is half expected, as I placed the CPU_TIME calls in the wrong place; also, the JIT (if any) and/or code-injection overhead should have occurred only once, on pass 1 with array size 4x4.

The pass loop should have been replicated to encapsulate each matmul method. In an actual application, it is likely that the matmul would be called more than once, and many times on the same allocated memory, so it is important to know the "do once" time versus the "do many times" time.

 

Jim Dempsey

JohnNichols
Valued Contributor III

Where do you want to move the CPU_TIME calls to?

 

Further to the 3D data: there appears to be a nice temperature dependence in it. After thinking about it further last night, I can trace the linear regression along an isotherm, which covers all the points bar one.

I can estimate the point at the different temperature and then look at the difference between the actual and estimated values, to get a guess at the temperature impact.

Not highly accurate but it is one way. 
