Barbara:
You put up a simple OpenMP program. I tried it on the NUC and it did not complete. I thought it was throwing an error, but that was my mistake.
I amended the program so it gives you some idea of how it can perform.
program matrix_multiply
use omp_lib
implicit none
integer :: i, j, k, myid, m, n, i1
integer alloc_err
real(8), allocatable, dimension(:,:) :: a, b, c, c_serial
REAL time_begin, time_end
do 100 i1 = 1, 11
if(i1 .eq. 1) then
n = 4
else if(i1 .eq. 2) then
n = 10
else if(i1 .eq. 3) then
n = 40
else if(i1 .eq. 4) then
n = 100
else if(i1 .eq. 5) then
n = 200
else if(i1 .eq. 6) then
n = 400
else if(i1 .eq. 7) then
n = 700
else if(i1 .eq. 8) then
n = 1000
else if(i1 .eq. 9) then
n = 1500
else if(i1 .eq. 10) then
n = 2000
else if(i1 .eq. 11) then
n = 2600
end if
CALL CPU_TIME ( time_begin )
myid = OMP_GET_THREAD_NUM()
if (myid .eq. 0) then
print *, 'matrix size ', n
print *, 'Number of CPU procs is ', OMP_GET_NUM_THREADS()
else
print *, 'Number of OpenMP Device Available:', omp_get_num_devices()
!$omp target
if (OMP_IS_INITIAL_DEVICE()) then
print *, ' Running on CPU'
else
print *, ' Running on GPU'
endif
!$omp end target
endif
allocate( a(n,n), b(n,n), c(n,n), c_serial(n,n))
! Initialize matrices
do j=1,n
do i=1,n
a(i,j) = i + j - 1
b(i,j) = i - j + 1
enddo
enddo
c = 0.0
c_serial = 0.0
!$omp target teams map(to: a, b) map(tofrom: c)
!$omp distribute parallel do SIMD private(j, i, k)
! parallel compute matrix multiplication.
do j=1,n
do i=1,n
do k=1,n
c(i,j) = c(i,j) + a(i,k) * b(k,j)
enddo
enddo
enddo
!$omp end target teams
! serial compute matrix multiplication
do j=1,n
do i=1,n
do k=1,n
c_serial(i,j) = c_serial(i,j) + a(i,k) * b(k,j)
enddo
enddo
enddo
! verify result
do j=1,n
do i=1,n
if (c_serial(i,j) .ne. c(i,j)) then
print *,'FAILED, i, j, c_serial(i,j), c(i,j) ', i, j, c_serial(i,j), c(i,j)
exit
endif
enddo
enddo
write(*,50)i1
50 Format('The test PASSED --- ok so this is the ',i3,' wait for the tenth step!')
CALL CPU_TIME ( time_end )
WRITE (*,200)i1, time_end - time_begin
if(i1 .eq. 4 .and. (time_end - time_begin .gt. 0.1)) then
write(*,300)
300 Format(//" You crack me up, please it will not finish in your lifetime on this old computer, update human!"//" Can someone make a cup of tea please!",//)
200 Format('Loop :: ',i3,' Time of operation was ',F10.3,' seconds')
endif
DEALLOCATE(A, B, C, C_Serial,STAT = ALLOC_ERR)
100 end do
end program matrix_multiply
The program runs on a single processor in cubic time in n, so by the time you get to n = 2600 it takes 38 minutes to run.
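As a rough sanity check of that cubic claim, here is a minimal sketch (the 130 s reference time is illustrative, not a measurement from my runs) of how a time at one size predicts the time at another under t(n) = c*n**3:
program cubic_scaling
  implicit none
  real(8), parameter :: n_ref = 1000.0d0, n_new = 2600.0d0
  real(8) :: t_ref, t_pred
  t_ref = 130.0d0                       ! illustrative: seconds at n = 1000
  t_pred = t_ref * (n_new / n_ref)**3   ! (2.6)**3 is about a 17.6x increase
  print '(a,f8.1,a,f5.1,a)', 'predicted time at n = 2600: ', t_pred, ' s (', t_pred/60.0d0, ' min)'
end program cubic_scaling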
I enclose a graph; I ran it three times, first as your program and then twice as the modified version.
Anyway, it was fun.
John,
Your CPU time interval includes:
a) Identification as to if CPU or GPU
b) allocation of arrays
c) initialization of arrays
d) the (potentially) parallel matrix multiplication
e) the serial matrix multiplication
f) the verify
As opposed to timing each of d) and e).
Why was this done?
Jim Dempsey
Jim:
I had tried the program on my old NUC, about six years old, and at a matrix size of 2600 it seemed never to return; after a few hundred seconds anyone is hitting keys to see what is happening. It appeared to be locked at Line 69, but I was wrong.
The program is intended as a sample, and what is the point of a sample that causes one to pull out one's remaining hair? So I amended the matrix size to 4 and it was fast. Then I tried a few bigger numbers, and as old Physics 101 cut in and I remembered Squires, I thought I had better time it. At first I got a parabola, but it was a poor predictor, so I collected longer runs and moved to a cubic.
My interest was in the whole process, not the OMP. It is just to show someone who uses the sample both the sample itself and the growth of the run time.
It was merely a suggestion for Barbara.
She must be on holidays, as she is normally a fast responder.
I will run your suggestion today.
John
I only had the NUC, as I could not spare my Dell; it is doing real work that earns real money. Plus the NUC is the worst machine anyone is likely to have at the moment.
It was just for fun in between writing an FE model generator and running said models.
Jim:
I tried your suggestion; the program ran consistently slower.
Here is the amended timing figure.
The blue line is the lost time for your suggestion in centiseconds, so I can plot it. Interestingly, n = 100 does not really fit the other data, and I ran it a few times to check. No idea why it is so much slower at 100.
Even if I manage to OpenMP the program, n = 2600 on a 10-core machine is still going to take 2 to 3 minutes.
John
program matrix_multiply
use omp_lib
implicit none
integer :: i, j, k, myid, m, n, i1
integer alloc_err
real(8), allocatable, dimension(:,:) :: a, b, c, c_serial
REAL time_begin, time_end_1, time_end_2
do 100 i1 = 1, 11
if(i1 .eq. 1) then
n = 4
else if(i1 .eq. 2) then
n = 10
else if(i1 .eq. 3) then
n = 40
else if(i1 .eq. 4) then
n = 100
else if(i1 .eq. 5) then
n = 200
else if(i1 .eq. 6) then
n = 400
else if(i1 .eq. 7) then
n = 700
else if(i1 .eq. 8) then
n = 1000
else if(i1 .eq. 9) then
n = 1500
else if(i1 .eq. 10) then
n = 2000
else if(i1 .eq. 11) then
n = 2600
end if
! --- CALL CPU_TIME ( time_begin ) ! do not time the initialization
myid = OMP_GET_THREAD_NUM()
if (myid .eq. 0) then
print *, 'matrix size ', n
print *, 'Number of CPU procs is ', OMP_GET_NUM_THREADS()
else
print *, 'Number of OpenMP Device Available:', omp_get_num_devices()
!$omp target
if (OMP_IS_INITIAL_DEVICE()) then
print *, ' Running on CPU'
else
print *, ' Running on GPU'
endif
!$omp end target
endif
allocate( a(n,n), b(n,n), c(n,n), c_serial(n,n))
! Initialize matrices
do j=1,n
do i=1,n
a(i,j) = i + j - 1
b(i,j) = i - j + 1
enddo
enddo
c = 0.0
c_serial = 0.0
! +++++++ time just the matmul
CALL CPU_TIME ( time_begin )
!$omp target teams map(to: a, b) map(tofrom: c)
!$omp distribute parallel do SIMD private(j, i, k)
! parallel compute matrix multiplication.
do j=1,n
do i=1,n
do k=1,n
c(i,j) = c(i,j) + a(i,k) * b(k,j)
enddo
enddo
enddo
!$omp end target teams
! ++++
CALL CPU_TIME ( time_end_1 ) ! aka time_begin_2
! serial compute matrix multiplication
do j=1,n
do i=1,n
do k=1,n
c_serial(i,j) = c_serial(i,j) + a(i,k) * b(k,j)
enddo
enddo
enddo
! ++++
CALL CPU_TIME ( time_end_2 )
! verify result
do j=1,n
do i=1,n
if (c_serial(i,j) .ne. c(i,j)) then
print *,'FAILED, i, j, c_serial(i,j), c(i,j) ', i, j, c_serial(i,j), c(i,j)
exit
endif
enddo
enddo
write(*,50)i1
50 Format('The test PASSED --- ok so this is the ',i3,' wait for the tenth step!')
! ---- CALL CPU_TIME ( time_end )
WRITE (*,200)i1, time_end_1 - time_begin, time_end_2 - time_end_1
if(i1 .eq. 4 .and. (time_end_2 - time_begin .gt. 0.1)) then ! time_end no longer exists; use time_end_2
write(*,300)
300 Format(//" You crack me up, please it will not finish in your lifetime on this old computer, update human!"//" Can someone make a cup of tea please!",//)
200 Format('Loop :: ',i3,' Time of operation was ',2F10.3,' seconds')
endif
DEALLOCATE(A, B, C, C_Serial,STAT = ALLOC_ERR)
100 end do
end program matrix_multiply
Jim Dempsey
Also, keep in mind that the first call to an offload region "may" have additional overhead from injecting the code into the GPU and potentially from performing JIT (Just In Time) compilation of the pre-processed (intermediate) code.
As an experiment, enclose the DO 100 loop with a
DO 101 Pass=1,2
print *,"Pass ", Pass
... ! DO 100 loop
101 END DO
At the small sizes, the serial version should benefit from L1 and L2 cache locality (at least for the CPU code). I am not sure how the cache relates to the GPU code on the integrated GPU of the NUC. In other words, are the NUC GPU "cores" part of the CPU cores, sharing the L1 and L2 caches, or independent cores with their own cache (or lack thereof)?
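For what it is worth, here is a minimal self-contained sketch of that experiment (my assumptions: ifx with OpenMP offload enabled; omp_get_wtime is used because CPU_TIME does not count time the host spends waiting on the device). Pass 1 should be noticeably slower than the later passes if JIT or code injection is happening:
program first_launch
  use omp_lib
  implicit none
  integer :: pass
  real(8) :: t0, t1, x
  x = 0.0d0
  do pass = 1, 3
    t0 = omp_get_wtime()
    ! an almost-empty target region: any large pass-1 time is launch/JIT overhead
    !$omp target map(tofrom: x)
    x = x + 1.0d0
    !$omp end target
    t1 = omp_get_wtime()
    print '(a,i0,a,f12.6,a)', 'target launch ', pass, ': ', t1 - t0, ' s'
  end do
end program first_launch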
Jim Dempsey
Jim:
I will try your suggestion; it may take me a few hours, as the weather is hot and I am stuck in FEM.
The models are all linear. Pity.
John
@JohnNichols, I'm glad you're testing the matmul and learning some things! Remove line 40 in your version. That was my error to have it there.
This is a sample code to demonstrate offloading. I haven't done any performance work with it on the CPU.
BTW... how are you compiling? Add /Qopenmp to have the !$OMP directives recognized and, at runtime, run on more than one thread.
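For reference, a minimal sketch of the command line (assuming ifx on Windows; the second option is only needed if the !$omp target regions should offload to the GPU rather than fall back to the CPU):
ifx /Qopenmp /Qopenmp-targets:spir64 matrix_multiply.f90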
matrix size 4
Number of CPU procs is 1
The test PASSED --- ok so this is the 1 wait for the tenth step!
Loop :: 1 Time of operation was 0.000 0.000 seconds
matrix size 10
Number of CPU procs is 1
The test PASSED --- ok so this is the 2 wait for the tenth step!
Loop :: 2 Time of operation was 0.000 0.000 seconds
matrix size 40
Number of CPU procs is 1
The test PASSED --- ok so this is the 3 wait for the tenth step!
Loop :: 3 Time of operation was 0.000 0.000 seconds
matrix size 100
Number of CPU procs is 1
The test PASSED --- ok so this is the 4 wait for the tenth step!
Loop :: 4 Time of operation was 0.078 0.125 seconds
You crack me up, please it will not finish in your lifetime on this old computer, update human!
Can someone make a cup of tea please!
matrix size 200
Number of CPU procs is 1
The test PASSED --- ok so this is the 5 wait for the tenth step!
Loop :: 5 Time of operation was 0.516 0.594 seconds
matrix size 400
Number of CPU procs is 1
The test PASSED --- ok so this is the 6 wait for the tenth step!
Loop :: 6 Time of operation was 4.188 3.156 seconds
matrix size 700
Number of CPU procs is 1
The test PASSED --- ok so this is the 7 wait for the tenth step!
Loop :: 7 Time of operation was 23.297 14.734 seconds
matrix size 1000
Number of CPU procs is 1
The test PASSED --- ok so this is the 8 wait for the tenth step!
Loop :: 8 Time of operation was 70.781 43.172 seconds
matrix size 1500
Number of CPU procs is 1
The test PASSED --- ok so this is the 9 wait for the tenth step!
Loop :: 9 Time of operation was 254.828 147.328 seconds
matrix size 2000
Number of CPU procs is 1
The test PASSED --- ok so this is the 10 wait for the tenth step!
Loop :: 10 Time of operation was 628.922 379.703 seconds
matrix size 2600
Number of CPU procs is 1
The test PASSED --- ok so this is the 11 wait for the tenth step!
Loop :: 11 Time of operation was 1429.422 864.047 seconds
Press any key to continue . . .
@jimdempseyatthecove, here are the results; note there are 11 steps (I added one). How do you want to graph it?
@Barbara_P_Intel, thanks for the comment. I added the OpenMP option using the VS check box. Interestingly, if I move to Release it will not run.
You have the data to chart 4 lines: two passes, each method.
BTW, a formal test of matmul would be to add Fortran OpenMP code as well as having a serial thread call the MKL threaded library to perform the matmul (highly optimized).
This would give one some guidance as to the method to use for the platform available.
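A minimal sketch of that MKL comparison (my assumptions: link against MKL, e.g. with /Qmkl, and call the standard BLAS DGEMM interface; the size and timing scaffolding are illustrative):
program mkl_matmul
  use omp_lib
  implicit none
  integer, parameter :: n = 1000
  real(8), allocatable :: a(:,:), b(:,:), c(:,:)
  real(8) :: t0, t1
  allocate(a(n,n), b(n,n), c(n,n))
  call random_number(a); call random_number(b); c = 0.0d0
  t0 = omp_get_wtime()
  ! C := 1.0*A*B + 0.0*C, computed by MKL's threaded DGEMM
  call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
  t1 = omp_get_wtime()
  print '(a,f10.3,a)', 'dgemm time: ', t1 - t0, ' s'
end program mkl_matmul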
Jim Dempsey
Jim:
The first graph adds the data to my earlier graph; with an R-squared of 1, it is close to an exact cubic.
The next graph shows the relationship: dead linear.
The first set is slow and the second set is faster, at 60% of the run time.
I have some 3D data with some interesting properties. Have you seen a Fortran program to fit a least-squares surface to data?
>>first set is slow and second set is faster at 60% of the run time
That is half expected, as I placed the CPU_TIME calls in the wrong place; also, the JIT (if any) and/or code-injection overhead should have occurred once, on pass 1 at array size 4x4.
The pass loop should have been replicated to encapsulate each matmul method separately. In an actual application, it is likely that the matmul method would be called more than once, and many times on the same allocated memory. It is important to know the "do once" time versus the "do many times" time.
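A sketch of that replication for the offload method alone (my assumptions: the same triple-loop kernel as the thread's program, the directives collapsed into one combined construct, and omp_get_wtime instead of CPU_TIME, since waiting on the device is wall time, not CPU time). Pass 1 gives the "do once" cost; pass 2 approximates the steady-state "do many times" cost:
program pass_timing
  use omp_lib
  implicit none
  integer, parameter :: n = 1000
  integer :: pass, i, j, k
  real(8), allocatable :: a(:,:), b(:,:), c(:,:)
  real(8) :: t0, t1
  allocate(a(n,n), b(n,n), c(n,n))
  call random_number(a); call random_number(b)
  do pass = 1, 2
    c = 0.0d0
    t0 = omp_get_wtime()
    !$omp target teams distribute parallel do simd map(to: a, b) map(tofrom: c) private(i, k)
    do j = 1, n
      do i = 1, n
        do k = 1, n
          c(i,j) = c(i,j) + a(i,k) * b(k,j)
        end do
      end do
    end do
    t1 = omp_get_wtime()
    print '(a,i0,a,f10.3,a)', 'offload pass ', pass, ': ', t1 - t0, ' s'
  end do
end program pass_timing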
Jim Dempsey
Where do you want to move the CPU time to?
Further to the 3D data: there appears to be a nice temperature dependence in my data. After thinking about it further last night, I can trace the linear regression along an isotherm, which covers all the points bar one.
I can estimate the point at the different temperature and then look at the difference between the actual and estimated values, to get a guess at the temperature impact.
Not highly accurate, but it is one way.
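A minimal sketch of that estimate (all data values hypothetical): fit a least-squares line to the on-isotherm points, predict the off-temperature point, and take the residual as a guess at the temperature effect:
program isotherm_residual
  implicit none
  integer, parameter :: n = 5
  real(8) :: x(n), y(n), sx, sy, sxx, sxy, slope, intercept
  real(8) :: x_off, y_off, y_hat
  x = [1.0d0, 2.0d0, 3.0d0, 4.0d0, 5.0d0]    ! hypothetical on-isotherm points
  y = [2.1d0, 3.9d0, 6.2d0, 7.8d0, 10.1d0]
  x_off = 3.5d0; y_off = 8.4d0               ! the one point at a different temperature
  ! normal equations for the least-squares line y = intercept + slope*x
  sx = sum(x); sy = sum(y); sxx = sum(x*x); sxy = sum(x*y)
  slope     = (n*sxy - sx*sy) / (n*sxx - sx*sx)
  intercept = (sy - slope*sx) / n
  y_hat = intercept + slope*x_off
  print '(a,f8.3)', 'residual (actual - estimated): ', y_off - y_hat
end program isotherm_residual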