hi,
i was creating a simple benchmark program and i ran into something really weird with ifx. i was getting up to 1.5 gflops using ifort. with ifx it drops way down to around 0.2 gflops. on other codes i wrote, the performance was about the same or a little better with ifx. so this is really odd.
c7.bat and c8.bat are the ifx compiler options i tried. c.bat is for sse2. the rest are avx-512. the batch files with an "a" added to the name try some compiler options related to memory.
i attached the program. it's Fortran 77. however, i ran into the windows 2gb memory limit. there was odd performance behavior with the size of the matrix as well. 1000x1000 seems to work best. my cpu has a lot of variability, so i have it repeat the tests 5 times.
not sure if it matters, but my cpu is the intel core i5-1135g7. i have windows 10 home edition, which is up to date with patches etc.
there are a number of limitations with the program, due to my lack of programming skills. i have no doubt it could be made a lot better. it does what i need though. it also helps me to find the best compiler options.
anthony
ifx is not as mature as ifort. Many of the optimizations implemented in ifort have yet to be implemented in ifx.
The object files of ifort are compatible with the object files of ifx. This means that, by using VTune, you can identify the under-performing code in the ifx build, and then split your solution (or the equivalent Linux build) into two libraries to be linked with an ifx main program: one library built by ifx and the other by ifort. Note that the ifort-compiled code cannot use features provided only by ifx (for example, !$omp target ...).
Jim Dempsey
I installed oneAPI, including the HPC toolkit, on a computer yesterday, and this morning there is already a new update.
It might be worth trying.
Seems strange.
oh, i greatly screwed up the math and the calculation of gflops. so don't use the code as is for anything substantial. however, it still works to show the weird performance drop from ifort to ifx.
Here is a modified version of Anthony's test program, which may make it easier for someone to investigate why the EXE generated by ifort is so much faster than the one generated by ifx. The test program multiplies a lower triangular 2000 X 2000 matrix with a vector 5000 times, varying the nonzero values from one iteration to the next.
On a PC with an i7-10710U CPU, the former runs in 2.8 s, whereas the latter runs in 21.5 s; the ratio is 7.7. I used the current versions of the compilers (ifort 2021.8.0 and ifx 2023.0.0) with command line options /traceback /MD /Qxhost /fast .
I also tested a version (not shown here) in which the innermost DO loops were replaced by array assignment, and for that version both compilers produced EXEs that took about 20 seconds to run.
program fpubench
   implicit none
   integer, parameter :: maxsiz = 2000, nrep = 5000
   real*8 a(maxsiz,maxsiz), b(maxsiz), c(maxsiz,maxsiz)
   integer dinfo, i, j, mtype, nsiz, nl, irep, rl
   real*8 cputimesecs, finish, start, gflop, gflops
   character*12 :: stars = "(1x,80('*'))"
   mtype = 1
   dinfo = 1
   nsiz = maxsiz
   print stars
   write (*,'(1x,a,i0,a,i0)') 'Running ',nrep, &
      ' times with Matrix size ', nsiz
   ! Begin Computational Kernel
   call cpu_time(start)
   do irep = 1, nrep
      do j = 1, nsiz
         b(j) = dble(j)
         do i = 1, nsiz
            if ( i < j ) then
               a(i,j) = 0.0d0
               c(i,j) = 0.0d0
            else
               a(i,j) = dble(2*i+3*j) + 0.005*irep
               c(i,j) = a(i,j)*b(j)
            endif
         enddo
      enddo
      !call check(nsiz, c)
   enddo
   call cpu_time(finish)
   call check(nsiz, c)
   ! End Computational Kernel
   cputimesecs = finish - start
   gflop = real(nsiz*nsiz,8)*nrep/1.0d9
   if ( cputimesecs/=0.0d0 ) then
      gflops = gflop/cputimesecs
   else
      gflops = -999.0
   endif
   write (*,*) 'lower triangular matrix'
   write (*,'(3(1x,a,g12.3))') 'a = ', a(1,1), 'b = ', b(1), 'c = ', c(1,1)
   write (*,'(3(1x,a,g12.3))') 'a = ', a(nsiz,1), 'b = ', b(1), 'c = ', c(nsiz,1)
   write (*,'(1x,a,f10.3)') 'total number of floating point calcs (gflop): ', gflop
   write (*,'(1x,a,f10.3)') 'total cpu time (seconds); ', cputimesecs
   write (*,'(1x,a,f10.3)') 'floating point ops per sec (gflops); ', gflops
   print stars
   write (*,*) 'fpu_bench has finished running'
end program

subroutine check(n,c)
   implicit none
   integer, intent(in) :: n
   real*8, intent(in) :: c(n,n)
   integer i,j,nrep
   nrep = 5000
   do j = 1, n
      do i = 1, n
         if(j > i)then
            if(c(i,j) /= 0.0d0)print '(1x,2i5,es12.4)',i,j,c(i,j)
         else
            if(c(i,j) /= dble((2*i+3*j+0.005*nrep)*j))print '(1x,2i5,es12.4)',i,j,c(i,j)
         endif
      enddo
   enddo
   return
end subroutine
Trying the program on my DELL with an additional timing for the overall program.
program fpubench
   implicit none
   integer, parameter :: maxsiz = 2000, nrep = 5000
   real*8 a(maxsiz,maxsiz), b(maxsiz), c(maxsiz,maxsiz)
   integer dinfo, i, j, mtype, nsiz, nl, irep, rl
   real*8 cputimesecs, finish, start, gflop, gflops
   character*12 :: stars = "(1x,80('*'))"
   mtype = 1
   dinfo = 1
   nsiz = maxsiz
   print stars
   call DATTIM(0)
   write (*,'(1x,a,i0,a,i0)') 'Running ',nrep, &
      ' times with Matrix size ', nsiz
   ! Begin Computational Kernel
   call cpu_time(start)
   do irep = 1, nrep
      !if(mod(irep,1000) .eq. 1) then
      !write(*,*)irep
      !endif
      do j = 1, nsiz
         b(j) = dble(j)
         do i = 1, nsiz
            if ( i < j ) then
               a(i,j) = 0.0d0
               c(i,j) = 0.0d0
            else
               a(i,j) = dble(2*i+3*j) + 0.005*irep
               c(i,j) = a(i,j)*b(j)
            endif
         enddo
      enddo
      !call check(nsiz, c)
   enddo
   call cpu_time(finish)
   call check(nsiz, c)
   ! End Computational Kernel
   cputimesecs = finish - start
   gflop = real(nsiz*nsiz,8)*nrep/1.0d9
   if ( cputimesecs/=0.0d0 ) then
      gflops = gflop/cputimesecs
   else
      gflops = -999.0
   endif
   write (*,*) 'lower triangular matrix'
   write (*,'(3(1x,a,g12.3))') 'a = ', a(1,1), 'b = ', b(1), 'c = ', c(1,1)
   write (*,'(3(1x,a,g12.3))') 'a = ', a(nsiz,1), 'b = ', b(1), 'c = ', c(nsiz,1)
   write (*,'(1x,a,f10.3)') 'total number of floating point calcs (gflop): ', gflop
   write (*,'(1x,a,f10.3)') 'total cpu time (seconds); ', cputimesecs
   write (*,'(1x,a,f10.3)') 'floating point ops per sec (gflops); ', gflops
   print stars
   write (*,*) 'fpu_bench has finished running'
   call DATTIM(1)
end program

subroutine check(n,c)
   implicit none
   integer, intent(in) :: n
   real*8, intent(in) :: c(n,n)
   integer i,j,nrep
   nrep = 5000
   do j = 1, n
      do i = 1, n
         if(j > i)then
            if(c(i,j) /= 0.0d0)print '(1x,2i5,es12.4)',i,j,c(i,j)
         else
            if(c(i,j) /= dble((2*i+3*j+0.005*nrep)*j))print '(1x,2i5,es12.4)',i,j,c(i,j)
         endif
      enddo
   enddo
   return
end subroutine
! ****************************************************************
!
SUBROUTINE DATTIM(I)
!
! ***************************************************************
   ! Call with I = 0 to record the start time and with I = 1 to print
   ! the elapsed time. GETTIM is an Intel Fortran portability routine.
   Implicit none
   CHARACTER CH
   INTEGER*2 IHR,IMIN,ISEC,I100TH,ISEC1,I100TH1
   INTEGER*2 IMIN_OLD,ISEC_OLD,I100TH_OLD
   INTEGER*2 IMIN_TOT,ISEC_TOT,I100TH_TOT
   INTEGER I
   ! I100TH_OLD added to the COMMON block so that it survives between calls
   COMMON /TIMER1/ IMIN_OLD,ISEC_OLD,I100TH_OLD
   CH=CHAR(32)
   CALL GETTIM (IHR,IMIN,ISEC,I100TH)
   IF(I .EQ. 0) THEN
      IMIN_OLD = IMIN
      ISEC_OLD = ISEC
      I100TH_OLD = I100TH
   ENDIF
   IF(I .EQ. 1) THEN
      IF(IMIN .LT. IMIN_OLD) THEN
         IMIN = IMIN+60
         IHR = IHR - 1
      ENDIF
      IMIN_TOT = IMIN - IMIN_OLD
      IF(I100TH .LT. I100TH_OLD) THEN
         I100TH1 = I100TH+100
         ISEC = ISEC - 1
      ELSE
         I100TH1 = I100TH
      ENDIF
      I100TH_TOT = I100TH1 - I100TH_OLD
      IF(ISEC .LT. ISEC_OLD) THEN
         ISEC1 = ISEC+60
         IMIN_TOT = IMIN_TOT - 1
      ELSE
         ISEC1 = ISEC
      ENDIF
      ISEC_TOT = ISEC1 - ISEC_OLD
      WRITE(*,'(1X,1A,A15\)')'Elapsed Time : '
      WRITE(*,'( I2.2,1H:,I2.2,1H:,I2.2)')IMIN_TOT,ISEC_TOT,I100TH_TOT
      ! The *_OLD values are the start time; IMIN and ISEC may have been
      ! adjusted by the borrow logic above, so the finish time is approximate.
      WRITE(*,'(1X,A15\)')'Start Time : '
      WRITE(*,'( I2.2,1H:,I2.2,1H:,I2.2)')IMIN_OLD,ISEC_OLD,I100TH_OLD
      WRITE(*,'(1X,A15\)')'Finish Time : '
      WRITE(*,'( I2.2,1H:,I2.2,1H:,I2.2)')IMIN,ISEC,I100TH
      WRITE(*,'(1X,A15\)')'Ratio Time : '
   ENDIF
   RETURN
END
Adding an old timer routine, I get the following on a Dell with an 11th Gen Intel(R) Core(TM) i7-11850H @ 2.50GHz.
The Core i7-10710U has a PassMark score of 9848.
The Core i7-11850H has a PassMark score of 20,867.
I am running a 64-bit release build on Windows 11 Preview. I got one run at 3.9 seconds, but it had a high overall time, so I discarded it.
I am running the 2023 Fortran compilers.
IFX runs in 22+ seconds.
I wonder if W11 is the difference; your machine is a bit faster.
Running your program with IFORT takes 4.5 seconds.
For some reason, the forum will not let me edit my previous post. The check subroutine there should have j+ rather than j++ .
For a couple of days now, I have noticed that the code display uses interspersed fonts of more than one size, and that the line number display has a spacing slightly larger than the spacing between lines of code, all of which makes life more exciting than necessary.
The timings shown by JohnNichols confirm that for this program the IFX compiled EXE takes many times (~5) longer than the EXE compiled with IFORT.
Are you running on LINUX? This might explain the difference between your results and mine.
The interesting thing to explain to masters students is that experiments need to be repeated; there is no point doing an "A Bridge Too Far" exercise, only to find out the 82nd stuffed it at Nijmegen and you are hanging out in enemy land.
I define enemy land as any reviewer with a chip on their shoulder, i.e., all reviewers.
I ran it in the CMD Window for IFORT 64 bit, it ran 0.3 seconds slower on average.
How did you get to 2.8 seconds?
hi,
just a reminder that the code i posted was to highlight an oddity with ifx. I WOULD NOT USE IT TO ACTUALLY BENCHMARK ANYTHING. i spent another few days on it. however, it's a lost cause. i don't know if any of the mods people did made it so you could use it. the code i posted and the edits i made definitely are not making any sense.
i thought it was going to be fairly simple to make something to measure peak GFLOPS. however, that didn't turn out to be the case. so i'm not going to worry about it. it's not worth the trouble.
anthony
This forum answers interesting questions, but like any group it can become interested in elements of the problem that have little to do with the original question. Your stuff has morphed; it happens. Just watch the interest, you can learn a lot. I know I learn more from these guys than anywhere else.
Now I am interested, vaguely, in why @mecej4's computer solves it in 2.8 seconds and mine takes about double that on average. Based on the theoretical stats it should not be that way, which makes it an interesting problem of about Winnie the Pooh level of importance.
JMN
Nothing special. I compile using
ifort /fast /Qxhost /MD fpubench.f90
I run the resulting EXE once or twice so that the runtime DLLs are located and loaded. I time the next run:
S:\FPUBEN>c:\dos\timethis fpubench.exe
********************************************************************************
Running 5000 times with Matrix size 2000
lower triangular matrix
a = 30.0 b = 1.00 c = 30.0
a = 0.403E+04 b = 1.00 c = 0.403E+04
total number of floating point calcs (gflop): 20.000
total cpu time (seconds); 2.812
floating point ops per sec (gflops); 7.111
********************************************************************************
fpu_bench has finished running
NSNUC-i7V10 Elapsed time: 2.883 s
Command : fpubench.exe
S: Mon Apr 3 16:53:50 2023
F: Mon Apr 3 16:53:53 2023
Note that the timing measurement inside the program itself (using CPU_TIME) is not very accurate; here it is 2.812 seconds. The last four lines of the output are printed by the "timethis" utility program, which uses the Windows QueryPerformanceCounter API. The time reported is about 0.07 second higher, because it includes program startup and termination times.
I tried with the additional option /Qparallel, and got an elapsed time of 1.2 s. My PC is set to run on the Balanced power plan. I have turned off a few utilities that keep checking the web for updates, such as Adobe, Google, the Intel DSA tool, etc.
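For timing inside the program itself, one option is to read the wall clock with the standard SYSTEM_CLOCK intrinsic alongside CPU_TIME. The following is only a minimal sketch, not the code used for the numbers above: the module name wall_timer, the function elapsed_seconds, and the loop sizes are made up for illustration, and the resolution of SYSTEM_CLOCK depends on the compiler and operating system. In a threaded build (for example with /Qparallel), CPU_TIME may report processor time summed over threads, so the wall-clock figure is usually the one to compare against an external timer.

```fortran
module wall_timer
  use, intrinsic :: iso_fortran_env, only: int64, real64
  implicit none
contains
  ! Convert two SYSTEM_CLOCK readings into elapsed seconds.
  pure function elapsed_seconds(t0, t1, rate) result(secs)
    integer(int64), intent(in) :: t0, t1, rate
    real(real64) :: secs
    secs = real(t1 - t0, real64) / real(rate, real64)
  end function elapsed_seconds
end module wall_timer

program time_kernel
  use, intrinsic :: iso_fortran_env, only: int64, real64
  use wall_timer, only: elapsed_seconds
  implicit none
  integer, parameter :: n = 1000, nrep = 200   ! hypothetical sizes, kept small
  real(real64), allocatable :: a(:,:)
  real(real64) :: cpu0, cpu1
  integer(int64) :: t0, t1, rate
  integer :: i, j, irep

  allocate (a(n,n))
  call system_clock(count_rate=rate)   ! clock ticks per second
  call system_clock(t0)                ! wall-clock start
  call cpu_time(cpu0)                  ! processor-time start
  do irep = 1, nrep
    do j = 1, n
      do i = 1, n
        a(i,j) = real(2*i + 3*j, real64) + 0.005d0*irep
      end do
    end do
  end do
  call cpu_time(cpu1)
  call system_clock(t1)

  print '(1x,a,f10.3)', 'wall time (s): ', elapsed_seconds(t0, t1, rate)
  print '(1x,a,f10.3)', 'cpu  time (s): ', cpu1 - cpu0
  ! Print one element so the compiler cannot discard the loop as dead code.
  print '(1x,a,g12.5)', 'a(n,n)       : ', a(n,n)
end program time_kernel
```

Using 64-bit integers for the SYSTEM_CLOCK arguments generally selects the finest clock the runtime offers; with default integers the tick rate can be much coarser.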
yeah, the timing is a huge problem. i made a different thread about that. can you post how to do an accurate timing, either to this thread or the other. i'm not seeing how you did that in the code you posted earlier.
understood, i just didn't want anyone thinking that the code i posted worked. it doesn't. i can't speak for anyone's mods though. i've tried everything i can but haven't got the correct results so far. i think it's a time measurement issue. i made a different thread about that. if i ever get it working, i can post an update. right now, the version of the code i have is off by roughly 5x. i no longer have the original code i posted, but i'm sure it's off too, just not sure by how much. i changed the math to be what i originally intended, which was to maximize the fpu operations. on my processor that's supposed to be 32 flop per cycle. when i do that, i'm 5x lower than i should be. i'm sure the processor is right; other benchmarks report the correct number.
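The comparison described here can be written down as simple arithmetic: theoretical peak = flop per cycle x clock in GHz x cores used. The sketch below only illustrates that formula. The 32 flop per cycle is the figure quoted in this post; the clock value and the measured value are placeholders to be replaced with real numbers, and the names peak_model and peak_gflops are invented for the example. The actual clock under a heavy vector load can differ from the nominal one, so the result is an upper bound rather than a target.

```fortran
module peak_model
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
contains
  ! Theoretical peak in GFLOPS = flop/cycle x clock (GHz) x cores used.
  pure function peak_gflops(flop_per_cycle, clock_ghz, ncores) result(peak)
    real(real64), intent(in) :: flop_per_cycle, clock_ghz
    integer, intent(in) :: ncores
    real(real64) :: peak
    peak = flop_per_cycle * clock_ghz * real(ncores, real64)
  end function peak_gflops
end module peak_model

program peak_estimate
  use, intrinsic :: iso_fortran_env, only: real64
  use peak_model, only: peak_gflops
  implicit none
  real(real64), parameter :: flop_per_cycle = 32.0_real64  ! figure quoted in the post
  real(real64), parameter :: clock_ghz = 2.4_real64        ! ASSUMPTION: placeholder clock
  integer, parameter :: ncores = 1                         ! a serial loop uses one core
  real(real64), parameter :: measured = 10.0_real64        ! ASSUMPTION: placeholder result
  real(real64) :: peak

  peak = peak_gflops(flop_per_cycle, clock_ghz, ncores)
  print '(1x,a,f8.2)', 'theoretical peak (gflops): ', peak
  print '(1x,a,f8.2)', 'measured         (gflops): ', measured
  print '(1x,a,f8.2)', 'peak / measured          : ', peak/measured
end program peak_estimate
```

A kernel only approaches this bound if it keeps its data in registers or cache. The test program earlier in the thread writes two 2000 x 2000 double precision arrays (32 MB each) per repetition, so it may be limited by memory traffic rather than by the floating point units.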
ps
the intel forum is terrible. as i try to type this reply, it keeps jumping and scrolling to odd locations. it's **bleep** near impossible to use.
@prop_design wrote:
yeah, the timing is a huge problem. i made a different thread about that. can you post how to do an accurate timing, either to this thread or the other. i'm not seeing how you did that in the code you posted earlier.
Here is the source for a utility program, "timethis", adapted from source code published on MS web pages over a decade ago. Compile and link with one of the following commands:
icl /O2 /MD timethis.c ws2_32.lib kernel32.lib
cl /O2 /MD timethis.c ws2_32.lib kernel32.lib
If the program that you wish to time takes arguments, wrap quotes around the command line, as in:
timethis "MyProg.exe arg1 arg2"
@mecej4 thanks for sharing what you did. unfortunately, that's way over my head. i tried to study it some and ended up finding an old forum post on here. i was able to use that to get the time thing working. however, it still didn't fix the problem i'm having at the moment, so i'll have to look for other possible problems. in any event, i appreciate the help. i'll post what i did on the other thread, as far as the time hack for fortran on windows.
Ran some tests
2023.0.0 package compilers ifort 2021.8.0 ifx 2023.0.0
Base options for both:
-O2 -xhost -align array64byte
ifort: 3.1sec
ifx: 21.6sec. !woof
Switching to the 2023.1.0 compiler ifort 2021.9.0. ifx 2023.1.0
ifort: 3.1sec
ifx: 21.0sec !still woof
I agree 7x slower is unacceptable. If I had to guess, and I'll confirm this, IFX is not creating efficient masked vector operations for the IF inside the loop. Off the top of my head, I can't think of anything else that would cause such a discrepancy.
I'll open a bug report
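One way to probe the guess about the IF inside the loop is to compare the kernel as posted with a variant in which the branch is removed by splitting the inner loop at the diagonal. The sketch below does only that: it checks that the two forms produce the same c, and it is not a confirmed fix or the compiler team's diagnosis. The names tri_kernels, fill_branchy, and fill_split, and the size n = 500, are invented for the example. Timing each routine separately under ifort and ifx would show whether the branch matters on a given machine.

```fortran
module tri_kernels
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
contains
  ! Kernel as posted in the thread: IF inside the innermost loop.
  subroutine fill_branchy(n, irep, a, b, c)
    integer, intent(in) :: n, irep
    real(real64), intent(out) :: a(n,n), b(n), c(n,n)
    integer :: i, j
    do j = 1, n
      b(j) = real(j, real64)
      do i = 1, n
        if (i < j) then
          a(i,j) = 0.0_real64
          c(i,j) = 0.0_real64
        else
          a(i,j) = real(2*i + 3*j, real64) + 0.005*irep
          c(i,j) = a(i,j)*b(j)
        end if
      end do
    end do
  end subroutine fill_branchy

  ! Same arithmetic, with the IF removed by splitting the i range at i = j.
  subroutine fill_split(n, irep, a, b, c)
    integer, intent(in) :: n, irep
    real(real64), intent(out) :: a(n,n), b(n), c(n,n)
    integer :: i, j
    do j = 1, n
      b(j) = real(j, real64)
      do i = 1, j - 1              ! strictly above the diagonal: zeros
        a(i,j) = 0.0_real64
        c(i,j) = 0.0_real64
      end do
      do i = j, n                  ! diagonal and below: computed values
        a(i,j) = real(2*i + 3*j, real64) + 0.005*irep
        c(i,j) = a(i,j)*b(j)
      end do
    end do
  end subroutine fill_split
end module tri_kernels

program compare_kernels
  use, intrinsic :: iso_fortran_env, only: real64
  use tri_kernels, only: fill_branchy, fill_split
  implicit none
  integer, parameter :: n = 500   ! hypothetical size, small enough to run quickly
  real(real64), allocatable :: a1(:,:), b1(:), c1(:,:), a2(:,:), b2(:), c2(:,:)

  allocate (a1(n,n), b1(n), c1(n,n), a2(n,n), b2(n), c2(n,n))
  call fill_branchy(n, 1, a1, b1, c1)
  call fill_split(n, 1, a2, b2, c2)
  print '(1x,a,es12.4)', 'max |c_branchy - c_split| = ', maxval(abs(c1 - c2))
end program compare_kernels
```

The split form keeps explicit DO loops, so it is a different experiment from replacing the inner loop with array assignments.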
Thanks, Ron. As part of investigating this performance bug, it may be worthwhile to investigate a related performance issue, i.e., why replacing the innermost DO loop of the computational kernel with array assignments leads to slowing down by a factor of 5 to 6 (with Ifort as well as with Ifx), as I alluded to earlier. Here is the array-assignment version of the same test code. Thanks.
program fpubench
   implicit none
   integer, parameter :: maxsiz = 2000, nrep = 5000, dp = kind(0.0d0)
   double precision a(maxsiz,maxsiz), b(maxsiz), c(maxsiz,maxsiz)
   integer i, j, nsiz, irep
   double precision cputimesecs, finish, start, gflop, gflops
   character*12 :: stars = "(1x,80('*'))"
   nsiz = maxsiz
   print stars
   write (*,'(1x,a,i0,a,i0)') 'Running ',nrep, &
      ' times with Matrix size ', nsiz
   ! Begin Computational Kernel
   call cpu_time(start)
   do irep = 1, nrep
      do j = 1, nsiz ! innermost DO loop replaced by array assignments
         b(j) = dble(j)
         a(1:j-1,j) = 0.0d0
         c(1:j-1,j) = 0.0d0
         a(j:nsiz,j) = [(dble(2*i+3*j),i=j,nsiz)] + 0.005*irep
         c(j:nsiz,j) = a(j:nsiz,j)*b(j)
      enddo
      !call check(nsiz, c)
   enddo
   call cpu_time(finish)
   call check(nsiz, c)
   ! End Computational Kernel
   cputimesecs = finish - start
   gflop = real(nsiz*nsiz,dp)*nrep/1.0d9
   if ( cputimesecs/=0.0d0 ) then
      gflops = gflop/cputimesecs
   else
      gflops = -999.0
   endif
   write (*,*) 'lower triangular matrix'
   write (*,'(3(1x,a,g12.3))') 'a = ', a(1,1), 'b = ', b(1), 'c = ', c(1,1)
   write (*,'(3(1x,a,g12.3))') 'a = ', a(nsiz,1), 'b = ', b(1), 'c = ', c(nsiz,1)
   write (*,'(1x,a,f10.3)') 'total number of floating point calcs (gflop): ', gflop
   write (*,'(1x,a,f10.3)') 'total cpu time (seconds); ', cputimesecs
   write (*,'(1x,a,f10.3)') 'floating point ops per sec (gflops); ', gflops
   print stars
   write (*,*) 'fpu_bench has finished running'
end program

subroutine check(n,c)
   implicit none
   integer, intent(in) :: n
   double precision, intent(in) :: c(n,n)
   integer i,j,nrep
   nrep = 5000
   do j = 1, n
      do i = 1, n
         if(j > i)then
            if(c(i,j) /= 0.0d0)print '(1x,2i5,es12.4)',i,j,c(i,j)
         else
            if(c(i,j) /= dble((2*i+3*j+0.005*nrep)*j))then
               print '(1x,2i5,es12.4)',i,j,c(i,j)
            endif
         endif
      enddo
   enddo
   return
end subroutine
So this thread took a bit of a tangent. In that regard, I'm attaching what I believe is a working version of the fpu benchmarking program that I created. I didn't test it as far as the ifx slowdown issue; I think the original code I posted shows that well. This update is meant to show the correct GFLOPS numbers. You may have to tweak some of the internal variables and spreadsheet calculations if your cpu is substantially different from mine. However, I think this will work for a lot of processors that support AVX-512.
Update:
I did some quick tests comparing compilers and this was the result. On my actual codes, I don't see this type of slowdown. So I was surprised to see this on such a simple code.