OpenMP & clockx()

bendel_boy1 · ‎02-26-2009

Running the example given here - http://www.polyhedron.com/openmp - which claims to be for Intel Fortran, I get the serial version to run OK, but when I then enable OpenMP I get a message that the timing routine CLOCKX cannot be found. Does OpenMP inhibit the use of the portability module IFPORT? What alternative should I try?

Steven_L_Intel1 · ‎02-26-2009

I tried this with 11.0.072 and had no problems building the program. What is the exact error you get? Can you attach the build log (see below for attach instructions)?

bendel_boy1 · ‎02-26-2009

------ Build started: Project: OpenMPTest, Configuration: Debug|Win32 ------

Deleting intermediate files and output files for project 'OpenMPTest', configuration 'Debug|Win32'.
Compiling with Intel Fortran 9.1 C:\Program Files\Intel\Compiler\Fortran\9.1\IA32\...
ifort /nologo /Zi /Od /Qparallel /fpscomp:nolibs /module:"Debug/" /object:"Debug/" /traceback /check:bounds /libs:static /dbglibs /c /Qvc7.1 /Qlocation,link,"D:\Program Files\Microsoft Visual Studio .NET 2003\Vc7\bin" "d:\My Documents\Visual Studio Projects\OpenMPTest\OpenMPTest\OpenMPTest.f90"
Linking...
Link /OUT:"Debug/OpenMPTest.exe" /INCREMENTAL:NO /NOLOGO /DEBUG /PDB:"Debug/OpenMPTest.pdb" /SUBSYSTEM:CONSOLE "Debug/OpenMPTest.obj"
Link: executing 'link'
OpenMPTest.obj : error LNK2019: unresolved external symbol _CLOCKX referenced in function _MAIN__
Debug/OpenMPTest.exe : fatal error LNK1120: 1 unresolved externals

OpenMPTest build failed.

Source code:

[plain]program OpenMPTest
USE IFPORT
implicit none

! Variables
integer, parameter:: NumSteps = 20000000 ! 2E7
double precision:: StartTime, StopTime
double precision:: e, pi, factorial, product
integer         :: i
character       :: a*1

call clockx(StartTime)
!$OMP PARALLEL SECTIONS SHARED (e, pi)
!$OMP SECTION
print *, 'Calculation of e begun ...'
e = 1d0
factorial = 1d0
do i = 1, NumSteps
  factorial = factorial * i
  e = e + 1d0 / factorial
end do
print *, 'Calculation of e completed ... ', e
!$OMP SECTION
print *, 'Calculation of PI started ...'
pi = 0d0
do i = 0, NumSteps * 10
  pi = pi + 1d0 / (4d0 * i + 1d0) - 1d0 / (4d0 * i + 3d0)
end do    
pi = pi * 4d0
print *, 'PI calculated ...', pi
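!$OMP END PARALLEL SECTIONS
! Combine the results only after both sections have finished; inside the
! second section, e might not have been computed yet.
product = e * pi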
call clockx(StopTime)
print *, 'Reached result ', product, ' in ', (StopTime - StartTime) / 1d6, " seconds"
print *, 'Press Q and [RETURN] to quit'
read *, a
end program OpenMPTest[/plain]

bendel_boy1 · ‎02-26-2009

Explicitly adding libifport.lib on the linker page resolved the problem. I've no idea why the library was found under one setting (parallelisation disabled) and not the other. I didn't need to specify a path, so that wasn't the root cause.

Steven_L_Intel1 · ‎02-26-2009

You have /fpscomp:nolibs set in the project. This will prevent the portability library from being linked in.
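
For readers who would rather not depend on the portability library for timing at all, here is a minimal sketch using the standard SYSTEM_CLOCK intrinsic instead of CLOCKX. The program name, the variable names and the dummy workload are made up for illustration; this is not the code from the Polyhedron page, only one possible alternative.

```fortran
program TimingSketch
  implicit none

  ! Hypothetical names; SYSTEM_CLOCK itself is a standard intrinsic,
  ! so no portability library is needed at link time.
  integer          :: count_start, count_stop, count_rate
  double precision :: elapsed, total
  integer          :: i

  ! count_rate is the number of clock ticks per second (0 if no clock).
  call system_clock(count_start, count_rate)

  ! Dummy workload standing in for the real calculation.
  total = 0d0
  do i = 1, 1000000
    total = total + 1d0 / dble(i)
  end do

  call system_clock(count_stop)

  if (count_rate > 0) then
    ! Note: the counter can wrap around on long runs; this sketch
    ! does not correct for that.
    elapsed = dble(count_stop - count_start) / dble(count_rate)
    print *, 'Sum ', total, ' in ', elapsed, ' seconds'
  else
    print *, 'No system clock available; sum = ', total
  end if
end program TimingSketch
```

SYSTEM_CLOCK reports wall-clock ticks, so it should remain meaningful when several threads are running, but its resolution depends on the compiler and platform.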

bendel_boy1 · ‎02-27-2009

Ah! I thought that the portability library was an Intel thing, not a PowerStation thing.

Thank you.

Now for some curious things.

On an AMD Sempron the serial version takes c. 13 seconds; the parallel version c. 6 seconds. (XP as OS)
On an Intel Q6600 the serial version takes 17 seconds, the parallel version 11 seconds (Vista as OS)

Why do I see a better speedup on a machine that is single core? Is this a reflection of how bad Vista is as an OS?
(A different serial program, using VB6, saw the Q6600 running c. 8 times faster than the Sempron.)

TimP · ‎02-27-2009

Why are you comparing Fortran debug mode with VB6 optimized mode? Did you set KMP_AFFINITY? Windows 7 should be less dependent on KMP_AFFINITY than Vista, if that's what you mean by "how bad Vista is." Vista-64 usually performs OK otherwise.
True, it's usually relatively easy to get threaded performance scaling when you choose options to make your serial version as slow as possible.

bendel_boy1 · ‎02-27-2009

The VB6 program is not being compared with the Fortran. It is being compared for the performance differences I can expect between the Sempron & the Q6600. After a week on the Sempron I ran the same job on the Q6600, & since it was VB6 I dedicated a core to the program. (Easy to do using Task Manager when a program runs for days; not so easy with a small test that runs for 6 seconds.) After 5 days the Q6600 had completed, while the Sempron was still around half-way through. The program is not linear in compute time with progress, as it is Monte-Carlo based, but nevertheless the Q6600 was considerably faster than the Sempron.

So I was expecting a similar result with my first foray into OpenMP: that the Sempron, a single core, would show little improvement with parallelisation, that the Q6600 would, and that the Q6600 would be faster. None of these expectations materialised. The availability of SSE2 may explain the Sempron results - I don't know enough about the Sempron & the Intel compiler to be sure. As for the Q6600, I had noticed that Vista seems to task switch between the cores excessively, resulting in inefficient usage. I was told that Vista does this to improve responsiveness for typical GUI-oriented applications, rather than numerically-intensive programs.

But given the screenshots on the web page from which I took the code, I was, again, expecting that I would see better utilisation of the cores. As for KMP_AFFINITY, I haven't encountered that - I didn't see it in the Intel documentation, nor in what little I have read on OpenMP. I typed KMP_AFFINITY into the compiler help file & got a page on the MOD() function. My guess would be that KMP_AFFINITY came a little later than my version of the compiler.

I have the basics of what I wanted - my first experiment with OpenMP, and the results on two processors showing that there is a useful benefit from enabling parallelisation. What remains is curiosity about specific results.

As for Vista - I have noticed that my 32-bit Vista often behaves sluggishly DESPITE the performance meters indicating the CPU activity is low & physical free RAM is plentiful. It may be that the sluggishness is IO-bound; but opening a small Adobe document while using a newsreader is not commonly regarded as a high-IO activity. This is all an aside. If there is some way of indicating programmatically that I want cores to be dedicated to a program, then I am here to learn.

TimP · ‎02-27-2009

Fortran and OpenMP don't provide a direct way to lock threads to cores, because supporting the various possible platforms and allowing other applications to coexist effectively becomes impractical. C/C++ programmers often advocate that.
According to the docs, the latest ifort has the /Qpar-affinity=compact option, which you would set, if you don't intend to SET KMP_AFFINITY=. This would set the behavior of ifort OpenMP or auto-parallel similar to what it is with PGI OpenMP. It will work to keep threads 0 and 1 together on one cache, and 2 and 3 on the other. The docs say the command line option over-rides the run-time environment variable, which seems the wrong way around to me. In fact, I'll submit an issue asking for an explanation. If you also set the verbose modifier, it will give some information on what it is doing. You may not like what it tells you on the AMD platform, but at least it does no harm.
The KMP_AFFINITY environment variable (with no par-affinity) goes back to ifort 10.1. You can't rely on good quad core performance with older versions.
With past ifort versions, if you wanted SSE code to run on AMD, you would have to specify it at compile time (e.g. /QxW), as well as building in release mode. It's possible that your AMD executes your debug x87 code more efficiently than the Intel CPU does.
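
Before experimenting with affinity settings, it may help to confirm how many threads the OpenMP runtime actually uses on each machine. The following is a minimal sketch, assuming the program is built with OpenMP enabled so that the standard omp_lib module is available; the program name and the messages are made up for illustration.

```fortran
program SectionThreads
  use omp_lib          ! standard OpenMP module; requires an OpenMP build
  implicit none
  double precision :: t0, t1

  ! On a single-core machine both values are expected to be 1 by default,
  ! unless OMP_NUM_THREADS says otherwise.
  print *, 'Processors visible to OpenMP:  ', omp_get_num_procs()
  print *, 'Max threads in a parallel region: ', omp_get_max_threads()

  t0 = omp_get_wtime()
  !$OMP PARALLEL SECTIONS
  !$OMP SECTION
  print *, 'Section 1 ran on thread ', omp_get_thread_num()
  !$OMP SECTION
  print *, 'Section 2 ran on thread ', omp_get_thread_num()
  !$OMP END PARALLEL SECTIONS
  t1 = omp_get_wtime()

  print *, 'Elapsed wall time (s): ', t1 - t0
end program SectionThreads
```

If both sections report the same thread number, the two calculations are running one after the other, and any difference from the serial build has to come from something other than parallel execution.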

bendel_boy1 · ‎02-28-2009

Another curio: on the Q6600, in release mode, the OpenMP version takes 2.7 seconds, but the non-OpenMP version takes 0.7 seconds. The only difference between them is whether the OpenMP setting is "Disable" or "Generate parallel code".