
Initial fortran compile and run is slow first run, 10 times faster second?

steve_o_
Beginner

Hi

I have been using the trial of parallel studio 2015 and just about to purchase

I'm using ifort v15.0 to write neural networks on OS X. I've been impressed with ifort: on the same code it has been between 1.5 and 2.5 times faster than other Fortran compilers, compared like for like at -O3 (e.g. a backpropagation benchmark: 11s vs 27s for the slowest Fortran compiler, and 54s for the fastest open-source Java implementation).

However, I notice that when I compile and run, the first run is 10 times slower than subsequent runs; e.g. a run of Levenberg-Marquardt takes 3.088E-002 seconds and subsequent runs take approx. 1.1339E-003 seconds (command line).

I could understand this if it were Java JIT, but why would this happen with ifort? Any ideas - am I missing something?

I'm looking to write the neural net routines in a static lib to call from Xcode (Objective-C and Swift - I'm using these for GUIs etc.).

regards

Steve

4 Replies
TimP
Honored Contributor III

We've seen tests where as much as an extra second was spent the first time data structures were built. Filling ("warming") the cache may account for time differences such as the ones you mention. It's not unusual to need to average several runs, excluding the first.

If the time command in your shell has a resolution of 1/64 second (64 Hz), you could hardly expect meaningful results for runs much under a second. I take it you're not using the OpenMP library, which has a default latency of 200 ms for closing out threads.

When comparing the performance of various compilers, you need to read the docs and select equivalent optimizations, noting that ifort typically needs non-default options for correctness while others require non-default options to invoke various levels of vectorization etc. For one thing, the unroll options of various compilers differ greatly.

steve_o_
Beginner

Hi Tim

I believe I was using OpenMP, and I did try to read the docs for the other compilers and play around a little. I'm not complaining, just curious ;-) I was using dclock() in my program for timings. I'm not a developer these days, nor have I played with compilers or low-level stuff for over 20 years, since the days of 8086s, transputers and mainframes, so I'm just getting a refresher on optimisation.

e.g. this is how I am compiling:

ifort levenbergMarquardt.f90 -i8 -openmp -I$MKLROOT/include/ilp64 -I$MKLROOT/include $MKLROOT/lib/libmkl_blas95_ilp64.a $MKLROOT/lib/libmkl_intel_ilp64.a $MKLROOT/lib/libmkl_core.a $MKLROOT/lib/libmkl_intel_thread.a -lpthread -lm -march=native -O3 -o levenberg

 

jimdempseyatthecove
Honored Contributor III

If your program allocates memory, you will encounter first-touch overhead the first time you write to each page (typically 4KB or 4MB). On first touch, a page fault is generated, trapping to the OS; the OS then determines whether the addressed page is valid for your program's virtual address space and, if it is, a page is assigned in the page file and/or a page of physical RAM is mapped into the process's virtual memory. This is thousands of times longer than a cache miss.

The usual procedure is to discard the first run, or the first few runs if the program's working data migrates about.

For programs that only run once, consider walking the freshly (first-time) allocated memory, writing at every page-size offset. Doing so incurs the same mapping overhead, but it avoids the cache evictions that would occur if the first-touch page mapping were performed while executing the code you wish to time.
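
A minimal sketch of that pre-touch walk, assuming 4 KB pages and 8-byte reals; the array name and sizes here are made up for illustration:

program pretouch_demo
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: page_bytes = 4096          ! assumed 4 KB pages
  integer, parameter :: stride = page_bytes / 8    ! one 8-byte element per page
  integer, parameter :: n = 10000000
  real(dp), allocatable :: work(:)
  integer :: i

  allocate(work(n))

  ! Touch one element in every page so the OS maps all pages now,
  ! outside the region you intend to time.
  do i = 1, n, stride
     work(i) = 0.0_dp
  end do
  work(n) = 0.0_dp        ! make sure the last page is touched too

  ! ... timed computation on work(:) goes here ...
end program pretouch_demo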

If using OpenMP, then use the omp_get_wtime() function for timing. I even use this for timing serial code.
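
A small sketch of that timing pattern, discarding the cold first run as suggested above; run_kernel is a stand-in for whatever is being timed:

program time_with_omp
  use omp_lib, only: omp_get_wtime
  implicit none
  integer, parameter :: nruns = 5
  double precision :: t0, t1, elapsed(nruns)
  integer :: k

  do k = 1, nruns
     t0 = omp_get_wtime()
     call run_kernel()                    ! placeholder for the real work
     t1 = omp_get_wtime()
     elapsed(k) = t1 - t0
  end do

  ! Report the cold first run separately and average the warm ones.
  print *, 'first (cold) run :', elapsed(1), 's'
  print *, 'mean of warm runs:', sum(elapsed(2:)) / (nruns - 1), 's'

contains

  subroutine run_kernel()
    ! Dummy workload standing in for e.g. a Levenberg-Marquardt step.
    integer :: i
    double precision :: s
    s = 0.0d0
    do i = 1, 1000000
       s = s + sqrt(dble(i))
    end do
    if (s < 0.0d0) print *, s    ! keeps the loop from being optimized away
  end subroutine run_kernel

end program time_with_omp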

Jim Dempsey

TimP
Honored Contributor III

As Jim pointed out, the OpenMP timer overcomes the 64 Hz limitation of certain timers. On Linux (and, presumably, Mac), system_clock is fine when you use integer(8) arguments.
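
For reference, a minimal system_clock pattern with integer(8) arguments (no OpenMP required); the surrounding program is only illustrative:

program time_with_system_clock
  implicit none
  integer(8) :: c0, c1, rate
  double precision :: seconds

  call system_clock(count_rate=rate)   ! ticks per second; finer with integer(8) counts
  call system_clock(c0)
  ! ... code being timed goes here ...
  call system_clock(c1)
  seconds = dble(c1 - c0) / dble(rate)
  print *, 'elapsed:', seconds, 's'
end program time_with_system_clock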
