Intel® Fortran Compiler

Testing 64-bit performance in ifort

vibrantcascade
Beginner
I have a chunk of Fortran code I'm using to do some scientific computing, and I'm trying to see what the benefits of moving to a 64-bit compiler are going to be. It makes extensive use of 64- and 128-bit numbers (doubles and real*16) and can run for months at a time. In the past this code has been run on a Cray, and it was about 5x as fast at a given processor speed because the Cray compilers fully leveraged the 64-bit registers to massively reduce the number of CPU cycles per calculation.

So I've decided to give it a try using the newest 64-bit Intel compiler for Linux on a Core 2 vPro (Fortran Composer for Linux 2011 with the 64-bit Intel MKL, downloaded on Tuesday; the Core 2 Duo architecture is fully 64-bit, last I checked). I'm running on a fresh install of Ubuntu 11.04 64-bit.

I've tried running the code with and without the -m64 option, and as far as I can tell it makes no difference. Given the speed of the code when running on 32-bit systems compared to this system, it would appear that any speed increase I'm seeing is due entirely to processor frequency.

So as far as I can tell, ifort isn't making full use of the 64-bit registers, and I guess my question comes down to two things.

1. Have others encountered this at all? Any solutions?

2. Is there any sort of code I can use to measure the number of CPU cycles? I would like to run a 32-bit and a 64-bit build (on different machines, or on the same machine with different compiles) and compare the number of CPU cycles it takes to complete a math operation on double- and quad-precision numbers between the two versions. A rough sketch of what I mean follows.
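Something like the following harness is the kind of thing I had in mind (a sketch only: it uses SYSTEM_CLOCK, and the loop count and the cpu_hz clock frequency are placeholders that would have to be adjusted for the actual machine):

! Rough timing sketch: time a fixed number of quad-precision multiplies
! and convert the elapsed time into an approximate cycle count.
program time_quad
  implicit none
  integer, parameter :: qp = selected_real_kind(30, 4931)
  integer(8) :: t0, t1, rate, i
  integer(8), parameter :: n = 10000000_8
  real(qp) :: x
  real(8), parameter :: cpu_hz = 2.4d9   ! placeholder: your actual clock frequency
  real(8) :: seconds

  x = 1.0000001_qp
  call system_clock(t0, rate)
  do i = 1, n
     x = x * 1.0000001_qp
  end do
  call system_clock(t1)
  seconds = real(t1 - t0, 8) / real(rate, 8)
  print *, 'result (keeps the loop from being optimized away):', x
  print *, 'approx. cycles per multiply:', seconds * cpu_hz / real(n, 8)
end program time_quad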

Thanks!
3 Replies
mecej4
Honored Contributor III
If your program loads the FPU more than the IPU (integer unit), as is typical of scientific calculations, it is perfectly reasonable for the speed difference to be negligible. Note that the floating-point registers and operand sizes (32, 64, 80, 128 bits) are the same whether you are running 32-bit or 64-bit code.

It is even possible that the 32-bit code may run faster on a 64-bit CPU on a 64-bit OS, because if 32-bit integers and pointers are adequate for your program there is no need to move twice as many bytes around through memory and cache.

In such a situation, i.e., a floating point bound program, there is not much for the compiler to "leverage".

The story could be quite different if your program needed 64-bit integers and did intensive computations with them.
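As a minimal (hypothetical) illustration of the kind of code that does benefit from a 64-bit build, consider a loop whose counter and accumulator both exceed the 32-bit integer range; on a 64-bit target these are single native operations, while a 32-bit build has to synthesize each 64-bit operation from 32-bit pieces:

program int64_demo
  implicit none
  integer(8) :: i, s
  s = 0_8
  do i = 1_8, 3000000000_8   ! iteration count exceeds the 32-bit integer range
     s = s + i               ! the sum also exceeds the 32-bit range
  end do
  print *, s
end program int64_demo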
vibrantcascade
Beginner
Unfortunately, doubles only give around 15 digits of precision, and I need around 20-25 digits to make these scientific calculations worthwhile. So I've been using real*16 to get that precision and standard doubles for the other numbers that aren't as sensitive. I haven't yet tried using doubles and singles and simply promoting their sizes to 128 and 64 bits with a compile option, but I'm assuming the compiler will handle them exactly the same way. (If this isn't true, please let me know.)
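For reference, the portable way to request those two precisions explicitly is with SELECTED_REAL_KIND rather than real*8/real*16; the mapping noted in the comments (to 8- and 16-byte reals) is what ifort normally returns, but treat it as an assumption for a given installation:

program kinds_demo
  implicit none
  integer, parameter :: dp = selected_real_kind(15, 307)    ! ~15 digits, normally real(8)
  integer, parameter :: qp = selected_real_kind(30, 4931)   ! ~33 digits, normally real(16)
  real(dp) :: x
  real(qp) :: y
  x = 1.0_dp / 3.0_dp
  y = 1.0_qp / 3.0_qp
  print *, precision(x), x   ! decimal digits carried by each kind
  print *, precision(y), y
end program kinds_demo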

Given that Crays now run on Intel processors, and the last Cray I tested on got about a 5x speed increase per clock per processor core with this code, I figured it was a software issue and that the newest 64-bit MKL and compiler would make a similar difference. I thought it might save dozens of cycles on pulling apart the 64- and 128-bit numbers, calculating with them, and recombining them, if it could work with 64 bits at a time instead of 32. But if the FPU registers are the exact same size either way, that makes sense. Oh well, thanks anyhow :(
jimdempseyatthecove
Honored Contributor III
The Intel 64 architecture does not support real*16 in hardware; it uses an emulation library. The FPU can run with 80-bit floating point, but not 128-bit. If you need the full accuracy of real*16, then you will need the emulation library (linked in automatically when you use real*16).

Emulation is relatively slow compared to using the FPU or SSE on doubles. SSE can also vectorize some code (IOW, perform the same operation on two doubles at once).

I am not aware of any plans for next-generation SSE/AVX to support 128-bit floating-point numbers.
AVX2 has added some bit-manipulation instructions, so the 128-bit floating-point emulator might see some improvement.

Try to keep the number of 128-bit variables to a minimum.
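One common pattern (a sketch only, assuming the part that really needs the extra digits is an accumulation) is to do the bulk arithmetic in double precision and keep just the running sum in real*16:

program mixed_precision
  implicit none
  integer, parameter :: dp = selected_real_kind(15, 307)
  integer, parameter :: qp = selected_real_kind(30, 4931)
  integer :: i
  real(dp) :: term
  real(qp) :: total
  total = 0.0_qp
  do i = 1, 1000000
     term = 1.0_dp / real(i, dp)**2   ! cheap double-precision work
     total = total + real(term, qp)   ! only the accumulator is quad precision
  end do
  print *, total                      ! converges toward pi**2/6
end program mixed_precision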

Note: if you are manipulating (relatively) very small numbers together with (relatively) very large numbers, then consider biasing the numbers.

Example:

Assume you are writing a (Titan) lander program that models the whole solar system. The distance in feet (meters) from the solar system barycenter (near Sol) to Titan, plus or minus the landing-control scale (in feet/meters), might require you to express the barycenter-to-Titan distance using real*16.

However, you can instead establish a waypoint near Titan in whole kilometers (or 100 km units) as one variable, then use relative offsets from this waypoint as a second variable. With this you attain ~104 bits of precision in ordinary doubles, as opposed to the ~112 bits of a REAL*16 significand (the exponent fields are 11 vs 15 bits).
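A minimal sketch of the waypoint idea (the names, units, and values below are illustrative only, not taken from any real lander code):

program waypoint_bias
  implicit none
  integer, parameter :: dp = selected_real_kind(15, 307)
  real(dp) :: waypoint_m   ! coarse position: whole kilometers from the barycenter, held in meters
  real(dp) :: offset_m     ! fine position: meters relative to the waypoint
  real(dp) :: absolute_m   ! formed only when the full value is actually needed

  waypoint_m = 1.4e12_dp   ! roughly the Saturn/Titan distance, rounded to whole kilometers
  offset_m   = 123.456_dp  ! local maneuvering, comfortably within double precision

  ! Do the sensitive arithmetic on offset_m; combine only for output.
  absolute_m = waypoint_m + offset_m
  print *, waypoint_m, offset_m, absolute_m
end program waypoint_bias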

Jim Dempsey