Intel® Fortran Compiler

Performance of double vs single precision

arikd
Beginner
7,773 Views

Our software does a lot of floating-point linear algebra (CFD modeling). Many operations require double precision for larger matrices. Unfortunately, the performance we have been getting out of both Intel and AMD processors has been poor: double precision is almost 2x slower than the same code using single precision.

We tested the software both under 32-bit and 64-bit Windows (and compiled with corresponding 32 and 64 bit options) with no visible difference. Are there compiler options that we are missing that could speed up double precision computations?

Thanks.

0 Kudos
39 Replies
Steven_L_Intel1
Employee
5,790 Views
I'm curious about your expectations. Do you have a reason to believe that single and double precision operations should be the same speed? I'll comment that if you have large arrays of data, computation time might be swamped by memory traffic, as double precision doubles the memory use.

In general, I would recommend the following options.

/O3
/QxS (or T or P), depending on which Intel processor model you're using.
/Qprec-div-
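
For example (the source file name here is just a placeholder, and /QxT assumes a Core 2 class processor), a compile line using those options might look like:

ifort /O3 /QxT /Qprec-div- solver.f90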

I'd also strongly recommend running the application through the VTune Performance Analyzer and seeing what it comes up with for hot spots and data on things such as cache misses.
0 Kudos
arikd
Beginner
5,790 Views

Thanks, Steve. My expectation, based on UNIX workstation experience (non-Intel processors), is that they should be roughly the same speed, i.e. within a few percent.

Arik

0 Kudos
Steven_L_Intel1
Employee
5,790 Views
Was that an IBM workstation by any chance? The Power architecture does not have single-precision operations, if I recall correctly.

You definitely want to make sure that you are enabling vectorization on the Intel and AMD processors. (For AMD, use /QxW or /QxO; the latter is for SSE3-capable Opterons.)

/Qipo may also be of use.
0 Kudos
arikd
Beginner
5,790 Views

What can we do about cache misses? Thanks,

Arik

0 Kudos
arikd
Beginner
5,790 Views

I never had an IBM; we used SPARCs and HPs, as well as supercomputers such as Cray and Alliant.

As to vectorization options, we certainly use them, but they are the ones that give us grief. There are cases where one matrix solver dies, even though it works fine with these options off.

0 Kudos
Steven_L_Intel1
Employee
5,790 Views
If you are seeing a lot of cache misses, you may want to look at how the application is accessing the arrays - is it following the elements in memory order or are there gaps between memory locations? The compiler can do a lot of things to help with -O3 but there's a limit. Some judicious use of prefetch directives MAY help once you understand the memory access pattern.

But first, you need to know what exactly the program is doing, and that's where VTune can help.
0 Kudos
TimP
Honored Contributor III
5,790 Views
Simply telling us you have a CFD code doesn't give much information about your cache miss situation. You would want to process as many dependent variables as possible in a single loop, so as to economize access to your grid information. Strategies for sorting elements so as to improve cache locality may be worth investigation.

0 Kudos
arikd
Beginner
5,790 Views

Specifically the software spends about 90% of time in subroutines using conjugate-gradient methods to solve banded matrices resulting from discretizing governing equations on a structured grid. All arrays are 3D, i.e. a(i,j,k) and hence each expression typically contains a(i,j+1,k), a(i,j,k-1) etc terms, which are not immediately next to each other storage-wise.

I would appreciate it if you could comment more specifically on what "economize access to grid information" and "cache locality" mean and how one can go about dealing with these issues. I don't expect a recipe, but rather a reference to papers/books etc. that explain this. I wonder if Intel has any papers addressing these issues.

Thanks.

Arik

0 Kudos
TimP
Honored Contributor III
5,790 Views

If your data are stored that way, arranging your loops with the i index in the inner loop and j in the middle loop, and operating on all dependent variables in the same loop, should do the job. Typical cache-miss problems in such codes are due to irregular storage of automatically meshed data, so apparently you should not be seeing that much difficulty. Ideally, you want to use all the data from a cache line, then move on to the next cache line, allowing hardware prefetch to bring in the cache lines that will be needed next.
If you must operate with the k index in the inner loop, unrolling and "jamming" outer loops so that you still operate on several values of i before moving to the next k should help. ifort -O3 can deal to a limited extent with re-ordering loops, when the nested loops aren't hidden by subroutine calls.
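As a rough sketch of both points (the arrays u, v, w, their right-hand sides fu, fv, fw, and the step dt are made-up names for illustration only): keeping i innermost means consecutive iterations touch consecutive memory, and updating several dependent variables in the same pass reuses the grid data already in cache:

      do k = 1, Nz
        do j = 1, Ny
          do i = 1, Nx
            ! i is the contiguous index, so this sweeps memory in storage order
            u(i,j,k) = u(i,j,k) + dt*fu(i,j,k)
            v(i,j,k) = v(i,j,k) + dt*fv(i,j,k)
            w(i,j,k) = w(i,j,k) + dt*fw(i,j,k)
          enddo
        enddo
      enddo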
0 Kudos
arikd
Beginner
5,790 Views

We have definitely always used i as the inner loop, etc. Here's a sample loop from the code:

!$omp parallel do
!$omp& private( i, j, k )
      do k = 1, Nz
        do j = 1, Ny
          do i = 1, Nx
            ! if possible, use canned routines
            rb(i,j,k) = z(i,j,k) + beta*rb(i,j,k)
          enddo
        enddo
      enddo

Apparently this has not done the trick.

0 Kudos
Steven_L_Intel1
Employee
5,790 Views
Are you using version 10.1? That version has an improved optimizer that can both parallelize and vectorize code more efficiently.

You first asked about the relative difference between single and double. How is the overall performance compared to the UNIX platforms?
0 Kudos
TimP
Honored Contributor III
5,790 Views
As you're using OpenMP on Windows, setting the KMP_AFFINITY environment variable should bring some benefit, in case you haven't tested it.
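For example, on Windows (the value shown is just one of the recognized settings; check the documentation for your compiler version for the supported affinity types):

set KMP_AFFINITY=compact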
0 Kudos
jimdempseyatthecove
Honored Contributor III
5,790 Views

Arikd,

Your sample parallel do loop might benefit from declaring (via !DEC$ directives) the arrays rb and z as aligned, and thus improving the vectorization performance of the innermost loop.
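
A minimal sketch of that idea, assuming rb and z are ordinary local declarations with known extents (whether the directives can be applied to your actual arrays depends on how they are declared and allocated; 16 bytes matches the SSE vector width):

      real(8) :: rb(Nx,Ny,Nz), z(Nx,Ny,Nz)
!DEC$ ATTRIBUTES ALIGN : 16 :: rb
!DEC$ ATTRIBUTES ALIGN : 16 :: z

      do k = 1, Nz
        do j = 1, Ny
!DIR$ VECTOR ALIGNED
          do i = 1, Nx
            rb(i,j,k) = z(i,j,k) + beta*rb(i,j,k)
          enddo
        enddo
      enddo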

If Ny*Nx is quite large, then you might benefit from using a dynamic schedule with an appropriate chunk size, i.e. program it so that all available cores finish at approximately the same time.
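
For instance, applied to the loop posted above (the chunk size of 4 is only a starting point to tune):

!$omp parallel do schedule(dynamic,4) private(i,j,k)
      do k = 1, Nz
        do j = 1, Ny
          do i = 1, Nx
            rb(i,j,k) = z(i,j,k) + beta*rb(i,j,k)
          enddo
        enddo
      enddo
!$omp end parallel do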

Also, depending on the relative values of Nz and Ny and the number of cores available, it may be advantageous to swap the order of the outer loops (or test and run one of two !$omp parallel do variants).

As a diagnostic aid during performance tuning, I find it beneficial to set thread affinities at initialization. Then, because the threads don't wander between processors, the per-CPU runtime statistics on a nested loop like the one in your example will easily reveal work-starved threads.

Jim Dempsey

0 Kudos
arikd
Beginner
5,790 Views

We are still on v9.x, but are planning to port it over to 10 right after the new year.

I can't say much about UNIX these days; I have not used any UNIX computers since last century. At the time we switched, somewhere around 1994, a 486-66 was giving the SPARC a good run for the money, especially if you take into account that the least expensive SPARC was, if I recall correctly, ~10+k, while a comparable PC was ~3k.

I wonder if the r8 vs. r4 performance hit we see is due to Intel chips still being, on some level, 32-bit chips, while the old UNIX workstations were, I believe, native 64-bit.

0 Kudos
Steven_L_Intel1
Employee
5,790 Views
No; the 64 vs. 32 bit distinction has to do with address size and nothing to do with floating point.
0 Kudos
arikd
Beginner
5,790 Views
In that case, using double precision explains only the doubling of RAM requirements, but not the doubling of CPU time. Any ideas?
0 Kudos
Steven_L_Intel1
Employee
5,790 Views
I would expect double-precision arithmetic to be slower than single-precision. Without a test case to look at, I'm not willing to speculate further. I'd also want to make sure that you're getting the advantage of vectorization.
0 Kudos
TimP
Honored Contributor III
5,790 Views
Comparing single and double precision vectorized code, double precision generally requires twice as many instructions to run the same job. Also, double precision evidently doubles the volume of data movement. A 90% increase in time to complete the same job in double precision is not unusual. As Steve says, you haven't given adequate information to judge whether this applies to your application.
If your goal is to find a compilation mode where there is relatively little difference between single and double precision, you could restrict your application to x87 code: you might find that double precision typically slows down by 40% compared to vectorized code, while single precision slows down much more. Intel compilers support x87 code only in 32-bit mode; gfortran supports -mfpmath=387 in both 32- and 64-bit mode, in case you have a reason for using it.
IA-64 processors also show less difference in performance between single and double precision than Xeon does, as they have no single-precision 128-bit-wide SIMD operations.
0 Kudos
arikd
Beginner
5,790 Views

I read some papers on the subject of code optimization, and it appears to be a bit more than one can do on the fly. Could you recommend either individuals or companies that specialize in optimizing Fortran matrix solvers for Intel processors?

0 Kudos
TimP
Honored Contributor III
5,324 Views
The example you provided is so simple that it's not clear how to improve on the suggestions already made. Where you have matrix operations that fit the LAPACK and BLAS mold, you have a wide variety of "canned solvers," as your comment says, including Intel MKL, AMD ACML, and more.
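For instance, the rb = z + beta*rb update posted earlier maps onto two Level-1 BLAS calls (a sketch that treats the contiguous 3D arrays as one long vector and assumes rb, z and beta are double precision; dscal and daxpy are standard BLAS routines provided by both MKL and ACML):

      integer :: n
      n = Nx*Ny*Nz
      ! rb := beta*rb, then rb := 1.0*z + rb, i.e. rb = z + beta*rb
      call dscal(n, beta, rb, 1)
      call daxpy(n, 1.0d0, z, 1, rb, 1)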
0 Kudos
Reply