Intel® Fortran Compiler

Performance of double vs single precision

arikd
Beginner
7,773 Views

Our software does a lot of floating-point linear algebra (CFD modeling). Many operations require double precision for larger matrices. Unfortunately, the performance we have been getting out of both Intel and AMD processors has been poor: double precision is almost 2x slower than the same code using single precision.

We tested the software both under 32-bit and 64-bit Windows (and compiled with corresponding 32 and 64 bit options) with no visible difference. Are there compiler options that we are missing that could speed up double precision computations?

Thanks.

0 Kudos
39 Replies
Steven_L_Intel1
Employee
5,790 Views
I'm curious about your expectations. Do you have a reason to believe that single and double precision operations should be the same speed? I'll comment that if you have large arrays of data, computation time might be swamped by memory traffic, as double precision doubles the memory use.

In general, I would recommend the following options.

/O3
/QxS (or T or P), depending on which Intel processor model you're using.
/Qprec-div-
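
For example (the source file name here is just a placeholder, and /QxT assumes a Core 2 class processor), a compile line using those options might look like:

ifort /O3 /QxT /Qprec-div- solver.f90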

I'd also strongly recommend running the application through the VTune Performance Analyzer and seeing what it comes up with for hot spots and data on things such as cache misses.
0 Kudos
arikd
Beginner
5,790 Views

Thanks, Steve. My expectation, based on UNIX workstation experience (non-Intel processors), is that they should be roughly the same speed, i.e. within a few percent.

Arik

0 Kudos
Steven_L_Intel1
Employee
5,790 Views
Was that an IBM workstation by any chance? The Power architecture does not have single-precision operations, if I recall correctly.

You definitely want to make sure that you are enabling vectorization on the Intel and AMD processors. (For AMD, use /QxW or /QxO; the latter is for SSE3-capable Opterons.)

/Qipo may also be of use.
0 Kudos
arikd
Beginner
5,790 Views

What can we do about cache misses? Thanks,

Arik

0 Kudos
arikd
Beginner
5,790 Views

I never had an IBM; we used SPARCs and HPs, as well as supercomputers such as Cray and Alliant.

As to vectorization options, we certainly use them, but they are the ones that give us grief. There are cases where one matrix solver dies, even though it works fine with these options off.

0 Kudos
Steven_L_Intel1
Employee
5,790 Views
If you are seeing a lot of cache misses, you may want to look at how the application is accessing the arrays - is it following the elements in memory order or are there gaps between memory locations? The compiler can do a lot of things to help with -O3 but there's a limit. Some judicious use of prefetch directives MAY help once you understand the memory access pattern.

But first, you need to know what exactly the program is doing, and that's where VTune can help.
0 Kudos
TimP
Honored Contributor III
5,790 Views
Simply telling us you have a CFD code doesn't give much information about your cache miss situation. You would want to process as many dependent variables as possible in a single loop, so as to economize access to your grid information. Strategies for sorting elements so as to improve cache locality may be worth investigation.

0 Kudos
arikd
Beginner
5,790 Views

Specifically the software spends about 90% of time in subroutines using conjugate-gradient methods to solve banded matrices resulting from discretizing governing equations on a structured grid. All arrays are 3D, i.e. a(i,j,k) and hence each expression typically contains a(i,j+1,k), a(i,j,k-1) etc terms, which are not immediately next to each other storage-wise.

I would appreciate it if you could comment more specifically on what "economize access to grid information" and "cache locality" mean and how one can go about dealing with these issues. I don't expect a recipe, but rather a reference to papers/books etc. that explain this. I wonder if Intel has any papers addressing these issues.

Thanks.

Arik

0 Kudos
TimP
Honored Contributor III
5,790 Views

If your data are stored that way, arranging your loops with the i index in the inner loop and j in the middle loop, and operating on all dependent variables in the same loop, should do the job. Typical cache-miss problems in such codes are due to irregular storage of automatically meshed data, so apparently you should not be seeing that much difficulty. Ideally, you want to use all the data from a cache line, then move on to the next cache line, allowing hardware prefetch to bring in the cache lines that will be needed next.
If you must operate with the k index in the inner loop, unrolling and "jamming" outer loops so that you still operate on several values of i before moving to the next k should help. ifort -O3 can deal to a limited extent with re-ordering loops, when the nested loops aren't hidden by subroutine calls.
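As a rough sketch of both points (the arrays u, v, w, their right-hand sides fu, fv, fw, and the step dt are made-up names for illustration only): keeping i innermost means consecutive iterations touch consecutive memory, and updating several dependent variables in the same pass reuses the grid data already in cache:

      do k = 1, Nz
        do j = 1, Ny
          do i = 1, Nx
            ! i is the contiguous index, so this sweeps memory in storage order
            u(i,j,k) = u(i,j,k) + dt*fu(i,j,k)
            v(i,j,k) = v(i,j,k) + dt*fv(i,j,k)
            w(i,j,k) = w(i,j,k) + dt*fw(i,j,k)
          enddo
        enddo
      enddo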
0 Kudos
arikd
Beginner
5,790 Views

We have definitely always used i as the inner loop, etc. Here's a sample loop from the code:

!$omp parallel do
!$omp& private( i, j, k )
      do k = 1, Nz
        do j = 1, Ny
          do i = 1, Nx
            ! if possible, use canned routines
            rb(i,j,k) = z(i,j,k) + beta*rb(i,j,k)
          enddo
        enddo
      enddo

Apparently this has not done the trick.

0 Kudos
Steven_L_Intel1
Employee
5,790 Views
Are you using version 10.1? That version has an improved optimizer that can both parallelize and vectorize code more efficiently.

You first asked about the relative difference between single and double. How is the overall performance compared to the UNIX platforms?
0 Kudos
TimP
Honored Contributor III
5,790 Views
As you're using OpenMP on Windows, setting the KMP_AFFINITY environment variable should bring some benefit, in case you haven't tested it.
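For example, on Windows (the value shown is just one of the recognized settings; check the documentation for your compiler version for the supported affinity types):

set KMP_AFFINITY=compact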
0 Kudos
jimdempseyatthecove
Honored Contributor III
5,790 Views

Arikd,

Your sample parallel do loop might benefit from declaring (via !DEC$ directives) the arrays rb and z as aligned, and thus improving the vectorization performance of the innermost loop.
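
A minimal sketch of that idea, assuming rb and z are ordinary local declarations with known extents (whether the directives can be applied to your actual arrays depends on how they are declared and allocated; 16 bytes matches the SSE vector width):

      real(8) :: rb(Nx,Ny,Nz), z(Nx,Ny,Nz)
!DEC$ ATTRIBUTES ALIGN : 16 :: rb
!DEC$ ATTRIBUTES ALIGN : 16 :: z

      do k = 1, Nz
        do j = 1, Ny
!DIR$ VECTOR ALIGNED
          do i = 1, Nx
            rb(i,j,k) = z(i,j,k) + beta*rb(i,j,k)
          enddo
        enddo
      enddo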

If Ny*Nx is quite large, then you might benefit from using a dynamic schedule with an appropriate chunk size, i.e. program it so that all available cores finish at approximately the same time.
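
For instance, applied to the loop posted above (the chunk size of 4 is only a starting point to tune):

!$omp parallel do schedule(dynamic,4) private(i,j,k)
      do k = 1, Nz
        do j = 1, Ny
          do i = 1, Nx
            rb(i,j,k) = z(i,j,k) + beta*rb(i,j,k)
          enddo
        enddo
      enddo
!$omp end parallel do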

Also, depending on the relative values of Nz and Ny and the number of cores available, it may be advantageous to swap the order of the outer loops (or test and run one of two !$omp parallel do variants).

As a diagnostic aid during performance tuning, I find it beneficial to set thread affinities at initialization. Then, because the threads don't wander between processors, the per-CPU runtime statistics on a nested loop like the one in your example will easily reveal work-starved threads.

Jim Dempsey

0 Kudos
arikd
Beginner
5,790 Views

We are still on v9.x, but are planning to port it over to 10 right after the new year.

I can't say much about UNIX these days; I have not used any UNIX computers since last century. At the time we switched, somewhere around 1994, a 486-66 was giving the SPARC a good run for the money, especially if you take into account that the least expensive SPARC was, if I recall correctly, ~10+k, while a comparable PC was ~3k.

I wonder if the r8 vs. r4 performance hit we see is due to Intel chips still being, on some level, 32-bit chips, while the old UNIX workstations were, I believe, native 64-bit.

0 Kudos
Steven_L_Intel1
Employee
5,790 Views
No; the 64 vs. 32 bit distinction has to do with address size and nothing to do with floating point.
0 Kudos
arikd
Beginner
5,790 Views
In that case, using double precision explains only the doubling of RAM requirements, but not the doubling of CPU time. Any ideas?
0 Kudos
Steven_L_Intel1
Employee
5,790 Views
I would expect double-precision arithmetic to be slower than single-precision. Without a test case to look at, I'm not willing to speculate further. I'd also want to make sure that you're getting the advantage of vectorization.
0 Kudos
TimP
Honored Contributor III
5,790 Views
Comparing single and double precision vectorized code, double precision generally requires twice as many instructions to run the same job. Also, double precision evidently doubles the volume of data movement. A 90% increase in time to complete the same job in double precision is not unusual. As Steve says, you haven't given adequate information to judge whether this applies to your application.
If your goal is to find a compilation mode where there is relatively little difference between single and double precision, you could restrict your application to x87 code: you might find that double precision typically slows down by 40% compared to vectorized code, while single precision slows down much more. Intel compilers support x87 code only in 32-bit mode; gfortran supports -mfpmath=387 in both 32- and 64-bit mode, in case you have a reason for using it.
IA-64 processors also show less difference in performance between single and double precision than Xeon does, as they have no single-precision 128-bit-wide SIMD operations.
0 Kudos
arikd
Beginner
5,790 Views

I read some papers on the subject of code optimization, and it appears to be a bit more than one can do on the fly. Could you recommend either individuals or companies that specialize in optimizing Fortran matrix solvers for Intel processors?

0 Kudos
TimP
Honored Contributor III
5,324 Views
The example you provided is so simple that it's not clear how to improve on the suggestions already made. Where you have matrix operations that fit the LAPACK and BLAS mold, you have a wide variety of "canned solvers," as your comment says, including Intel MKL, AMD ACML, and more.
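For instance, the rb = z + beta*rb update posted earlier maps onto two Level-1 BLAS calls (a sketch that treats the contiguous 3D arrays as one long vector and assumes rb, z and beta are double precision; dscal and daxpy are standard BLAS routines provided by both MKL and ACML):

      integer :: n
      n = Nx*Ny*Nz
      ! rb := beta*rb, then rb := 1.0*z + rb, i.e. rb = z + beta*rb
      call dscal(n, beta, rb, 1)
      call daxpy(n, 1.0d0, z, 1, rb, 1)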
0 Kudos
Reply