Our software does a lot of floating-point linear algebra (CFD modeling). Many operations require double precision for larger matrices. Unfortunately, the performance we have been getting out of both Intel and AMD processors has been very poor: double precision is almost 2x slower than the same code using single precision.
We tested the software under both 32-bit and 64-bit Windows (compiled with the corresponding 32- and 64-bit options) with no visible difference. Are there compiler options we are missing that could speed up double-precision computations?
Thanks.
In general, I would recommend the following options:
/O3
/QxS (or T or P, depending on which Intel processor model you're using)
/Qprec-div-
I'd also strongly recommend running the application through the VTune Performance Analyzer to see what it turns up for hot spots and for data on things such as cache misses.
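For example, a full compile line might look like this (a sketch for a Core 2 class system; the source file name is just a placeholder):

ifort /O3 /QxT /Qprec-div- solver.f90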
Thanks, Steve. My expectation, based on experience with UNIX workstations (non-Intel processors), is that the two should be roughly the same, i.e. within a few percent.
Arik
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
You definitely want to make sure that you are enabling vectorization on the Intel and AMD processors. (For AMD, use /QxW or /QxO, the latter for SSE3-capable Opterons.)
/Qipo may also be of use.
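For an SSE3-capable Opteron, for instance, the compile line might become something like this (again, the file name is a placeholder):

ifort /O3 /QxO /Qipo solver.f90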
What can we do about cache misses? Thanks,
Arik
I never had an IBM; we used SPARCs and HPs, as well as supercomputers such as Cray and Alliant.
As to vectorization options, we certainly use them, but they are the ones that give us grief. There are cases where one matrix solver dies, even though it works fine with these options off.
But first, you need to know what exactly the program is doing, and that's where VTune can help.
Specifically, the software spends about 90% of its time in subroutines that use conjugate-gradient methods to solve the banded matrices resulting from discretizing the governing equations on a structured grid. All arrays are 3D, i.e. a(i,j,k), and hence each expression typically contains terms such as a(i,j+1,k) and a(i,j,k-1), which are not immediately next to each other storage-wise.
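To make the access pattern concrete, a typical statement has roughly this shape (a simplified sketch, not the actual code; c1 and c2 stand in for the real stencil coefficients):

do k = 2, Nz-1
  do j = 2, Ny-1
    do i = 2, Nx-1
      ! a(i,j+1,k) is Nx elements away in memory; a(i,j,k-1) is Nx*Ny away
      r(i,j,k) = b(i,j,k) - c1*a(i,j+1,k) - c2*a(i,j,k-1)
    enddo
  enddo
enddo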
I would appreciate it if you could comment more specifically on what "economize access to grid information" and "cache locality" mean, and how one can go about dealing with these issues. I don't expect a recipe, but rather a reference to papers, books, etc. that explain this. I wonder whether Intel has any papers addressing these issues.
Thanks.
Arik
If your data are stored that way, arranging your loops with the i index in the inner loop and j in the middle loop, and operating on all dependent variables in the same loop, should do the job. Typical cache-miss problems in such codes are due to irregular storage of automatically meshed data; apparently, you should not be seeing so much difficulty. Ideally, you want to use all the data in a cache line, then move on to the next cache line, allowing hardware prefetch to bring in the cache lines that will be needed next.
If you must operate with the k index in the inner loop, unrolling and "jamming" outer loops so that you still operate on several values of i before moving to the next k should help, as in the sketch below. ifort -O3 can deal to a limited extent with re-ordering loops, when the nested loops aren't hidden behind subroutine calls.
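Schematically (an illustrative sketch only; assumes Nx is a multiple of 4, with s a generic scalar):

do j = 1, Ny
  do i = 1, Nx, 4
    ! i is unrolled by 4 and "jammed" into the k loop, so each pass
    ! down k uses four consecutive i elements of each cache line
    do k = 1, Nz
      a(i,j,k)   = s*a(i,j,k)
      a(i+1,j,k) = s*a(i+1,j,k)
      a(i+2,j,k) = s*a(i+2,j,k)
      a(i+3,j,k) = s*a(i+3,j,k)
    enddo
  enddo
enddo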
We have definitely and always used i as the inner loop, etc. Here's a sample loop from the code:
!$omp parallel do
!$omp& private( i, j, k )
do k = 1, Nz
  do j = 1, Ny
    do i = 1, Nx
      ! if possible, use canned routines
      rb(i,j,k) = z(i,j,k) + beta*rb(i,j,k)
    enddo
  enddo
enddo
Apparently this has not done the trick.
You first asked about the relative difference between single and double precision. How does the overall performance compare to the UNIX platforms?
Arik,
Your sample parallel do loop might benefit from using a !dec$ directive to declare the arrays rb and z as aligned, and thus improve the vectorization performance of the innermost loop.
If Ny*Nx is quite large, then you might also benefit from using a dynamic schedule with an appropriate beginning chunk size, i.e. program it so that all available cores finish at approximately the same time.
Also, depending on the relative values of Nz and Ny and the number of cores available, it may be advantageous to swap the order of the outer loops (or test and run one of two !$omp parallel do variants).
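A minimal sketch of the first two suggestions (the 16-byte alignment and the chunk size of 8 are placeholder values to tune; the directive spelling assumes ifort):

real(8) :: rb(Nx,Ny,Nz), z(Nx,Ny,Nz)
!dec$ attributes align:16 :: rb, z
! ... initialization as before ...
!$omp parallel do schedule(dynamic, 8) private(i, j, k)
do k = 1, Nz
  do j = 1, Ny
    do i = 1, Nx
      rb(i,j,k) = z(i,j,k) + beta*rb(i,j,k)
    enddo
  enddo
enddo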
As a diagnostic aid during performance tuning, I find it beneficial to set thread affinities at initialization. Then, due to the threads not squirming about among different processors, the per-CPU runtime statistics on a nested loop such as your example will easily disclose work-starved threads.
Jim Dempsey
We are still on v.9.x, but are planning to port over to 10 right after the new year.
I can't say much about UNIX these days; I have not used any UNIX computers since the last century. At the time we switched, somewhere around 1994, a 486-66 was giving the SPARC a good run for the money, especially if you take into account that the least expensive SPARC cost, if I recall correctly, ~$10k+, while a comparable PC was ~$3k.
I wonder if the REAL*8 vs REAL*4 performance hit we see is due to Intel chips still being, on some level, 32-bit chips, while the old UNIX workstations were, I believe, native 64-bit.
If your goal is to find a compilation mode where there is relatively little difference between single and double precision, you could restrict your application to x87 code: you might find that double precision typically slows down by 40% compared to vectorized code, while single precision slows down much more. The reason is that a 128-bit SSE register holds four single-precision operands but only two doubles, so vectorized single precision has roughly twice the peak throughput, an advantage that x87 removes. Intel compilers support x87 code only when running in 32-bit mode; gfortran supports -mfpmath=387 in both 32- and 64-bit mode, in case you have a reason for using it.
IA-64 processors also show less difference in performance between single and double precision than Xeon does, since IA-64 has no single-precision 128-bit-wide SIMD operations.
I have read some papers on the subject of code optimization, and it appears to be a bit more than one can do on the fly. I wonder if you could recommend individuals or companies who specialize in optimizing Fortran matrix solvers for Intel processors?