Our software does a lot of floating-point linear algebra (CFD modeling). Many operations require double precision for larger matrices. Unfortunately, the performance we have been getting out of both Intel and AMD processors has been very poor: double precision runs almost 2x slower than the same code in single precision.
We tested the software under both 32-bit and 64-bit Windows (compiled with the corresponding 32- and 64-bit options) with no visible difference. Are there compiler options we are missing that could speed up double precision computations?
Thanks.
- Tags:
- Intel® Fortran Compiler
In general, I would recommend the following options.
/O3
/QxS (or T or P), depending on which Intel processor model you're using.
/Qprec-div-
I'd also strongly recommend running the application through the VTune Performance Analyzer to see what it comes up with for hot spots and data on things such as cache misses.
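For example, a build line combining those switches might look like the following (the source file name is just a placeholder, and /QxT assumes a Core 2 class machine; pick the /Qx letter that matches your processor):

    ifort /O3 /QxT /Qprec-div- solver.f90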
Thanks, Steve. My expectation, based on experience with UNIX workstations (non-Intel processors), is that it should be roughly the same, i.e. within a few percent.
Arik
You definitely want to make sure that you are enabling vectorization on both the Intel and AMD processors. (For AMD, use /QxW or /QxO, the latter for SSE3-capable Opterons.)
/Qipo may also be of use.
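For an AMD Opteron build with interprocedural optimization across several source files, the command might look something like this (file names are placeholders):

    ifort /O3 /QxW /Qipo main.f90 grid.f90 solver.f90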
What can we do about cache misses? Thanks,
Arik
I never had an IBM; we used SPARCs and HPs, as well as supercomputers such as Cray and Alliant.
As for the vectorization options, we certainly use them, but they are the ones that give us grief. There are cases where one matrix solver dies, even though it works fine with these options off.
But first, you need to know what exactly the program is doing, and that's where VTune can help.
Specifically, the software spends about 90% of its time in subroutines using conjugate-gradient methods to solve the banded matrices that result from discretizing the governing equations on a structured grid. All arrays are 3D, i.e. a(i,j,k), and hence each expression typically contains terms such as a(i,j+1,k), a(i,j,k-1), etc., which are not immediately next to each other storage-wise.
I would appreciate it if you could comment more specifically on what "economize access to grid information" and "cache locality" mean, and how one goes about dealing with these issues. I don't expect a recipe, but rather a reference to papers/books etc. that explain this. I wonder if Intel has any papers addressing these issues.
Thanks.
Arik
If your data are stored that way, arranging your loops with the i index in the innermost loop and j in the middle loop, and operating on all dependent variables in the same loop, should do the job. Typical cache-miss problems in such codes are due to irregular storage of automatically meshed data; apparently, you should not be seeing so much difficulty. Ideally, you want to use all the data from a cache line and then move on to the next one, allowing hardware prefetch to bring in the cache lines that will be needed next.
If you must operate with the k index in the inner loop, unrolling and "jamming" outer loops so that you still operate on several values of i before moving to the next k should help (see the sketch below). ifort -O3 can deal to a limited extent with re-ordering loops, when the nested loops aren't hidden behind subroutine calls.
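To make the unroll-and-jam idea concrete, here is a minimal sketch; the subroutine name, the recurrence, and the array shape are invented for illustration and are not taken from Arik's code:

      subroutine ksweep(a, c, Nx, Ny, Nz)
        ! Illustrative only: a recurrence along k forces k to be
        ! the innermost loop, so the i loop is unrolled by 4 and
        ! the copies are "jammed" into the k loop.  Four consecutive
        ! i values, which share cache lines, are processed at each k.
        ! Assumes Nx is a multiple of 4 (else add a remainder loop).
        implicit none
        integer, intent(in) :: Nx, Ny, Nz
        real(8), intent(in) :: c
        real(8), intent(inout) :: a(Nx,Ny,Nz)
        integer :: i, j, k
        do j = 1, Ny
          do i = 1, Nx, 4
            do k = 2, Nz
              a(i  ,j,k) = a(i  ,j,k) + c*a(i  ,j,k-1)
              a(i+1,j,k) = a(i+1,j,k) + c*a(i+1,j,k-1)
              a(i+2,j,k) = a(i+2,j,k) + c*a(i+2,j,k-1)
              a(i+3,j,k) = a(i+3,j,k) + c*a(i+3,j,k-1)
            enddo
          enddo
        enddo
      end subroutine ksweep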
We have definitely always used i as the inner loop, etc. Here's a sample loop from the code:
!$omp parallel do
!$omp& private( i, j, k )
      do k=1,Nz
        do j=1,Ny
          do i=1,Nx
            ! if possible, use canned routines
            rb(i,j,k) = z(i,j,k) + beta*rb(i,j,k)
          enddo
        enddo
      enddo
Apparently this has not done the trick.
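As the "canned routines" comment hints, an update of the form rb = z + beta*rb can also be handed to tuned BLAS level-1 routines (for example from Intel MKL). A minimal sketch, assuming rb and z are contiguous REAL(8) arrays of Nx*Ny*Nz elements; the wrapper subroutine is just for illustration:

      subroutine update_rb(rb, z, beta, n)
        ! Illustrative only: rb <- beta*rb (dscal), then
        ! rb <- rb + z (daxpy with alpha = 1), i.e. rb = z + beta*rb.
        implicit none
        integer, intent(in) :: n
        real(8), intent(in) :: beta, z(n)
        real(8), intent(inout) :: rb(n)
        call dscal(n, beta, rb, 1)
        call daxpy(n, 1.0d0, z, 1, rb, 1)
      end subroutine update_rb

Note that this makes two passes over rb, so for a purely memory-bound update the fused loop may be just as fast; canned routines pay off most where there is more arithmetic per element (dot products, matrix-vector products).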
You first asked about the relative difference between single and double. How does the overall performance compare to the UNIX platforms?
Arik,
Your sample parallel do loop might benefit from asserting (with !DEC$ directives) that the arrays rb and z are aligned, which can improve the vectorization performance of the innermost loop (a sketch follows below).
If Ny*Nx is quite large, then you might benefit from using a dynamic schedule with an appropriate initial chunk size, i.e. arrange for all available cores to finish at approximately the same time.
Also, depending on the relative values of Nz and Ny and the number of cores available, it may be advantageous to swap the order of the outer loops (or test and run one of two !$omp parallel do variants).
As a diagnostic aid during performance tuning, I find it beneficial to set thread affinities at initialization. Then, because the threads are not migrating between processors, the per-CPU runtime statistics on a nested loop like the one in your example will easily disclose work-starved threads.
Jim Dempsey
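A rough sketch of the first two suggestions applied to the loop Arik posted; the 16-byte alignment and the chunk size of 8 are illustrative guesses, the ALIGN directive belongs where rb and z are actually declared, and Nx, Ny, Nz and beta are assumed to be defined elsewhere:

      real(8) :: rb(Nx,Ny,Nz), z(Nx,Ny,Nz)
!dec$ attributes align: 16 :: rb
!dec$ attributes align: 16 :: z

!$omp parallel do private(i,j,k) schedule(dynamic,8)
      do k=1,Nz
        do j=1,Ny
          do i=1,Nx
            rb(i,j,k) = z(i,j,k) + beta*rb(i,j,k)
          enddo
        enddo
      enddo

Thread affinity itself is normally set outside the source, through the OpenMP runtime's environment settings or the operating system.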
We are still on v.9.x, but are planning to port over to 10 right after the new year.
I can't say much about UNIX these days; I have not used any UNIX computers since the last century. At the time we switched, somewhere around 1994, the 486-66 was giving the SPARC a good run for its money, especially if you take into account that the least expensive SPARC was, if I recall correctly, ~$10k+, while a comparable PC was ~$3k.
I wonder if the r8 vs. r4 performance hit we see is because Intel chips are still, at some level, 32-bit chips, while the old UNIX workstations were, I believe, natively 64-bit.
If your goal is to find a compilation mode where there is relatively little difference between single and double precision, you could restrict your application to x87 code: double precision typically slows down by about 40% compared to vectorized code, while single precision slows down much more. Intel compilers support x87 code only when running in 32-bit mode; gfortran supports -mfpmath=387 in both 32- and 64-bit mode, in case you have a reason for using it.
IA-64 processors also show less difference in performance between single and double precision than Xeon does; there are no single-precision 128-bit-wide SIMD operations there.
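If you want to try that comparison with gfortran, the two builds might look something like this (the file name is a placeholder; -mfpmath=sse needs -msse2 for double precision on 32-bit targets):

    gfortran -O2 -mfpmath=387 solver.f90
    gfortran -O2 -msse2 -mfpmath=sse solver.f90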
I read some papers on the subject of code optimization, and it appears to be a bit more than one can do on the fly. I wonder if you could recommend individuals or companies that specialize in optimizing Fortran matrix solvers for Intel processors?
