Solved: Re: Slower Intel Compiled Program

psantos · ‎12-27-2009

Good evening. I recently registered on this forum and I need some help. For about a year I used Compaq Visual Fortran to develop and compile my programs. Recently I became interested in Intel Visual Fortran because I read in various magazines and Internet sites that this compiler is more efficient and that its programs are better optimized than with Compaq. However, I'm a bit frustrated: I spent some work converting my Compaq programs to Intel and found that they run much more slowly with Intel. In some of them the overall computing time increased by a factor of 4. What worries me most is that the difference in computing time is obvious: not an increase of 5 or 10%, but 400%. If somebody can help me I will be very grateful.

Pedro

Martyn_C_Intel · ‎12-28-2009

Hi Pedro,
Does your program contain any REAL*16 or COMPLEX*32 code or calls, e.g. beginning with DQ or ZQ?
The only area I am aware of where the latest IMSL is substantially slower than the one shipped with CVF is in the use of quad precision (REAL*16) arithmetic. That is because CVF did not support REAL(16), so the IMSL authors had implemented an approximate form of extended precision. The Intel compiler does support full IEEE quad precision, but does so in software, so it is rather slow, significantly slower than the approximate extended precision implemented in the IMSL shipping with CVF. Certain double precision routines from IMSL use quad precision internally to improve accuracy. If you have a tool such as the Intel VTune Performance Analyzer, you might be able to generate a call graph that would show calls to DQ or ZQ functions. I have seen an application that called the linear solver DLSACG that slowed down by a factor of between 2X and 4X due to this effect.
If this is indeed the explanation of your observations, you might consider trying the single precision version of the solver you are using, since I believe that in this case the intermediate sums would still be accumulated in double precision. Perhaps the accuracy would be sufficient for your needs.
Otherwise, you might consider posting to an IMSL forum and asking whether VNI/Rogue Wave would consider reverting to the older, less accurate but faster implementation of quad precision for accumulation routines in future versions of IMSL, if the precision seems sufficient.
And as Steve has said, perhaps you could find a different solver that meets your needs in another library such as MKL. For solvers in either IMSL or MKL, you may be able to get a significant speedup on multi-core systems if threading is enabled.

Martyn Corden
Intel Developer Support
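
The cost Martyn describes can be checked on a given machine with a small timing experiment. The following is only a sketch, assuming the compiler provides a REAL(16) kind; the program and variable names are illustrative and it does not reproduce what IMSL does internally. It accumulates the same dot product once in double precision and once in quad precision and prints both timings.

```fortran
program quad_accum_sketch
  implicit none
  ! Kind parameters; qp is negative (and compilation fails) if the
  ! compiler has no quad precision real kind.
  integer, parameter :: dp = selected_real_kind(15, 307)
  integer, parameter :: qp = selected_real_kind(33, 4931)
  integer, parameter :: i8 = selected_int_kind(18)
  integer, parameter :: n = 2000000      ! illustrative problem size
  real(dp), allocatable :: x(:), y(:)
  real(dp) :: sum_dp
  real(qp) :: sum_qp
  integer(i8) :: t0, t1, t2, rate
  integer :: i

  allocate(x(n), y(n))
  call random_number(x)
  call random_number(y)
  call system_clock(count_rate=rate)

  ! Dot product accumulated in double precision
  call system_clock(t0)
  sum_dp = 0.0_dp
  do i = 1, n
     sum_dp = sum_dp + x(i)*y(i)
  end do
  call system_clock(t1)

  ! Same dot product with products and sum carried in quad precision
  sum_qp = 0.0_qp
  do i = 1, n
     sum_qp = sum_qp + real(x(i), qp)*real(y(i), qp)
  end do
  call system_clock(t2)

  print *, 'double accumulation:', sum_dp, real(t1 - t0, dp)/real(rate, dp), 's'
  print *, 'quad accumulation:  ', real(sum_qp, dp), real(t2 - t1, dp)/real(rate, dp), 's'
end program quad_accum_sketch
```

The ratio of the two timings depends on the compiler, options and CPU, so it should be measured rather than assumed.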

Steven_L_Intel1 · ‎12-27-2009

Can you show us a program that has this behavior? Are you sure you are building with comparable options?

psantos · ‎12-27-2009

Quoting - Steve Lionel (Intel)

Can you show us a program that has this behavior? Are you sure you are building with comparable options?

The program is lengthy and it reads its input from .txt files. On Compaq I never changed the default options of the compiler. On Intel I disabled parallelization and chose the build options suited to the Intel Core Duo, since I'm currently running an Intel T2500. I also tried changing the type of optimization used by Intel and the overall result is the same. I will be sincerely grateful if you can give me some hints. Thank you very much.

Pedro

Steven_L_Intel1 · ‎12-27-2009

Without seeing the application, there's little I can suggest. What options are you using? If you are in Visual Studio, go to the Fortran > Command Line page and copy-paste the options shown there.

psantos · ‎12-27-2009

Yes, I'm using Visual Studio 2008. I'm also using the f90 version of the IMSL libraries. That is the only difference between the Intel code and the Compaq code, since in my Compaq code I used the f77 version of IMSL. The compiler options are the following:
/nologo /debug:full /QaxSSE3 /QxHost /Qipo /assume:nocc_omp
/arch:SSE3 /warn:unused /Qopt-report:2 /Qsave /iface:cvf
/module:"Debug/" /object:"Debug/" /traceback /check:bounds
/libs:static /threads /dbglibs /c

Thank you for your help.

Pedro

TimP · ‎12-27-2009

debug sets /Od, since you don't specify a level of optimization. /check:bounds is likely to compound the slowness. If you had permitted optimization, it might have been good to choose just one /arch option; supposing that the compiler takes you at your word that you want separate paths for 4 architectures, the code size expansion may hurt. If you have worked to make your code standard, you shouldn't need /Qsave /iface:cvf.

Steven_L_Intel1 · ‎12-28-2009

As Tim notes, you're using a Debug configuration, which is unoptimized. Please switch to a Release configuration. I suggest leaving /QxHost and removing the /QaxSSE3 and /arch:SSE3. If you plan on running the application on other systems, then remove /QxHost and read the documentation to see which /Qx option makes sense for you.

psantos · ‎12-28-2009

Thank you!
I tried to make all the changes you suggested. My command line now looks like this:
/nologo /debug:minimal /O3 /Qipo /arch:SSE2 /warn:unused /Qopt-report:2 /module:"Debug/"
/object:"Debug/" /libs:dll /threads /c

However, my code is still slow compared with Compaq. Other suggestions will be appreciated!

Pedro

psantos · ‎12-28-2009

Quoting - Steve Lionel (Intel)

As Tim notes, you're using a Debug configuration, which is unoptimized. Please switch to a Release configuration. I suggest leaving /QxHost and removing the /QaxSSE3 and /arch:SSE3. If you plan on running the application on other systems, then remove /QxHost and read the documentation to see which /Qx option makes sense for you.

Steve, I switched to Release and disabled the options you mentioned. My command line is now:

/nologo /O3 /QxHost /module:"Release/" /object:"Release/" /libs:static /threads /c

And... the program is still slow. I'm very disoriented and can hardly believe this is possible. I have a question: which run-time library should I use? I'm currently using /libs:static /threads. Is this correct? Thank you for your help.

Pedro

Steven_L_Intel1 · ‎12-28-2009

The library type you selected is correct. Does your program do a lot of I/O? Try adding /assume:buffered_io

If that does not help, please ZIP the project ("build > clean" it first) and attach it to a reply here. Include any data files it needs.

psantos · ‎12-28-2009

Quoting - Steve Lionel (Intel)

The library type you selected is correct. Does your program do a lot of I/O? Try adding /assume:buffered_io

If that does not help, please ZIP the project ("build > clean" it first) and attach it to a reply here. Include any data files it needs.

I have tried /assume:buffered_io but it didn't help. I attached one of my programs, perhaps the one with the largest time increase. Note that for it to work, the path inside the file "project.txt" must match the location where you place the program. Thanks for your help one more time.

Pedro

Steven_L_Intel1 · ‎12-28-2009

Pedro,

Thanks, but I'm very confused about something. How did you build this with CVF when it relies on IMSL 6 that was not provided for CVF? Your CVF project's settings point to the newer IMSL modules and libraries which would not work with CVF, and there are references to the "Fortran 90" interfaces for IMSL which CVF did not support.

Is your real CVF project available for me to look at?

psantos · ‎12-28-2009

Quoting - Steve Lionel (Intel)

Pedro,

Thanks, but I'm very confused about something. How did you build this with CVF when it relies on IMSL 6 that was not provided for CVF? Your CVF project's settings point to the newer IMSL modules and libraries which would not work with CVF, and there are references to the "Fortran 90" interfaces for IMSL which CVF did not support.

Is your real CVF project available for me to look at?

The source code I attached in my previous reply is the Intel compiler version, not the Compaq version. The Compaq version is attached to this reply. Sorry about the confusion. Thank you again for spending your time helping me.

Pedro

Steven_L_Intel1 · ‎12-28-2009

Pedro,

It's not the compiler, it's IMSL. Your program spends most of its time calling IMSL routines, so even if the compiler generates better code for your sources, the execution time is completely dominated by the calls to IMSL.

First, I took your CVF sources and compiled them with CVF and IVF. I did see the slowdown you mentioned. (You had made a lot of changes in the IVF version so I wanted to compare same sources.) I then took the CVF project and forced it to link to the newer IMSL. When I ran this, the execution time was actually a bit slower than when compiled with IVF (not by much.)

There have been many changes made to IMSL since the IMSL 4 days (what came with CVF), and it's evident that at least one of the routines you're calling has slowed down a lot. Perhaps that was needed to get better accuracy or reliability, I don't know. It would take significant further investigation to identify the routine(s) responsible, but I'm doubtful that they would do anything about it in the near term.

I was able to cut the time down by about 1/3 by specifying "link_fnl_static_hpc.h" as the library selection, but it was still twice the time of the CVF build.

Sorry I don't have better news for you here. The only thing I can suggest at the moment is to see if any of the IMSL calls can be replaced with calls to Intel Math Kernel Library routines. I do have one side-comment: in your newer version where you used the "F90" interfaces to IMSL, please call the generic names of the routines and not D_ or S_ variants.
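
One general benefit of generic names in Fortran is that the compiler picks the specific routine from the kind of the arguments, so a later change of precision needs no edits at the call sites. The following self-contained sketch illustrates the mechanism only; the module scale_int and the routines scale_vec, s_scale_vec and d_scale_vec are hypothetical names chosen to mimic the S_/D_ convention and are not part of IMSL.

```fortran
module scale_int                      ! hypothetical module, not an IMSL one
  implicit none
  private
  public :: scale_vec
  interface scale_vec                 ! generic name used by callers
     module procedure s_scale_vec, d_scale_vec
  end interface
contains
  subroutine s_scale_vec(x, f)        ! single precision specific
    real(kind(1.0e0)), intent(inout) :: x(:)
    real(kind(1.0e0)), intent(in)    :: f
    x = f*x
  end subroutine s_scale_vec

  subroutine d_scale_vec(x, f)        ! double precision specific
    real(kind(1.0d0)), intent(inout) :: x(:)
    real(kind(1.0d0)), intent(in)    :: f
    x = f*x
  end subroutine d_scale_vec
end module scale_int

program generic_name_sketch
  use scale_int
  implicit none
  real(kind(1.0e0)) :: xs(3) = [1.0e0, 2.0e0, 3.0e0]
  real(kind(1.0d0)) :: xd(3) = [1.0d0, 2.0d0, 3.0d0]

  ! The same generic call resolves to s_scale_vec or d_scale_vec
  ! depending on the kind of the actual arguments.
  call scale_vec(xs, 2.0e0)
  call scale_vec(xd, 2.0d0)
  print *, xs
  print *, xd
end program generic_name_sketch
```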

psantos · ‎12-28-2009

Quoting - Steve Lionel (Intel)

Pedro,

It's not the compiler, it's IMSL. Your program spends most of its time calling IMSL routines, so even if the compiler generates better code for your sources, the execution time is completely dominated by the calls to IMSL.

First, I took your CVF sources and compiled them with CVF and IVF. I did see the slowdown you mentioned. (You had made a lot of changes in the IVF version so I wanted to compare same sources.) I then took the CVF project and forced it to link to the newer IMSL. When I ran this, the execution time was actually a bit slower than when compiled with IVF (not by much.)

There have been many changes made to IMSL since the IMSL 4 days (what came with CVF), and it's evident that at least one of the routines you're calling has slowed down a lot. Perhaps that was needed to get better accuracy or reliability, I don't know. It would take significant further investigation to identify the routine(s) responsible, but I'm doubtful that they would do anything about it in the near term.

I was able to cut the time down by about 1/3 by specifying "link_fnl_static_hpc.h" as the library selection, but it was still twice the time of the CVF build.

Sorry I don't have better news for you here. The only thing I can suggest at the moment is to see if any of the IMSL calls can be replaced with calls to Intel Math Kernel Library routines. I do have one side-comment: in your newer version where you used the "F90" interfaces to IMSL, please call the generic names of the routines and not D_ or S_ variants.

Steve, you can imagine how much you have helped me! I'm very grateful to you. Although I don't have any performance analysis tool, I also suspected that IMSL was related to the slowdown, since it's the major difference between the two programs. I think the routine LSARG() is mainly responsible, since I need it in double precision and it is called at least twice in each program iteration. I don't know much about MKL, but it probably has a linear system solver. I will take a look at the documentation and confirm whether it gives a speedup or not. Finally, thank you for your advice about the generic names. I will correct this in my program. Thank you once more for the time spent on my problem.

Gratefully,

Pedro

ZlamalJakub · ‎12-28-2009

Try AMD CodeAnalyst, which is free: http://developer.amd.com/cpu/codeanalyst/Pages/default.aspx It can give you the time spent in each called routine (I hope it will be able to identify the IMSL routines one by one).

Jakub
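
Where no profiler is at hand, a rough first check of which call dominates is to wrap the suspected call in timers and accumulate the elapsed time over all iterations. The sketch below uses only the standard SYSTEM_CLOCK intrinsic; the inner loop is a placeholder standing in for the real solver call, and all names are illustrative.

```fortran
program time_section_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: i8 = selected_int_kind(18)
  integer(i8) :: t0, t1, rate
  real(dp) :: solver_seconds, s
  integer :: iter, i

  call system_clock(count_rate=rate)
  solver_seconds = 0.0_dp
  s = 0.0_dp

  do iter = 1, 100                     ! stands in for the program's main iteration
     call system_clock(t0)
     ! Placeholder work: replace this loop with the call being measured,
     ! for example the linear solver call.
     do i = 1, 100000
        s = s + sqrt(real(i, dp))
     end do
     call system_clock(t1)
     solver_seconds = solver_seconds + real(t1 - t0, dp)/real(rate, dp)
  end do

  print *, 'time spent in timed section (s):', solver_seconds
  print *, 'checksum (keeps the loop from being optimized away):', s
end program time_section_sketch
```

Comparing this accumulated time with the total run time shows what fraction of the run the timed call accounts for.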

psantos · ‎12-28-2009

Steve and Martyn, I replaced the IMSL LSARG with a solver set (decomposition + solver + iterative refinement) from MKL and the performance is astonishing. For the same test case my old program takes about 2 minutes and the new one with MKL takes 2 seconds. What a speed improvement! Thank you very much for the help. This forum is simply brilliant.

Gratefully,

Pedro
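
Pedro does not say which MKL routines he used. As a sketch only, assuming the standard LAPACK double precision routines that MKL provides, the decomposition + solver + iterative refinement sequence could look like the following. The 3x3 system is made-up data whose exact solution is (1, 1, 1), and the program must be linked against MKL or another LAPACK implementation.

```fortran
program lu_refine_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 3, nrhs = 1
  real(dp) :: a(n,n), af(n,n), b(n,nrhs), x(n,nrhs)
  real(dp) :: ferr(nrhs), berr(nrhs), work(3*n)
  integer  :: ipiv(n), iwork(n), info

  ! Illustrative symmetric system; each row sums to the matching entry of b
  a = reshape([4.0_dp, 2.0_dp, 1.0_dp, &
               2.0_dp, 5.0_dp, 3.0_dp, &
               1.0_dp, 3.0_dp, 6.0_dp], [n, n])
  b(:,1) = [7.0_dp, 10.0_dp, 10.0_dp]

  ! 1. Decomposition: LU-factor a copy, since refinement needs the original A
  af = a
  call dgetrf(n, n, af, n, ipiv, info)
  if (info /= 0) stop 'dgetrf failed'

  ! 2. Solve with the factors; dgetrs overwrites its right-hand side with x
  x = b
  call dgetrs('N', n, nrhs, af, n, ipiv, x, n, info)
  if (info /= 0) stop 'dgetrs failed'

  ! 3. Iterative refinement of x, with forward and backward error estimates
  call dgerfs('N', n, nrhs, a, n, af, n, ipiv, b, n, x, n, &
              ferr, berr, work, iwork, info)
  if (info /= 0) stop 'dgerfs failed'

  print *, 'x =', x(:,1)
  print *, 'estimated forward error bound =', ferr(1)
end program lu_refine_sketch
```

When the same matrix is reused with several right-hand sides, the factorization in step 1 can be done once and only steps 2 and 3 repeated.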

Steven_L_Intel1 · ‎12-28-2009

Wonderful news! Many of our customers are finding how easy it is to get better performance by calling MKL.