Good evening. I recently registered on this forum and I need some help. For about a year I used Compaq Visual Fortran to develop and compile my programs. Recently I became interested in Intel Visual Fortran because I read in various magazines and on Internet sites that this compiler is more efficient and produces better-optimized programs than Compaq. However, I'm a bit frustrated: I put some work into converting my Compaq programs to Intel and found that they run much more slowly with Intel. In some of them the overall computing time increases by a factor of 4. What worries me most is how large the difference in computing time is: not an increase of 5 or 10%, but 400%. If somebody can help me I will be very grateful.
Pedro
1 Solution
Hi Pedro,
Does your program contain any REAL*16 or COMPLEX*32 code or calls, e.g. to routines beginning with DQ or ZQ?
The only area I am aware of where the latest IMSL is substantially slower than the one shipped with CVF is in the use of quad precision (REAL*16) arithmetic. That is because CVF did not support REAL(16), so the IMSL authors had implemented an approximate form of extended precision. The Intel compiler does support full IEEE quad precision, but does so in software, so it is rather slow, significantly slower than the approximate extended precision implemented in the IMSL shipping with CVF. Certain double precision routines from IMSL use quad precision internally to improve accuracy. If you have a tool such as the Intel VTune Performance Analyzer, you might be able to generate a call graph that would show calls to DQ or ZQ functions. I have seen an application that called the linear solver DLSACG slow down by a factor of between 2X and 4X due to this effect.
If this is indeed the explanation of your observations, you might consider trying the single precision version of the solver you are using, since I believe that in this case the intermediate sums would still be accumulated in double precision. Perhaps the accuracy would be sufficient for your needs.
Otherwise, you might consider posting to an IMSL forum and asking whether VNI/Rogue Wave would consider reverting to the older, less accurate but faster implementation of quad precision for accumulation routines in future versions of IMSL, if the precision seems sufficient.
And as Steve has said, perhaps you could find a different solver that meets your needs in another library such as MKL. For solvers in either IMSL or MKL, you may be able to get a significant speedup on multi-core systems if threading is enabled.
Martyn Corden
Intel Developer Support
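To illustrate why a double precision routine might accumulate its sums in a wider format, here is a minimal Python sketch. It is not IMSL's implementation: `dot_naive` and `dot_extended` are hypothetical names, and exact rational arithmetic from Python's `fractions` module stands in for a wider (quad-like) accumulator. Like quad precision done in software, the wider accumulation is more accurate but noticeably slower per operation.

```python
from fractions import Fraction


def dot_naive(x, y):
    """Dot product accumulated directly in double precision."""
    total = 0.0
    for a, b in zip(x, y):
        total += a * b  # each addition rounds to double
    return total


def dot_extended(x, y):
    """Dot product accumulated exactly, rounded to double once at the end.

    Fraction is an assumption standing in for a wider accumulator;
    it is exact, which is stronger than real quad precision.
    """
    total = Fraction(0)
    for a, b in zip(x, y):
        total += Fraction(a) * Fraction(b)  # no rounding here
    return float(total)


if __name__ == "__main__":
    # The small term is lost when added to 1e16 in double precision.
    x = [1e16, 1.0, -1e16]
    y = [1.0, 1.0, 1.0]
    print("naive   :", dot_naive(x, y))
    print("extended:", dot_extended(x, y))
```

With these inputs the naive sum cancels to zero because the 1.0 is rounded away when added to 1e16, while the exact accumulation keeps it. A solver that needs accurate residuals has the same problem on a larger scale, which is one reason a library might pay for wider arithmetic internally.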
18 Replies
Can you show us a program that has this behavior? Are you sure you are building with comparable options?
Quoting - Steve Lionel (Intel)
Can you show us a program that has this behavior? Are you sure you are building with comparable options?
Pedro
Without seeing the application, there's little I can suggest. What options are you using? If you are in Visual Studio, go to the Fortran > Command Line page and copy-paste the options shown there.
Yes, I'm using Visual Studio 2008. I'm also using the f90 version of the IMSL libraries. That is the only difference between the Intel code and the Compaq code, since in my Compaq code I used the f77 version of IMSL. The compiler options are the following:
/nologo /debug:full /QaxSSE3 /QxHost /Qipo /assume:nocc_omp
/arch:SSE3 /warn:unused /Qopt-report:2 /Qsave /iface:cvf
/module:"Debug/" /object:"Debug/" /traceback /check:bounds
/libs:static /threads /dbglibs /c
Thank you for your help.
Pedro
/debug:full sets /Od, since you don't specify a level of optimization. /check:bounds is likely to compound the slowness. If you had permitted optimization, it might have been better to choose just one /arch option; supposing that the compiler takes you at your word that you want separate code paths for 4 architectures, the code size expansion may hurt. If you have worked to make your code standard-conforming, you shouldn't need /Qsave /iface:cvf.
As Tim notes, you're using a Debug configuration, which is unoptimized. Please switch to a Release configuration. I suggest leaving /QxHost and removing the /QaxSSE3 and /arch:SSE3. If you plan on running the application on other systems, then remove /QxHost and read the documentation to see which /Qx option makes sense for you.
Thank you!
I tried to make all the changes you suggested. My command line now looks like this:
/nologo /debug:minimal /O3 /Qipo /arch:SSE2 /warn:unused /Qopt-report:2 /module:"Debug/"
/object:"Debug/" /libs:dll /threads /c
However, my code is still slow compared with Compaq. Other suggestions will be appreciated!
Pedro
Quoting - Steve Lionel (Intel)
As Tim notes, you're using a Debug configuration, which is unoptimized. Please switch to a Release configuration. I suggest leaving /QxHost and removing the /QaxSSE3 and /arch:SSE3. If you plan on running the application on other systems, then remove /QxHost and read the documentation to see which /Qx option makes sense for you.
/nologo /O3 /QxHost /module:"Release/" /object:"Release/" /libs:static /threads /c
And... the program is still slow. I'm very confused and can hardly believe this is possible. I have a question: which run-time library should I use? I'm currently using /libs:static /threads. Is this correct? Thank you for your help.
Pedro
The library type you selected is correct. Does your program do a lot of I/O? Try adding /assume:buffered_io
If that does not help, please ZIP the project ("build > clean" it first) and attach it to a reply here. Include any data files it needs.
Quoting - Steve Lionel (Intel)
The library type you selected is correct. Does your program do a lot of I/O? Try adding /assume:buffered_io
If that does not help, please ZIP the project ("build > clean" it first) and attach it to a reply here. Include any data files it needs.
Pedro
Pedro,
Thanks, but I'm very confused about something. How did you build this with CVF when it relies on IMSL 6 that was not provided for CVF? Your CVF project's settings point to the newer IMSL modules and libraries which would not work with CVF, and there are references to the "Fortran 90" interfaces for IMSL which CVF did not support.
Is your real CVF project available for me to look at?
Quoting - Steve Lionel (Intel)
Pedro,
Thanks, but I'm very confused about something. How did you build this with CVF when it relies on IMSL 6 that was not provided for CVF? Your CVF project's settings point to the newer IMSL modules and libraries which would not work with CVF, and there are references to the "Fortran 90" interfaces for IMSL which CVF did not support.
Is your real CVF project available for me to look at?
Pedro
Pedro,
It's not the compiler, it's IMSL. Your program spends most of its time calling IMSL routines, so even if the compiler generates better code for your sources, the execution time is completely dominated by the calls to IMSL.
First, I took your CVF sources and compiled them with CVF and IVF. I did see the slowdown you mentioned. (You had made a lot of changes in the IVF version so I wanted to compare same sources.) I then took the CVF project and forced it to link to the newer IMSL. When I ran this, the execution time was actually a bit slower than when compiled with IVF (not by much.)
There have been many changes made to IMSL since the IMSL 4 days (what came with CVF), and it's evident that at least one of the routines you're calling has slowed down a lot. Perhaps that was needed to get better accuracy or reliability, I don't know. It would take significant further investigation to identify the routine(s) responsible, but I'm doubtful that they would do anything about it in the near term.
I was able to cut the time down by about 1/3 by specifying "link_fnl_static_hpc.h" as the library selection, but it was still twice the time of the CVF build.
Sorry I don't have better news for you here. The only thing I can suggest at the moment is to see if any of the IMSL calls can be replaced with calls to Intel Math Kernel Library routines. I do have one side-comment: in your newer version where you used the "F90" interfaces to IMSL, please call the generic names of the routines and not D_ or S_ variants.
Quoting - Steve Lionel (Intel)
Pedro,
It's not the compiler, it's IMSL. Your program spends most of its time calling IMSL routines, so even if the compiler generates better code for your sources, the execution time is completely dominated by the calls to IMSL.
First, I took your CVF sources and compiled them with CVF and IVF. I did see the slowdown you mentioned. (You had made a lot of changes in the IVF version so I wanted to compare same sources.) I then took the CVF project and forced it to link to the newer IMSL. When I ran this, the execution time was actually a bit slower than when compiled with IVF (not by much.)
There have been many changes made to IMSL since the IMSL 4 days (what came with CVF), and it's evident that at least one of the routines you're calling has slowed down a lot. Perhaps that was needed to get better accuracy or reliability, I don't know. It would take significant further investigation to identify the routine(s) responsible, but I'm doubtful that they would do anything about it in the near term.
I was able to cut the time down by about 1/3 by specifying "link_fnl_static_hpc.h" as the library selection, but it was still twice the time of the CVF build.
Sorry I don't have better news for you here. The only thing I can suggest at the moment is to see if any of the IMSL calls can be replaced with calls to Intel Math Kernel Library routines. I do have one side-comment: in your newer version where you used the "F90" interfaces to IMSL, please call the generic names of the routines and not D_ or S_ variants.
Gratefully,
Pedro
Try AMD CodeAnalyst, which is free: http://developer.amd.com/cpu/codeanalyst/Pages/default.aspx It can give you information about the time spent in each called routine (I hope it will be able to identify the IMSL routines one by one).
Jakub
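If no profiler is at hand, the same question (which routine eats the time?) can be answered roughly by timing calls by hand. The following is only a sketch of that idea in Python, not a substitute for CodeAnalyst or VTune; `assemble` and `library_solve` are hypothetical stand-ins for user code and a library call, and `timed`, `totals`, `calls` and `report` are names invented for this example.

```python
import time
from collections import defaultdict
from functools import wraps

totals = defaultdict(float)  # routine name -> accumulated wall-clock seconds
calls = defaultdict(int)     # routine name -> number of calls


def timed(fn):
    """Wrap fn so that every call adds its elapsed time to `totals`."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            totals[fn.__name__] += time.perf_counter() - start
            calls[fn.__name__] += 1
    return wrapper


@timed
def assemble(n):
    # Hypothetical stand-in for the user's own code.
    return [float(i) for i in range(n)]


@timed
def library_solve(v):
    # Hypothetical stand-in for a call into a numerical library.
    return sum(x * x for x in v)


def run(iterations=3, n=1000):
    """Drive the two routines the way a main loop would."""
    result = 0.0
    for _ in range(iterations):
        result = library_solve(assemble(n))
    return result


def report():
    """Return (name, seconds) pairs, most expensive routine first."""
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

Calling `run()` and then `report()` lists the routines by accumulated time. If one library routine dominates the list, compiler options for the calling code are unlikely to change much.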
Steve and Martyn, I replaced the IMSL LSARG with a set of MKL routines (decomposition + solver + iterative refinement) and the performance is astonishing. For the same test case my old program takes about 2 minutes and the new one with MKL takes 2 seconds. What a speed improvement! Thank you very much for the help. This forum is simply brilliant.
Gratefully,
Pedro
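For readers unfamiliar with the three steps named above, here is a small pure-Python sketch of LU decomposition with partial pivoting, a triangular solve, and iterative refinement. It only illustrates the technique and is not the code used in this thread. The post does not say which MKL routines were called; in LAPACK (which MKL provides) these roles are usually filled by DGETRF, DGETRS and DGERFS, but that mapping is an assumption. The function names below are hypothetical, and the residual is computed exactly with `fractions.Fraction` as a stand-in for higher-precision accumulation.

```python
from fractions import Fraction


def lu_factor(a):
    """LU-factor a square matrix with partial pivoting; return (lu, piv)."""
    n = len(a)
    lu = [row[:] for row in a]
    piv = list(range(n))  # row i of the permuted matrix is original row piv[i]
    for k in range(n):
        # Choose the row with the largest entry in column k as the pivot row.
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if lu[p][k] == 0.0:
            raise ValueError("matrix is singular")
        if p != k:
            lu[k], lu[p] = lu[p], lu[k]
            piv[k], piv[p] = piv[p], piv[k]
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]  # multiplier, stored in place of L
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]
    return lu, piv


def lu_solve(lu, piv, b):
    """Solve A x = b using the factors returned by lu_factor."""
    n = len(lu)
    x = [b[piv[i]] for i in range(n)]  # apply the row permutation
    for i in range(n):                 # forward substitution, unit-lower L
        for j in range(i):
            x[i] -= lu[i][j] * x[j]
    for i in reversed(range(n)):       # back substitution with U
        for j in range(i + 1, n):
            x[i] -= lu[i][j] * x[j]
        x[i] /= lu[i][i]
    return x


def residual(a, x, b):
    """r = b - A x, accumulated exactly and rounded once per component."""
    return [
        float(Fraction(bi) - sum(Fraction(aij) * Fraction(xj)
                                 for aij, xj in zip(row, x)))
        for row, bi in zip(a, b)
    ]


def solve_refined(a, b, steps=2):
    """Factor once, solve, then improve x with a few refinement steps."""
    lu, piv = lu_factor(a)
    x = lu_solve(lu, piv, b)
    for _ in range(steps):
        correction = lu_solve(lu, piv, residual(a, x, b))
        x = [xi + ci for xi, ci in zip(x, correction)]
    return x


if __name__ == "__main__":
    a = [[4.0, 3.0], [6.0, 3.0]]
    b = [10.0, 12.0]
    print(solve_refined(a, b))
```

The point of the split is that the expensive factorization is done once, and each refinement step reuses the factors with only a cheap triangular solve. Only the residual needs extra precision, not the whole solve.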
Wonderful news! Many of our customers are finding how easy it is to get better performance by calling MKL.