Intel® Fortran Compiler

Massive performance issue with code on Haswell-EP (E5-1650v3)

Johannes_Rieke
New Contributor III
2,039 Views

Dear all,

this may be off topic, but I know of no other place to raise it, and it is Fortran related. Two days ago I got a new workstation (E5-1650v3, 64 GB RAM, SSD, Win7 64-bit) from a big global vendor and was happy to have a new toy. Unfortunately, not for long...

After bringing all drivers up to date, I installed VS and the Intel compilers in version 15.0.1. Pleased with the fast installation on an SSD, I compiled and tested my Fortran tools and was confused: all my code runs even slower than on my old workhorse (E5620). Not believing this, I tested other programs and also tried a colleague's new E5-1650v3 workstation. Same results. Then I ran the same executable with the same inputs on an Ivy Bridge-EP (E5-1650v2) and got the expected results (attached file: Performance_test_5810_custom_Fortran_code.pdf).

My compiler options are:

main code: /nologo /O3 /Qparallel /Qopt-matmul /fpp /I"D:\aux_lib\x64\Release" /I"D:\geo_lib\x64\Release" /arch:SSE3 /Qopenmp /Qvec-report1 /warn:all /real_size:64 /Qinit:zero /fp:source /Qfp-speculation=safe /module:"x64\Release\\" /object:"x64\Release\\" /Fd"x64\Release\vc100.pdb" /check:bounds /libs:static /threads /c

aux_lib: /nologo /O3 /Qparallel /Qopt-matmul /fpp /arch:SSE3 /Qopenmp /Qvec-report1 /warn:declarations /warn:unused /warn:uncalled /warn:nousage /warn:interfaces /real_size:64 /Qinit:zero /fp:source /Qfp-speculation=safe /module:"x64\Release\\" /object:"x64\Release\\" /Fd"x64\Release\vc100.pdb" /libs:static /threads /c

geo_lib: /nologo /O3 /Qparallel /Qopt-matmul /fpp /I"D:\aux_lib\x64\Release" /arch:SSE3 /Qopenmp /Qpar-report1 /Qvec-report1 /warn:all /real_size:64 /Qinit:zero /fp:source /Qfp-speculation=safe /module:"x64\Release\\" /object:"x64\Release\\" /Fd"x64\Release\vc100.pdb" /libs:static /threads /c

To be sure that it is not something in my code, I took the Intel optimized Linpack benchmark from the MKL folder and made some test runs on three different machines.
The frustrating result is again that the E5-1650v3 is nowhere near as fast as the E5-1650v2: in my tests it is 30 to 40 per cent slower. I now wonder where this comes from, and whether I have to choose other compiler options to get the desired performance on Haswell-EP. If anybody has a guess what to do, please write a comment.

Best regards, Johannes
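For anyone who wants to repeat this kind of cross-machine comparison, a minimal timing sketch follows. It is an illustration, not the procedure used for the attached results: the helper names `time_command` and `summarize` are made up for this example, and the command being timed is a stand-in that should be replaced with the actual executable and its inputs. Running the same command several times and reporting both the best and the median time helps separate a real slowdown from run-to-run noise.

```python
import statistics
import subprocess
import sys
import time


def time_command(cmd, repeats=5):
    """Run `cmd` `repeats` times and return the wall-clock times in seconds."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        # check=True raises if the program fails, so a crash is not timed as a fast run.
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        times.append(time.perf_counter() - start)
    return times


def summarize(times):
    """Return (best, median) of the measured times."""
    return min(times), statistics.median(times)


if __name__ == "__main__":
    # Hypothetical stand-in workload; replace with your own executable and arguments.
    cmd = [sys.executable, "-c", "sum(i * i for i in range(200000))"]
    best, median = summarize(time_command(cmd))
    print(f"best {best:.3f} s, median {median:.3f} s")
```

Running the identical script with the identical executable on each machine keeps the comparison limited to the hardware and firmware.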

6 Replies
Steven_L_Intel1
Employee

Well, it's not "Fortran related" in that the same EXE runs differently. I do wonder why you are saying /arch:SSE3 - you're artificially constraining the compiler. Please use /QxHost or /QxCORE-AVX2 instead and see what you get. I also wonder if you're getting proper use of the multiple cores - have you run the program under VTune to analyze the threading performance?

TimP
Honored Contributor III

The HSW platforms I had access to had no option to disable hyperthreading, as is usually done for such benchmarks.

opt-matmul would need an MKL new enough to recognize the v3 server parts. When I last tested one, there had been no public release of HSW server software tools even though the hardware launch had occurred, so we were blocked from completing some tests. MKL consistency mode might enable a comparison if you don't look for full performance on either platform.

Johannes_Rieke
New Contributor III

Dear Steve, dear Tim,

thanks for the comments.

Why /arch:SSE3: Many of our workstations are still Westmere-EP machines (E5620), which do not support AVX as far as I know. In addition, I wanted to avoid AVX in this case because, if I'm right, the turbo frequency on Haswell-EP is capped lower for AVX code than for non-AVX code. My plan was to check later whether AVX gives a speedup even though the frequency ceiling is lower. For the initial tests I preferred a direct comparison with the same executable running on all Xeon generations. Still, don't /fp:source and /Qfp-speculation=safe prevent the compiler from using SSE/AVX (at least for transcendentals)?
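The trade-off between wider vectors and a lower AVX clock can be put into rough numbers. The sketch below is a simple back-of-the-envelope model, not a measurement: it assumes run time scales inversely with clock frequency, and both function names and all numbers are hypothetical, chosen only for illustration (they are not the specifications of any particular Xeon).

```python
def net_speedup(vector_gain, base_ghz, avx_ghz):
    """Estimated speedup of an AVX build over a non-AVX build.

    vector_gain: speedup from wider vectors at equal clock (1.0 means none).
    base_ghz:    sustained clock while running the non-AVX build.
    avx_ghz:     sustained clock while running the AVX build.
    Assumes run time is inversely proportional to clock frequency.
    """
    return vector_gain * (avx_ghz / base_ghz)


def break_even_gain(base_ghz, avx_ghz):
    """Vector gain needed for the AVX build to merely match the non-AVX build."""
    return base_ghz / avx_ghz


if __name__ == "__main__":
    # Hypothetical clocks, for illustration only.
    base_ghz, avx_ghz = 3.5, 3.2
    print(f"break-even vector gain: {break_even_gain(base_ghz, avx_ghz):.3f}")
    for gain in (1.05, 1.3):
        print(f"vector gain {gain}: net speedup {net_speedup(gain, base_ghz, avx_ghz):.3f}")
```

Under these assumed clocks, code that gains only a few per cent from vectorization would end up slower with AVX, while code that vectorizes well would still come out ahead. Only a measurement on the real application can settle it.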
My self-written Fortran application had been checked with VTune earlier, and I removed some spots where threading was disadvantageous. However, I ran one test compiled without OpenMP, and the result is nearly identical to the build compiled with /Qopenmp and limited to one thread.
Hyperthreading was active on all Xeons during the tests. I have never seen a slowdown from hyperthreading before.
Does MKL 11.2 (e.g. as shipped with 15.0 Update 1) support Haswell-EP?
I have a gut feeling that the workstation manufacturer (one of the big enterprise suppliers) has to improve something (UEFI, firmware, whatever). I have opened a support request with the manufacturer's enterprise support and am very curious what they will answer. At the very least, Linpack should not show such low performance on Haswell-EP compared to Ivy Bridge-EP.
Does anybody own an E5-1650v3 and can run the Intel Xeon optimized Linpack benchmark (MKL 11.2)? A result from another brand of workstation would be interesting.


Best regards, Johannes

Johannes_Rieke
New Contributor III
Dear all,
the issue is solved. Steve, you are right: it has nothing to do with Fortran or the Intel compilers. Just 4 hours ago the manufacturer released a new UEFI firmware. After installing it, I get a completely different picture. All results are now as expected, and I'm happy again with my new toy.
For interested readers, I have attached the latest benchmark results.
I will still play around with AVX, as I'm curious about its impact on the performance of my code.
Best regards, Johannes
Steven_L_Intel1
Employee

Glad to hear it.

TimP
Honored Contributor III

Interesting that a firmware update helps.

/fp:source prevents vectorization where numerical results might vary slightly, but it should not prevent opt-matmul from choosing the most aggressive MKL function for your CPU. Likewise, MKL is not limited by your arch choice.
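Why vectorization can change numerical results at all is easiest to see with a sum reduction. The sketch below is a general illustration of floating-point reassociation, not a description of what the Intel compiler generates for any specific loop; the function names and the input values are made up for the example. A vectorized reduction typically keeps several partial sums (one per vector lane) and combines them at the end, which changes the order of the additions.

```python
import math


def sum_in_order(values):
    """Add the values strictly left to right, as a scalar loop would."""
    total = 0.0
    for v in values:
        total += v
    return total


def sum_in_lanes(values, lanes=2):
    """Mimic a vectorized reduction: `lanes` partial sums, combined at the end."""
    partial = [0.0] * lanes
    for i, v in enumerate(values):
        partial[i % lanes] += v
    return sum_in_order(partial)


if __name__ == "__main__":
    # Contrived values where the small terms are lost next to the large ones.
    values = [1e17, 1.0, -1e17, 1.0]
    print("left to right:", sum_in_order(values))
    print("two lanes:    ", sum_in_lanes(values, lanes=2))
    print("exact (fsum): ", math.fsum(values))
```

The two orderings give different answers for the same data, which is the kind of variation a value-safe floating-point model is meant to rule out by keeping the source order of operations.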
