Performance drop in Windows VM

Kevin_J_ · ‎06-07-2013

Hi all,

Thanks for reading. I'm building my Fortran application on OS X, Windows and Linux using ifort. On a MacBook Pro 8,2 with a quad-core 2.4 GHz Core i7 (i7-2760QM) processor, I compile and benchmark the application:

1/ in the host OS (latest OSX, ifort 13.0.1) with "-O3"

2/ in VMware + Windows 7 + ifort 12.0.5 with "Maximum speed and other optimizations" (i.e. O3 again) and afaik no settings that would penalize speed.

Problem: the application builds fine in the Win VM and produces correct results, but is 2.5x slower.

It uses only 1 thread, and in both situations I've confirmed that CPU usage shows 1 core at 100% during runtime. Memory usage is low and disk I/O almost nonexistent. Playing with compiler settings (e.g. explicitly setting the instruction set to SSE3 or SSE4; AVX won't execute) didn't make a difference. I found a small speedup (5-10%) when configuring VMware itself, in the CPU settings for the VM, to "enable hypervisor applications by providing support for Intel VT-x/EPT inside this VM" - whatever that means - but clearly it's nowhere near enough to close the 2.5x performance gap.

I duplicated the entire setup on an iMac with a similar processor and got about the same results.

I don't think the difference in ifort version (12 vs 13) will make much of a difference, but I can try this if requested.

So ... What is going on? Am I making some elementary beginner's mistake in working with VMs? Don't people always say you should get a virtualization penalty of only a few %? Is ifort unable to "see" the CPU through the VM and apply optimizations? How do I get the VM setup to exploit the hardware properly? And would the 2.5x penalty I have now carry over when my users run the Win executables on their own (native) Win systems?

Thanks for your help ...

Kevin Jorissen

jimdempseyatthecove · ‎06-08-2013

Is your app performing a lot of device I/O? In this case, display device I/O.

If so, you may have issues with the device driver (32-bit driver on a 64-bit system), and/or the app may be running with a different bitness than the display driver.

Jim Dempsey

Kevin_J_ · ‎06-08-2013

I/O is essentially zero. Thanks for the suggestion.

SergeyKostrov · ‎06-08-2013

>>2/ in a VMWare + Windows7 + ifort 12.0.5 with "Maximum speed and other optimizations" (i.e. O3 again) and
>>afaik no settings that would penalize speed.
>>
>>Problem: the application builds fine in the Win VM and produces correct results, but is 2.5x slower.

Performance measurements taken inside VMware do not reflect reality; you need to complete a test (or a set of tests) without any virtualization software.

>>...Am I making some elementary beginner's mistake in working with VMs?

I don't think so. Just complete a test on a real operating system.

Bernard · ‎06-11-2013

VMware emulates at least an old 440BX chipset (IIRC), and the software-emulated hardware is probably represented by constructs like PDOs, which are accessed by the corresponding drivers. As Sergey pointed out, such a setup does not reflect the real underlying hardware and does not fully exploit its capabilities.

Bernard · ‎06-11-2013

I forgot to add that there is also the overhead of emulating real hardware, at least partially, by recreating its functionality in emulation software. For example, on Hyper-V, driver calls from an emulated device are intercepted by vmmonitor and transferred to the real hardware for execution.

Kevin_J_ · ‎07-19-2013

Dear all,

thank you very much for your suggestions. I apologize for the long delay in response. I can only spend some time each week on this project, and it's taken me some time to make progress.

I created a pure Windows 7 system (i.e. not a VM) and installed the whole deal with Visual Studio + Intel Fortran suite. Takes a while to install and run through all the updates but the resulting setup is nice :).

The application now sped up slightly ( ~10%?) compared to the VM, but was still much slower (2.5x - 3x) compared to the Mac version, despite the Win having a CPU that was 1 generation newer and somewhat higher clock speed. All in all it seemed that virtualization was not the major issue here.

Much of the execution time is spent on matrix operations using lapack/blas routines for which the source code is included in the fortran project and compiled on the host machine with full optimizations. On the Windows system, I ended up throwing this code out and instructing the compiler to use Intel MKL instead. The effect was tremendous: a speedup of 3x (from 120s to 40s), now making the Win application slightly faster than the Mac (50s) as you'd expect based on hardware specs. Note that I'm running a sequential (single-threaded) application.
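For reference, the ratios implied by those timings can be checked with a few lines of Python. This is only an arithmetic sketch: the variable and function names are made up for illustration, and it assumes the 50 s Mac figure refers to the build that still compiles the bundled lapack/blas sources.

```python
# Timings quoted in the paragraph above (seconds, single-threaded runs).
win_bundled_blas = 120.0  # Windows, bundled lapack/blas sources compiled with ifort
win_mkl = 40.0            # Windows, same application linked against Intel MKL
mac_bundled_blas = 50.0   # Mac (assumed: bundled lapack/blas sources, no MKL)


def speedup(slow_seconds, fast_seconds):
    """Return how many times faster the second run is than the first."""
    return slow_seconds / fast_seconds


mkl_gain = speedup(win_bundled_blas, win_mkl)
win_vs_mac_before = speedup(win_bundled_blas, mac_bundled_blas)
win_vs_mac_after = speedup(mac_bundled_blas, win_mkl)

print(f"MKL speedup on Windows:          {mkl_gain:.2f}x")
print(f"Windows slower than Mac (before): {win_vs_mac_before:.2f}x")
print(f"Windows faster than Mac (after):  {win_vs_mac_after:.2f}x")
```

So the switch to MKL accounts for the full 3x, and the Windows build goes from 2.4x slower than the Mac to 1.25x faster.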

The good news is that I now have a much more performant application for my users; a 3x speedup will look very nice for the next version release :).

To be done:

* revisit the Windows VM and see if I get the same speedup from using the Intel MKL there.

* try using optimized blas/lapack on the Mac and see if that gives a speedup. (If it does provide a similar speedup, then I am back at my original question, though with better performance. If it does not, then it begs the question of why compiling the bare lapack/blas code using ifort is so atrocious on Win compared to Mac.)

Thank you all for the time spent supporting me. I appreciate it very much. Further comments are of course very welcome.

Cheers,

Kevin

SergeyKostrov · ‎07-20-2013

Hi Kevin,

>>...The application now sped up slightly ( ~10%?) compared to the VM, but was still much slower (2.5x - 3x) compared to
>>the Mac version, despite the Win having a CPU that was 1 generation newer and somewhat higher clock speed. All in all
>>it seemed that virtualization was not the major issue here.

If the same Fortran compiler options are used, then something is really wrong. However, we have not seen any technical details from you; what you describe are just generic explanations. Please try to be technically specific regarding all these issues.

>>Much of the execution time is spent on matrix operations using lapack/blas routines for which the source code is included
>>in the fortran project and compiled on the host machine with full optimizations. On the Windows system, I ended up throwing
>>this code out and instructing the compiler to use Intel MKL instead. The effect was tremendous: a speedup of 3x
>>(from 120s to 40s), now making the Win application slightly faster than the Mac (50s) as you'd expect based on hardware specs.
>>Note that I'm running a sequential (single-threaded) application.

The performance of different matrix multiplication algorithms was evaluated recently, and I will provide you with a web link for review. Overall, a switch to MKL-based matrix multiplication is the right decision.

Please also verify that you're using the following set of Fortran compiler options (for Windows, and similar for Mac):

/O3 /fp:fast=2 /Qopt-matmul /Qmkl:parallel /Qparallel

Also, do not neglect alignment for all allocated memory blocks (I recently had a ~29% performance improvement when alignment issues were fixed on a project).

SergeyKostrov · ‎07-20-2013

>>...A performance of different matrix maltiplication algorithms was evaluated recently and I will provide
>>you with a web-link for review...

Please take into account that these are evaluation results, not benchmarking results, and they cannot be considered official or absolutely accurate.

Forum Topic: Haswell GFLOPS
Web-link: http://software.intel.com/en-us/forums/topic/394248
Txt-file with results: http://software.intel.com/sites/default/files/comment/1742855/matmultestresults.txt

Here are results for single-threaded processing:

[ Tests Set #1 - Part B - All Results Combined ]

[ 4096x4096 ]
Kronecker Based - 1.93 seconds (*)
Perfwise - 2.54 seconds ( 4000x4000 ) (**)
MKL - 3.68 seconds ( cblas_sgemm ) (*)
Strassen HBC - 11.62 seconds (*)
Fortran - 20.67 seconds ( MATMUL ) (*)
Classic - 31.36 seconds (*)

[ 8192x8192 ]
Kronecker Based - 11.26 seconds ( 8100x8100 ) (*)
Perfwise - 20.30 seconds ( 8000x8000 ) (**)
MKL - 29.34 seconds ( cblas_sgemm ) (*)
Strassen HBC - 82.03 seconds (*)
Fortran - 138.57 seconds ( MATMUL ) (*)
Classic - 252.05 seconds

[ 16384x16384 ]
Kronecker Based - 81.52 seconds (*)
Perfwise - 159.70 seconds ( 16000x16000 ) (**)
MKL - 237.76 seconds ( cblas_sgemm ) (*)
Strassen HBC - 1160.80 seconds (*)
Fortran - 1685.09 seconds ( MATMUL ) (*)
Classic - 2049.87 seconds (*)

Notes:
(*) Ivy Bridge CPU 2.80 GHz 1-core
(**) Haswell CPU 3.50 GHz 1-core
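To make the gaps in these timings easier to read, here is a small Python sketch that turns them into speedup factors relative to the Classic algorithm. The dictionary layout and the `speedup_over` helper are illustrative names only; the numbers are copied from the results above. Some entries were run at slightly different matrix sizes, and the Perfwise entries on a different CPU, so the ratios are rough indications and not a benchmark.

```python
# Single-threaded timings in seconds, copied from the results above.
# Keys are the nominal matrix sizes. Caveats: Perfwise ran at 4000/8000/16000
# on a Haswell CPU, and Kronecker Based ran at 8100 for the 8192 set.
timings = {
    4096: {"Kronecker Based": 1.93, "Perfwise": 2.54, "MKL": 3.68,
           "Strassen HBC": 11.62, "Fortran MATMUL": 20.67, "Classic": 31.36},
    8192: {"Kronecker Based": 11.26, "Perfwise": 20.30, "MKL": 29.34,
           "Strassen HBC": 82.03, "Fortran MATMUL": 138.57, "Classic": 252.05},
    16384: {"Kronecker Based": 81.52, "Perfwise": 159.70, "MKL": 237.76,
            "Strassen HBC": 1160.80, "Fortran MATMUL": 1685.09, "Classic": 2049.87},
}


def speedup_over(baseline, results):
    """Map each algorithm name to (baseline time / its own time)."""
    base = results[baseline]
    return {name: base / seconds for name, seconds in results.items()}


for size, results in timings.items():
    ratios = speedup_over("Classic", results)
    print(f"[ {size}x{size} ]")
    # Fastest algorithm first.
    for name, ratio in sorted(ratios.items(), key=lambda item: -item[1]):
        print(f"  {name:<16} {ratio:6.1f}x faster than Classic")
```

Passing `"Fortran MATMUL"` as the baseline instead gives the comparison most relevant to this thread: in the 4096x4096 set, MKL's `cblas_sgemm` comes out at roughly 5 to 6 times faster than the compiler's `MATMUL`.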