Hi all,
Thanks for reading. I'm building my Fortran application on several platforms (OS X/Windows/Linux) using ifort. On a MacBook Pro 8,2 with a 2.4 GHz quad-core Core i7 (i7-2760QM) processor, I compile and benchmark the application:
1/ in the host OS (latest OS X, ifort 13.0.1) with "-O3"
2/ in a VMware VM + Windows 7 + ifort 12.0.5 with "Maximum speed and other optimizations" (i.e. -O3 again) and, as far as I know, no settings that would penalize speed.
Problem: the application builds fine in the Win VM and produces correct results, but is 2.5x slower.
It uses only one thread, and in both situations I've monitored CPU usage: indeed one core at 100% during runtime. Memory usage is low and disk I/O almost nonexistent. Playing with compiler settings (e.g. explicitly setting the instruction set to SSE3 or SSE4; AVX executables won't run) made no difference. I found a small speedup (5-10%) when configuring the VM's CPU settings in VMware to "enable hypervisor applications by providing support for Intel VT-x/EPT inside this VM" - whatever that means - but clearly that's nowhere near the 2.5x performance gap.
I duplicated the entire setup on an iMac with a similar processor and got about the same results.
I don't think the ifort version difference (12 vs. 13) matters much, but I can try matching them if requested.
So ... What is going on? Am I making some elementary beginner's mistake in working with VMs? Don't people always say you should get a virtualization penalty of only a few %? Is ifort unable to "see" the CPU through the VM and apply optimizations? How do I get the VM setup to exploit the hardware properly? And would the 2.5x penalty I have now carry over when my users run the Win executables on their own (native) Win systems?
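One thing I can check is whether the guest actually sees the host's instruction-set extensions. A sketch of how I'd compare them (assuming the free Sysinternals Coreinfo tool is installed in the VM; file names and exact output formats may differ):

```shell
# Inside the Windows 7 VM: Coreinfo marks supported CPU features with '*'.
coreinfo | findstr /i "SSE AVX"

# On the OS X host: list the real CPU's advertised feature flags.
sysctl -a | grep machdep.cpu.features
```

If AVX shows up on the host but not in the guest, that would explain why AVX executables won't run in the VM, though it would not by itself account for a 2.5x gap on SSE code.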
Thanks for your help ...
Kevin Jorissen
Is your app performing a lot of device I/O? In this case, display device I/O.
If so, you may have issues with the device driver (a 32-bit driver on a 64-bit system), and/or the app running at a different bit-ness than the display driver.
Jim Dempsey
I/O is essentially zero. Thanks for the suggestion.
VMware emulates at least the old Intel 440BX chipset (IIRC), and hardware emulated in software is probably exposed through constructs like PDOs (physical device objects), which are accessed by the corresponding drivers. As Sergey pointed out, such a setup does not reflect the real underlying hardware and does not fully exploit its capabilities.
I forgot to add that there is overhead in emulating real hardware: its functionality is at least partially recreated in software. For example, on Hyper-V, driver calls from an emulated device are intercepted by the VM monitor and transferred to the real hardware for execution.
Dear all,
thank you very much for your suggestions. I apologize for the long delay in response. I can only spend some time each week on this project, and it's taken me some time to make progress.
I set up a pure Windows 7 system (i.e. not a VM) and installed the full Visual Studio + Intel Fortran suite. It takes a while to install and run through all the updates, but the resulting setup is nice :).
The application sped up slightly (~10%) compared to the VM, but was still much slower (2.5x-3x) than the Mac version, despite the Windows machine having a CPU one generation newer and a somewhat higher clock speed. All in all, virtualization was not the major issue here.
Much of the execution time is spent on matrix operations using lapack/blas routines for which the source code is included in the fortran project and compiled on the host machine with full optimizations. On the Windows system, I ended up throwing this code out and instructing the compiler to use Intel MKL instead. The effect was tremendous: a speedup of 3x (from 120s to 40s), now making the Win application slightly faster than the Mac (50s) as you'd expect based on hardware specs. Note that I'm running a sequential (single-threaded) application.
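For reference, the switch was essentially a single option with ifort. A sketch of the command lines I believe are equivalent to the Visual Studio "Use Intel Math Kernel Library" project property (myapp.f90 is a placeholder name; since my application is single-threaded, I link the sequential MKL):

```shell
# Windows command line: link the sequential MKL instead of compiling
# the bundled reference BLAS/LAPACK sources.
ifort /O3 /Qmkl:sequential myapp.f90

# OS X / Linux equivalent:
ifort -O3 -mkl=sequential myapp.f90
```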
The good news is that I now have a much more performant application for my users; a 3x speedup will look very nice for the next version release :).
To be done:
* revisit the Windows VM and see if I get the same speedup from using the Intel MKL there.
* try using optimized blas/lapack on the Mac and see if that gives a speedup. (If it provides a similar speedup, then I am back at my original question, though with better performance overall. If it does not, that raises the question of why the bare lapack/blas code compiled with ifort is so atrociously slow on Windows compared to the Mac.)
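For the Mac to-do item, the two candidates I plan to try (the flags are my assumption and myapp.f90 is a placeholder; Accelerate is Apple's bundled optimized BLAS/LAPACK):

```shell
# Option 1: link Apple's Accelerate framework instead of compiling
# the reference netlib sources.
ifort -O3 myapp.f90 -framework Accelerate

# Option 2: link MKL on the Mac too, matching the new Windows build.
ifort -O3 -mkl=sequential myapp.f90
```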
Thank you all for the time spent supporting me. I appreciate it very much. Further comments are of course very welcome.
Cheers,
Kevin