Intel® Fortran Compiler

What are the advantages of 64-bit versions of Fortran programs on Windows 7 64-bit OS?

FortranFan
Honored Contributor III
2,474 Views
Hi,

My apologies if this question has already been posted and answered - in that case, any links to those threads will be very helpful, since I didn't locate any appropriate discussion on this topic in my search.

Other than the benefit of addressable memory beyond the roughly 4 GB limit of a 32-bit OS, and graphics-related accelerations, what are the other advantages of developing 64-bit versions of Fortran applications (say, DLLs and executables) on the Windows 7 64-bit OS?

To give you some background, I develop and maintain a large library of mixed-language programs involving Fortran and C/C++, interfacing with both unmanaged (VB6/VBA, ActiveX, etc.) and managed (.NET with C# or Visual Basic) Windows platforms, for chemical and mechanical engineering simulations and calculations. Note that 1 to 2 GB of available memory is sufficient for these programs. There are few graphical elements to these applications; they are mostly number-crunchers. Our company has now moved over to the Windows 7 64-bit OS, and our primary development environment is Microsoft Visual Studio 2010 with Intel Fortran Composer XE 2011. Technically, we are now in a position to build most of our applications with either the x86 (Win32) or x64 setting, and the bulk of the Fortran (2003 and 95/90) and C/C++ (mostly ANSI) code is essentially portable across the two architectures. However, the current management decision is to remain on the x86 (Win32) architecture until there is a strong push or pull for 64-bit versions.

Hence I would like to learn about the pros and cons of moving to the 64-bit (x64) architecture. Will applications that have traditionally performed robustly on Windows XP 32-bit, and which seem to run just as well under Microsoft WoW64 (32-bit Windows on 64-bit Windows) emulation, compute significantly faster or more robustly as native 64-bit programs? One limitation for us, of course, is the interface of some of the programs to Microsoft Excel, whose most stable version is still 32-bit.

Specifically for Fortran programs built with the Intel compiler, are there significant benefits to building as x64 compared to Win32? Are there white papers which summarize these benefits and provide details on the programming studies completed to arrive at the conclusions?

Thanks much in advance for all advice and comments,




11 Replies
TimP
Honored Contributor III
Remember the old saw "YMMV."
1). 16-byte alignment is the default in X64. This may speed up some moderate length vectorized loops, as well as non-vector double precision loops.
2). In some cases of more complicated loops, the compiler can take advantage of the additional registers in 64-bit mode.
3). 64-bit integer arithmetic should be much faster in X64.
As you mention, the bigger incentive for X64 is to overcome address space limitations, which do impact many engineering applications.
Do you mean that you are creating .dll plugins for Excel? That might be a reason for 32-bit compilation. If you simply feed data to Excel, even if you launch Excel from Fortran, there seems to be no reason for coupling the Excel version with your choice of compilation mode.
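On point 3, here is a minimal C sketch of the kind of code that benefits (the function is made up for illustration, not from anyone's actual application): each 64-bit operation below maps to a single native instruction in an x64 build, while a 32-bit x86 build must synthesize it from 32-bit operations with carry propagation.

```c
#include <stdint.h>

/* On x64 the 64-bit multiply and shift each compile to one native
   instruction; a 32-bit build emits a multi-instruction sequence that
   stitches the result together from 32-bit halves. */
static int64_t mix64(int64_t a, int64_t b)
{
    return a * b + (a >> 3);
}
```

Timing a hot loop of such operations in both builds is an easy way to see the difference on real hardware.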
jimdempseyatthecove
Honored Contributor III
>>Technically, we are now in a position to build most of our applications with either the x86 (Win32) or x64 setting and the bulk of Fortran (2003 and 95/90) and C/C++ (mostly ANSI) code is essentially portable across the two architectures.

Then it should be a relatively easy task to perform an x64 build and run comparison benchmarks of your x86 and x64 builds.

>>Note 1 to 2 GB of available memory is sufficient for these programs.

Then address space is not a driving factor.

>>they are mostly number-crunchers.

Is the length of time for number-crunching a major concern to your customers?
If so, then after running the x32/x64 benchmarks you will have some metrics to help you make your decision. You will not know the difference in performance (+/-) for each major component until you test. Most of the time you will see an improvement, sometimes large, typically around 15%, and on occasion a slowdown. Testing will tell.

>>large library of mixed language programs involving Fortran and C/C++ interfacing with both unmanaged (VB6/VBA, ActiveX, etc.) and managed (.NET with C# or Visual Basic)
...
>>I would like to learn about the pros and cons of moving to 64-bit (x64) architecture.

The issue management may have is that it appears you have on the order of 8 different languages/compilers involved. Their concern may be that regression testing will at least double, and initially may grow by several times that. What you may encounter is:

Outputs differ between x32 and x64. This can be due to different code paths taken in optimized code. The extra time in (initial) testing will involve determining whether the different outputs are acceptable or whether they are exposing an error in your code.

back to "1 to 2 GB of available memory is sufficient for these programs"

Wouldn't this be in the user domain as to what is sufficient?

Your biggest issue is interfacing with 32-bit Excel. MS does have a means to pass data between an x64 app and an x32 DLL. This can be used, but it will add one more thing to your to-do list.

From management's point of view, you would be adding a lot of work with no motivating reason to do so.

I would think your biggest argument might be that you can do the port now, while not under pressure, and thus be ready when the situation flips to where you must make the port. I do not know if you are old enough to remember the days of converting from 16-bit apps to 32-bit apps. Vendors could do the same thing then: run in legacy mode, saving on conversion costs, but all the while losing market to the competition. Then later, due to O/S upgrades, find that their app won't run well, or won't run at all.

If your management sees a future for your product, then they should view making the conversion now as the prudent thing to do, since you currently have the time available to assure a smooth conversion.

I would like to comment that if your 32-bit applications are written well, then conversion could be (almost) as simple as making an x64 build configuration and then Build.

...However...

I have experienced on numerous occasions that authors of 32-bit programs were less than diligent, assuming that an int is equivalent to a pointer. This oversight may introduce issues in conversion that are harder to correct than making proper type declarations. Record/struct layouts may change, and 3rd-party library calls that pass opaque arguments as integers (but which you use as pointers) may cascade into further issues. So what I am saying is: this is all the more reason to make the conversion now, while you are not under pressure.

Jim Dempsey
FortranFan
Honored Contributor III
TimP,

Thank you very much for your response.

"Do you mean that you are creating .dll plugins for Excel?" - yes, for several of the DLLs. Hence the version of Excel becomes a constraining factor in these cases.

"the bigger incentive for X64 is to overcome address space limitations, which do impact many engineering applications" - yes, one such situation came up a couple of months ago, for which VS2010 and Intel XE 2011 were very helpful in creating a 64-bit version of a major segment of our code. That's how we know most of our code will work under x64.

But I appreciate all your feedback, especially about 16-byte alignment and 64-bit arithmetic. All in all, it appears our management is right in planning for a gradual move to x64 - there are certain advantages, but the benefits may only exceed the cost of change in certain circumstances.

Regards,




FortranFan
Honored Contributor III
Hello Jim,

Thank you very much for your feedback.

"run comparison benchmarks of your x86 and x64 builds" - yes, we have done some limited comparisons and we plan to do more later this summer. From our testing thus far, we have noticed up to 35% improvement in CPU time which is encouraging. But it is not enough of an advantage yet.

And yes, regression testing is always a huge concern, and it consumes significant resources as it is, with changes forced down our throats by Microsoft and its ilk; hence there is little appetite for further change.

"Outputs differ between x32 and x64" - now this would be very worrisome to management. For a couple of common computational libraries that get used heavily in much of the engineering code and which were run under x64, we didn't notice any difference in output - these libraries employ numerical algorithms to solve various non-linear problems in physical and mechanical systems, and they follow accepted practices (ref: Numerical Recipes) to arrive at solutions within appropriate tolerances. Can you give some examples where the outputs are different?

"back to "1 to 2 GB of available memory is sufficient for these programs" Wouldn't this be in the user domain as to what is sufficient" - yes, the 1 to 2 GB memory usage is based on user feedback.

"MS does have a means to pass data between an x64 app and an x32 DLL" - is this similar to the "thunking" offered by Microsoft many years ago during the transition from 16-bit (Windows 3.1, etc.) to 32-bit with Windows 95 onward to XP? We have not looked into this, but are there options for runtime exchange between an x64 DLL and an x86 app? This would be the more common scenario for us, with Excel being the x86 app. Can you provide links to sites that explain how to do this?

"I would think your biggest argument might be that you can do the port now, while not under pressure, and thus be ready when the situation flips to you must make the port" - point well-taken, but that might be too sensible an argument for some of the Dilbertian managers!

Your comments about pointer arithmetic are very helpful:
* our usage of pointers in Fortran is limited to standard 2003/95/90 features for linked lists or trees, and for late binding of DLLs, where either the Intel POINTER extension or Fortran 2003 C function pointers are used
* the other use of pointers is in C/C++, and this could get tricky - here we would have to rely on subject matter experts and VS2010 to resolve issues as they arise.

All in all though, management is probably right in going slow thus far. The benefits of x64 may only exceed the cost-of-change in certain situations as of now. But you're right: this is the right time to plan and develop a strategy for a move to x64, while we are not "under the gun".

Regards,
jimdempseyatthecove
Honored Contributor III
>>Can you give some examples where the outputs are different?

The x32 builds have, with several versions of compilers, used FPU (x87) code (80-bit internal computation) in places where vectorization could not be used. x64 compilers, in places where vectorization is not used, tend to use SSEn instructions (32-bit or 64-bit internal computation). Thus, depending on compiler options, rounding differences may be noticed. Most of the time the calculations are correct to within the desired tolerance; however, in some cases, in particular convergence routines, you may have a convergence that occurs using a finer tolerance than can be attained with 64-bit double precision (but can be attained with the 80-bit T-byte format).
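The rounding gap is easy to reproduce in C (a generic illustration, not anyone's application code; it assumes a compiler where long double maps to the 80-bit x87 format, which is true of most x86 compilers other than Microsoft's):

```c
/* At 1e16 the spacing between adjacent doubles is 2.0, so adding 1.0 is
   lost to rounding in the 64-bit double (SSE-style) format, while the
   80-bit x87 format has a 64-bit significand and keeps it.  The
   'volatile' qualifier blocks constant folding. */
static double add_one_double(double x)
{
    volatile double v = x;
    return (v + 1.0) - x;       /* 0.0 for x = 1e16 under double rounding */
}

static long double add_one_extended(long double x)
{
    volatile long double v = x;
    return (v + 1.0L) - x;      /* 1.0 for x = 1e16 in the 80-bit format */
}
```

A convergence loop whose tolerance sits inside that gap can succeed under x87 code and stall under SSE code, which is the behavior described above.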

To control the optimizations you may have to experiment with using -fp-model ... (or /fp:...)

>>...thunk-ing...

What you can do is, from your 64-bit app (or 32-bit), use CreateProcess to run, as a separate process, a 32-bit app (or 64-bit) - in other words, an other-bitness app. This app will be your cross-bitness library-call process. This new process and your original process can then open up a memory-mapped file (the same file, at a different address within each respective process). This memory-mapped file can thus be used as a "data stack" (not to be confused with call-stack arguments). Note, someone may have done this already - so spend some time Googling.

Now, while you could write something specific to your application, I would suggest writing something general as a DLL (one per bitness, so two DLLs) that can be used to make other-memory-model DLL calls (and return from the other model).

MS has functions to load a DLL by name, returning a handle (LoadLibrary), and additional functions to find entry points given the DLL handle and the text of the entry-point name (GetProcAddress). While you cannot load the cross-bitness library yourself, the other process you create can do this for you. A potential method could be:

"push" onto shared memory "data stack" a control word (e.g. integer) you specify as "Load Library"
"push" onto shared memory "data stack" the ASCIZ (or WCHARz) string of the library name.

Note, you may design this such that the args are pushed first and the control word last (iow so calls can be stacked).

Then use an interprocess event to signal the cross-bitness library process to execute the message. You can then either wait for an event from the other process, or spin-wait for, say, the control word to change from your "Load Library" code to success or fail. On success, you would retrieve a handle to the library (known to the other process).

You would create similar functions to pass to the other-bitness mapping DLL for locating library entry points:

"push" onto shared memory "data stack" a control word you specify as "Find Library Function Entry"
"push" onto shared memory "data stack" the handle of the library (obtained earlier)
"push" onto shared memory "data stack" the ASCIZ (or WCHARz) string of the library function name.

Note, you may design this such that the args are pushed first and the control word last (iow so calls can be stacked).

The return would be success/fail, and on success you retrieve a handle (e.g. a 32-bit/64-bit function address, or an index into a table of function addresses in the other-bitness process).

Function calls from your Fortran app to the other-bitness DLL would have to take arguments passed by reference, push their values (IN, INOUT) onto the "data stack", and then pass new references to those copies of the data. On return, for OUT and INOUT arguments, you would have to copy back the updated values.

You could not make (or it would be harder to implement) calls that allocate memory for the return.
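A rough sketch of the "data stack" packing described above, in portable C. Everything here - the control-word values, the struct, the function and DLL names - is invented for illustration; in a real implementation the buffer would be a view of a memory-mapped file shared by the two processes (CreateFileMapping/MapViewOfFile on Windows).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical control words understood by the helper process. */
enum { CW_LOAD_LIBRARY = 1, CW_FIND_ENTRY = 2 };

typedef struct {
    uint8_t buf[4096];   /* stands in for the shared memory-mapped view */
    size_t  top;
} DataStack;

static void push_bytes(DataStack *s, const void *p, size_t n)
{
    assert(s->top + n <= sizeof s->buf);
    memcpy(s->buf + s->top, p, n);
    s->top += n;
}

/* ASCIZ string: the terminating NUL is pushed too. */
static void push_asciz(DataStack *s, const char *z)
{
    push_bytes(s, z, strlen(z) + 1);
}

static void push_u32(DataStack *s, uint32_t v)
{
    push_bytes(s, &v, sizeof v);
}

/* Build a "Load Library" request: argument first, control word last,
   so that requests can be stacked as described above. */
static void request_load_library(DataStack *s, const char *dllname)
{
    push_asciz(s, dllname);
    push_u32(s, CW_LOAD_LIBRARY);
}
```

After the request is packed, the originating process would signal the helper with an interprocess event; the helper pops the control word, reads the name, calls LoadLibrary in its own bitness, and writes a success/fail status and handle back onto the stack.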

This would be a sizeable project, something your management might not wish to underwrite.

Jim Dempsey
John_Campbell
New Contributor II

Tim,

You wrote :
"1). 16-byte alignment is the default in X64. This may speed up some moderate length vectorized loops, as well as non-vector double precision loops."
I am not familiar with this alignment providing some efficiency.
I have a double precision (8-byte) skyline equation solver, where all columns are stored in a single vector. Consequently, 50% of the arguments to Dot_Product would not be on 16-byte alignment.
Should I be trying to improve this percentage? (It doesn't appear easy.)
Will real*8 variables in COMMON or MODULES automatically have 16-byte alignment?
Where should I find more information on this form of optimisation?

John

TimP
Honored Contributor III
In skyline storage, I would not expect that you could do anything useful about alignment. Each time a vector loop starts (until you get to the AVX option for Sandy Bridge), the first few elements will be processed individually until the loop reaches a point of 16-byte alignment in a chosen array. Even if the loop happens to start at an aligned point, there is some overhead involved in determining it. As a consequence, it takes a somewhat longer array for vectorization to come out ahead. If the average loop length is small (say 50 or less), the /Qunroll0 option (or !dir$ unroll0 for an individual DO loop) may come out ahead by reducing the number of scalar iterations.
The AVX compilation will omit adjustment for alignment in some cases where the compiler recognizes it may be costly. As such a loop is optimized for unaligned data, it doesn't reach full speed on long aligned loops.
Yes, the arrays you define in COMMON (if you observe alignment rules, such as the legacy one of all doubles first) or in modules (barring derived types with SEQUENCE), or even local ones, will be set with 16-byte alignment.
John_Campbell
New Contributor II
Tim,

Is AVX a further extension of the vector instruction set, or is it related to OpenMP directives? What compiler options are required to achieve this?
I am expecting that AVX is available on Core i5 and similar processors. I am still using ifort Version 11, so is that a problem?
When I tried to improve performance 12 months ago, I had good success with the vector instruction set, but not with OpenMP. I now have a new Core i5, so I hope for better results when I transfer ifort to the new PC.

John
TimP
Honored Contributor III
The Core i5-2 and other "Sandy Bridge" and "Ivy Bridge" CPUs support AVX 256-bit wide registers. Vectorized operations may double the data processed per instruction. Vectorized loops without full alignment will likely be compiled with AVX-128, but may still gain over SSE by skipping alignment adjustments which are no longer needed.
AVX compilation was first supported in ifort 11.1. This has no direct influence on OpenMP, but it may increase the size of data set needed to see the advantage of OpenMP.
I've noticed on my i5-2 that parallel loops take significantly (but unpredictably) longer to get up to speed; we think this is due to power-saving features working independently on the cores. In the extreme case, it may take 10 seconds for a core to come from shutdown to full speed. When all cores are up to speed, we should see better threading efficiency than on the original Core 2.
John_Campbell
New Contributor II
Tim,

I am trying to understand the implications of your post on AVX and alignment for a skyline storage method. It would be possible to provide 16-byte or 32-byte alignment of each vector by ensuring that each column is stored starting at an appropriate memory location.

For 16-byte alignment and 8-byte reals, this requires that there be an even number of equations (columns) and that the profile length of each column be even, adding one to each odd-length column. When reducing each column by all its dependent columns, I would need to store two vectors of the active column, one offset by 8 bytes, accumulating the update to both columns as the reduction progresses.
To test whether 16-byte alignment is being achieved, I could use MOD (LOC (argument), 16), which should = 1 (or 0?).
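The even-length padding idea can be checked with a few lines of C (an illustration of the bookkeeping only; the names are made up). Note that the C equivalent of MOD(LOC(x), 16) is the address modulo 16, which is 0 at an aligned start.

```c
#include <stddef.h>

/* Round a column length (counted in 8-byte reals) up to an even count
   so that the following column also starts on a 16-byte boundary. */
static size_t pad_even(size_t n)
{
    return n + (n & 1);
}

/* Given a 16-byte-aligned base, walk a set of column lengths and report
   whether every padded column start remains 16-byte aligned. */
static int starts_stay_aligned(const size_t *len, size_t ncols)
{
    size_t offset = 0;                 /* offset in 8-byte elements */
    for (size_t i = 0; i < ncols; i++) {
        if (offset % 2 != 0)           /* 2 elements == 16 bytes */
            return 0;
        offset += pad_even(len[i]);
    }
    return 1;
}
```

The padding cost is at most one wasted real per column, which is usually negligible next to the profile storage itself.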

If 32-byte alignment is required for full 256 bit, then the approach outlined above would not be effective.

Am I correct in my interpretation of how vector alignment is required for AVX? I am assuming that both arguments to a Dot_Product should start on a 16-byte address.

I am puzzled by your comment about "vectorized loops without full alignment". Is there a partial alignment, where the processor aligns the vectors via the cache?
My interpretation is that each argument to the Dot_Product is either aligned or it isn't.

I am hoping to test this approach on a Core i5-2540M using ifort 64 Ver 11.1. Would this support AVX 256-bit wide registers?

Thanks for your advice.

John
TimP
Honored Contributor III
Alignments often confer a significant advantage in time to start up a loop, so they are particularly important for moderate length vector loops, perhaps up to length 100. The normal operation of ifort is to take some scalar iterations until the loop reaches a point of favorable alignment, and in some cases to generate multiple versions of a loop, one of the versions being optimized for both array operands to have the same alignment. In AVX code the compiler may choose to avoid the alignment steps. If the compiler knows of alignment, either by seeing how the array is declared, or by directive, it can improve the optimization. 32-byte alignment is needed to gain an alignment advantage for AVX; if the compiler doesn't know about alignment it may require more scalar steps for AVX.
Alignments are never required unless a directive is placed to cause code to be generated which requires alignment.
In a dot product, the compiler would adjust until one of the arrays is aligned, and will then allow for the other not being aligned.
The AVX compile option uses the AVX 256 bit registers for vectorization when it finds the alignment situation adequate. If not, it may choose AVX-128 code which doesn't get the full advantage which AVX-256 should have in favorable cases. The 11.1 compiler has basic AVX support, but the latest ones are improved particularly for AVX.
I'd hate to guess whether it would be worth the trouble in your skyline storage to insert padding so that each column starts at a 32-byte aligned location. If you test this, please let us know. It may even help more with 11.1 than with current compilers.