I am working on code that serves as a user subroutine for a commercial finite element package. The code is compiled to a DLL and loaded by the commercial software at run time. It consists of complex data structures and tens of subroutine calls.
Recently I modified the code to make it thread-safe (changes include making subroutines recursive, etc.).
The modified code runs as fast as the old code when compiled and run under Linux (see https://software.intel.com/en-us/comment/1799692#comment-1799692).
But under Windows, the modifications have led to a significant performance drop (~3 times slower).
I suspect the difference comes from the fact that making a subroutine recursive changes its local storage from static to stack allocation, and that this allocation causes the slowdown. What I don't understand is why the same code runs as fast as the old code under Linux but 3 times slower than the old code under Windows.
Suggestions on how to improve the performance are highly appreciated.
It seems more likely that you missed optimizations, possibly by not translating your compile options to Windows form. You should be able to compare the output of /Qopt-report:4 (Windows) and -qopt-report4 (Linux) to check this.
In addition to different optimization options, you may have mistakenly left on some runtime checks.
Can you provide the compiler options used in creating the DLL and the app that calls the DLL?
Jim Dempsey
Compiler options in Windows: /compile_only /free /debug:full /Zi /traceback /Warn:alignments /Warn:declarations /Warn:errors /Warn:general /align:dcommons /check:all /traceback /nologo /module:modules /Qzero /iface:cref
Compiler options in Linux: -c -warn alignments -warn declarations -warn errors -check bounds -g -static -module modules -fPIC -w -threads -reentrancy threaded -debug
Linker options in Windows: '/nologo', '/NOENTRY', '/INCREMENTAL:NO', '/subsystem:console', '/machine:AMD64',
'/debug', '/FIXED:NO', '/dll'
Linker options in Linux: fortCmd,
'-V',
'-cxxlib', '-fPIC', '-threads', '-shared',
'/opt/intel/Compiler/11.1/080/lib/intel64/libiomp5.a',
'%E', '-Wl,-soname,%U', '-o', '%U', '%F', '%A', '%L', '%B', '-parallel',
'-Wl,-Bdynamic', '-i-dynamic', '-lifport', '-lifcoremt', '-lmpi',
'-Wl,-Bstatic',
'-Wl,-Bdynamic',
Debug options disable optimization unless you specify it explicitly, e.g. -O2 or -O3, so it doesn't make much sense to be concerned about performance before taking care of that.
check:all may incur more overhead than check bounds, which is already costly. If you didn't need a -zero or dcommons option on Linux, it seems unlikely to be required on Windows, although this shouldn't have much effect on performance.
-align array32byte / -align:array32byte may be useful on both Windows and Linux.
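The advice above can be sketched as command lines. This is only a minimal sketch, assuming the flag lists posted earlier in this thread with the debug and run-time checking options removed and an explicit optimization level added. `usersub.f90` is a hypothetical source file name, and the script prints the commands rather than invoking the compiler.

```shell
#!/bin/sh
# SRC is a hypothetical file name (assumption, not from the thread).
SRC="${SRC:-usersub.f90}"

# Linux: the posted compile line minus -check bounds, -g and -debug, plus -O2.
LINUX_RELEASE_FLAGS="-c -O2 -warn alignments -warn declarations -warn errors -static -module modules -fPIC -w -threads -reentrancy threaded"

# Windows: the posted compile line minus /debug:full, /Zi and /check:all, plus /O2.
WIN_RELEASE_FLAGS="/compile_only /free /O2 /traceback /Warn:alignments /Warn:declarations /Warn:errors /Warn:general /align:dcommons /nologo /module:modules /Qzero /iface:cref"

# Print the commands only, so the sketch runs without a compiler installed.
printf '%s\n' "Linux:   ifort $LINUX_RELEASE_FLAGS $SRC"
printf '%s\n' "Windows: ifort $WIN_RELEASE_FLAGS $SRC"
```

Comparing the timing of a build like this against the debug build should show how much of the slowdown comes from the options rather than from the code changes.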
Thanks Tim. Your response in the VTune thread (https://software.intel.com/en-us/forums/topic/534288#comment-1802482) helped me see the part of the code that was choking and find the source of the problem.
The problem was that a large data structure was being created in one of these recursive subroutines. A workaround for that fixed the time spent on memory allocation.
It's interesting that the memory allocation didn't create much overhead on the Linux side, as the code is identical.
Thanks,
Alireza
A /debug build loads the debug version of the CRTL heap manager. Under Windows it does a lot of checking on allocations and deallocations. Linux may not be as aggressive.
Jim Dempsey
Thanks, Jim, for the response. Removing the debug option and modifying the one subroutine that was doing a lot of memory allocation made the code run very fast.
Alireza
I tried the -O2 optimization option (Linux), and now my FE software returns a runtime error saying:
undefined symbol: _intel_fast_memmove
Considering that our customers do not necessarily have Intel compilers installed on their machines/clusters, I wonder if I need to statically link some libraries into our .so/.dll file?
I also wonder which libraries need to be linked.
Thanks,
Alireza
Yes, you need to provide Intel redistributables. See https://software.intel.com/en-us/articles/intelr-composer-redistributable-libraries-by-version
Thanks, Steve, for the response. So when linking, should I include all the .dll/.so files in the redist/mkl folder?
Also, to avoid shipping the redist libraries, I can link them statically, right? What is the downside of that, apart from a larger .dll file?
Thanks,
Alireza
Well, it depends on which libraries you're using. For example, the OpenMP library is provided in DLL form only. If you can link statically, it will remove the dependence on other Intel (and MSVC) DLLs.
You should use DependencyWalker on your DLL to see which DLLs you need. If you're not going to use the compiler's redist installer (which does not include MKL), then copy only the DLLs referenced. But keep in mind that these will need to be on PATH, not just in the same folder as your DLL.
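On the Linux side, a rough analogue of the DependencyWalker check is `ldd`, and `-static-intel` is the ifort option that asks for the Intel-provided libraries to be linked statically. The following is a minimal sketch under those assumptions. `libusersub.so` is a hypothetical library name, and whether static linking suits the OpenMP runtime is a separate question this thread does not settle.

```shell
#!/bin/sh
# LIB is a hypothetical path (assumption); set LIB in the environment to override.
LIB="${LIB:-./libusersub.so}"

# Link flags for a shared library with Intel-provided libraries linked statically.
STATIC_LINK_FLAGS="-shared -fPIC -static-intel"
echo "link flags to try: $STATIC_LINK_FLAGS"

if [ -f "$LIB" ]; then
    # Shared libraries the loader must locate at run time.
    ldd "$LIB"
    # Undefined dynamic symbols whose names mention "intel",
    # such as _intel_fast_memmove; no output means none were found.
    nm -D --undefined-only "$LIB" | grep -i intel || true
else
    echo "no library at $LIB; skipping ldd/nm"
fi
```

If the `nm` line still lists `_intel_fast_memmove` after relinking, the symbol is still expected from a shared Intel library, which would then have to be shipped with the product.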
