I've just read this thread, and have also heard by word of mouth, that the Intel compiler does better with SSE intrinsics than MSVC. We currently use VS2005 and support SSE2 and higher only. The MSVC compiler does a less-than-stellar job of avoiding XMM register spills, etc.
I am wondering if there's a nice how-to on using the Intel compiler with the VS2005 IDE, or at least using the Intel compiler for the SIMD-critical portions of our code.
I currently have the latest evaluation version of the Intel compiler.
Thanks,
William
Apparently I should've looked in the help docs (or even just opened up VS after installation). I'll get back to you. :)
Thanks,
William
Ok, after conversion I had some linker errors, as well as a bunch of internal compiler errors (ICEs) in one of our vcprojs (now icprojs), but overall it was a pretty painless procedure. The linker errors are fixed, but the ICEs couldn't be avoided, so I'm just avoiding those projects for now.
Anyhow, at runtime I'm getting a misaligned vector load exception. Here is the code causing it. If I remove the static const, it works, but this is clearly not optimal. Any ideas why it's not aligning the global vector? (This works in VS2005.)
template <u32 floatAsIntX, u32 floatAsIntY, u32 floatAsIntZ, u32 floatAsIntW>
__forceinline const __m128 VecConstant()
{
    static const __declspec(align(16)) u32 s_vect[4] = { floatAsIntX, floatAsIntY, floatAsIntZ, floatAsIntW };
    return *(__m128*)(&s_vect);
}
EDIT: It seems to be __declspec() getting ignored, since aligned-vector members of other classes are not getting aligned either.
This is horrible, but it seems that moving the __declspec(align(16)) before the "static" keyword fixed the alignment for that one particular case... :-/ I'm still working on the other misalignments, which may be due to our own allocator.
Sorry for being a one-man thread lately. I'll post problems as they arise.
Ok, the problem seems to be the following (compiler bug?):
pPtr = new CMyClass[m_amount];
where CMyClass has the __declspec(align(16)) specifier (and additionally has a 16-byte-aligned object as its first member). CMyClass has zero virtual functions. In the CMyClass constructor, the first member is set to zero, which calls _mm_setzero_ps() internally, causing a misalignment exception -- that first member isn't aligned properly; everything is shifted over by one word.
The problem is with the new[] operator. The word stored at the beginning of the allocation (to hold the size of the array) isn't 16-byte-aligned, and it pushes everything to the right by 1 word instead of by 4 words.
Has no one else discovered this problem?
EDIT: Anyhow, I got around it using _aligned_malloc() + placement new for now, but a fix would be better, of course.
Tim, the problem is that the Intel compiler has a bug (which I happened to report a long time ago, issue #458830) where, even if you use placement new and aligned malloc(), you still do not get aligned memory back when allocating with new[], because the compiler stores the number of array elements at the beginning of the allocated memory and increments the pointer it returns. Try compiling the attached test.cpp with MSVC and then with ICC. Note that support says the issue has been fixed, but obviously the fix is not available in 10.1.021.
To cut a long story short, I would also like to see that bug resolved as quickly as possible instead of having to write additional code. We have already been waiting three months for that fix, and that is mighty slow for such a serious showstopper.
William, if you want to compare the code generated by MSVC and ICC from your intrinsics, it is very important that you set /arch:SSE2 (in addition to /O2 or /Ox) for the MSVC compiler; otherwise you will get suboptimal code, because the compiler will still use x87 floating point and perform some needless type conversions.
Hi,
The most efficient way to overload new and delete to ensure alignment in C++ classes is given below. In terms of performance, _mm_malloc works better than _aligned_malloc, as the behaviour of _mm_malloc is well known to the Intel compiler.
Best Regards,
Lars Petter Endresen
//==================================================================
// Fix to make _mm_ functions work within classes in Microsoft C++
#include <xmmintrin.h>  // for _mm_malloc / _mm_free
#define _aligned_free(a) _mm_free(a)
#define _aligned_malloc(a, b) _mm_malloc(a, b)
void* operator new(size_t bytes) { return _mm_malloc(bytes,16); }
void* operator new[](size_t bytes) { return _mm_malloc(bytes,16); }
void operator delete(void* ptr) { _mm_free(ptr); }
void operator delete[](void* ptr) { _mm_free(ptr); }
//==================================================================
Lars, with all due respect, you have completely disregarded what I wrote above. There is a bug in the Intel compiler that prevents what you are suggesting from working correctly.
Dear Igor,
Sorry, I did not mean to disregard your point. Indeed, your sample code crashes, and I agree that this should be fixed. However, I have never experienced this problem myself, as I chose to implement things in a different manner, so for me my overload trick works every time! Also, I noticed that removing the destructor in your code resolved the problem; I cannot figure out why. Can you? If you implement new and delete yourself anyway, maybe you can avoid the destructor altogether? This should fix your problem.
The topic of this thread is "VS2005 vs. Intel C++ Compiler, w.r.t. SSE2+", and the Intel C++ Compiler has so many advantages over VS2005 that your particular problem may seem a little out of focus (as you know, all compilers have known limitations). To address the focus of this topic, try to compile the code below with Intel and Microsoft C++. With Intel C++ 10.0 this code is automatically vectorized, parallelized, unrolled, jammed and cache blocked. The result is astonishing [Intel Core2 CPU 6400 @ 2.13GHz]:
- VS2005 C++: 75.29 seconds
- Intel 10.0 C++: 9.09 seconds
Compiler options used are:
- Microsoft C++: "/O2 /fp:fast /arch:SSE2 /STACK:1000000000"
- Intel C++: "-Qipo -O3 -QxT -Qparallel /STACK:1000000000"
Best Regards,
Lars Petter Endresen
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define SIZE 4000
int main()
{
    int i, j, k;
    float a[SIZE][SIZE], b[SIZE][SIZE], c[SIZE][SIZE];
    clock_t start, finish;
    for (i = 0; i < SIZE; i++) {
        for (j = 0; j < SIZE; j++) {
            b[i][j] = a[i][j] = (float)rand() / RAND_MAX;
            c[i][j] = 0.0f;
        }
    }
    start = clock();
    for (i = 0; i < SIZE; i++)
        for (j = 0; j < SIZE; j++)
            for (k = 0; k < SIZE; k++)
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
    finish = clock();
    printf("%f %f\n", (float)(finish - start) / CLOCKS_PER_SEC, c[0][0]);
}
Also, I noticed that removing the destructor in your code resolved the problem, I cannot figure out why? Can you? If you anyway implement new and delete yourself, maybe you can avoid the destructor altogether?
Removing the destructor should not make any difference in that code; the fact that it does suggests that the bug may be more complex than I described.
As for new and delete, my point is that I should not have to (and other people shouldn't have to) implement new and delete to get alignment to work.
In my opinion, the compiler should honor __declspec(align(16)) if present, even if you allocate the object via new[], or, even better, properly align all known data types (including __m128) without the need for any explicit alignment directives.
As for avoiding the destructor: what if you need to derive a class which also has to do some housekeeping of its own? What if you need a virtual destructor? Sorry, removing it is not a solution, just a workaround, and a lousy one at best, because it might break in the future.
As for the topic of this thread, I am surprised that you find my posting here off-topic, yet you seem to have missed the topic yourself. He asked how much better ICC is compared to MSVC with regard to SIMD intrinsic code generation, and I explained how to get the best code from SIMD intrinsics using MSVC so he could do a fair comparison on his own. I also warned him about the bug because it seemed from his last post that he was affected too.
A number of other advantages of Intel C++ over Microsoft C++:
1. Intel C++ is more ISO standard conformant than Microsoft C++.
2. Intel C++ gives more warnings than Microsoft C++.
3. Intel C++ is platform independent; the same compiler is available for Windows, Linux and Mac. Microsoft C++ is not.
4. Intel C++ contains a useful code coverage tool. Microsoft C++ does not.
5. Intel C++ gives up to 10x speedup relative to Microsoft C++, typically in floating point or multimedia intensive code.
6. Intel C++ supports SIMD calculations. Microsoft C++ does not.
7. Intel C++ is reported to be used by (surprise...) Microsoft themselves.
8. Intel C++ has special compiler switches to trap uninitialized variables at runtime. Microsoft C++ does not.
9. Intel C++ optimizes better for size than Microsoft C++, as SIMD instructions can reduce code size too.
10. Intel C++ with automatic vectorization is automatically adapted to the future AVX (http://softwareprojects.intel.com/avx/) 256-bit instruction set. Microsoft C++ does not support automatic vectorization.
I think I disagree fundamentally with your software development philosophy, Igor; an engineer should rather adapt to the tools at hand than look for situations in which the tools fail. If you have 2 hours to complete a software release, you had better remove your destructor and take the pragmatic approach to make things work - all tools have a vast number of limitations, and investigating all of them may delay any software project to the point where your customers become very unhappy. Writing a C++ compiler is notoriously difficult, in particular if you desire both maximum performance and correct behaviour according to the language standard. This is particularly true for some of the more advanced topics in the C++ language standard, like templates and polymorphism. So, to achieve maximum performance, remember to follow two simple rules:
- Write compiler friendly code.
- Use the right compiler.
When it comes to code involving floating point instructions like "mulps" and "divps", I am a little surprised that developers still hand-optimize code using intrinsics; why not rely on automatic vectorization here? Then you do not need to rewrite your software every time the SIMD vector length is increased. Today the SIMD vector length is 128 bits, with AVX it will be 256 bits, and later it will be extended to 512 and 1024 bits (Vector Future FP support to 512 bits and even 1024 bits). However, there are situations where SSEx intrinsics may be unavoidable, and thus I support your point that the alignment issue should be fixed. I did some more testing, and I found that the failure is seen only when the destructor is present - maybe the destructor is the origin of the problem?
Best Regards,
Lars Petter Endresen
OK, let's see which claims are bogus:
2. Serious developers use lint anyway, so a chatty compiler isn't an advantage.
7. Saying this without citing the source is just a rumor.
8. In debug mode, all uninitialized variables are trapped anyway.
9. This is not true. MSVC generates smaller code and uses less memory for constants.
10. It is not fair to compare this, because the AVX specification has only just appeared in public, while Intel had it internally for quite some time.
Furthermore, saying that MSVC doesn't support SIMD is misleading. You can use intrinsics with equal success in MSVC.
I think that I disagree fundamentally with your software development philosophy Igor, as an engineer better should adapt to the tools at hand than to look for situations in which the tools fail.
Lars, your logic is flawed. I wasn't actively looking for this failure; I stumbled upon it because my perfectly legal and moral C++ code crashed. I have a workaround, but I want a permanent fix, because if I rely on a workaround, the workaround itself might stop working when the fix is introduced.
If you have 2 hours to complete a software release, you better remove your destructor and take the pragmatical approach to make things work
You still haven't answered my question: what if I need the destructor? Did it cross your mind that the code sample I gave here and on Premier Support is vastly simplified?
As for "writing compiler friendly code", that's a two-edged sword. What is friendly for one compiler might not be friendly at all for another.
I am a little surprised that developers still hand optimize code using intrinsics, why not rely on automatic vectorization here?
It is actually very simple:
- Compiler cannot vectorize everything (try type conversions for example or code like a[b]).
- Code written with intrinsics will work almost as fast if you compile it using MSVC.
- New compiler versions may introduce regressions where something that vectorized earlier does not vectorize anymore, or it suddenly has lower performance because engineers had to make a trade-off somewhere.
I did some more testing, and I found that the failure is seen only when the destructor is present - maybe the destructor is the origin of the problem?
If you had bothered to read my post about the bug more carefully, or at least to run that sample code in a debugger, you would have noticed that the compiler stores the number of array elements at the beginning of the allocated memory, then increments the pointer and passes that incremented (and thus unaligned) pointer when it returns from new[].
That number is used so the compiler knows how many times it has to invoke the destructor. If you remove the destructor then most likely the number of array elements does not get stored and the alignment stays correct.
EDIT:
Lars, I am having trouble getting your test code to work with ICC 10.1.021. It compiles, but it crashes with exception 0xC00000FD even though I passed the /STACK option. The same happens if I remove /STACK, compile to an object file, and then link separately with /STACK.
> Code written with intrinsics will work almost as fast if you compile it using MSVC.
Igor, I challenge you to beat the 4000x4000 float matrix multiplication I posted earlier in this thread using any tool you want: MSVC, intrinsics or inline assembly. If you come even close to the performance of Intel C++ I will be utterly surprised, because even the hand-optimized Intel MKL library is slower. Oh, there are many such examples where no human being, in a limited amount of time, can beat the Intel Fortran or C++ compiler. Nor do most people want to dig so deep down into intrinsics or assembly code to solve such standard tasks as matrix multiplication!
BTW, I would suggest reading the excellent manuals of Agner Fog before you start...
Best Regards,
Lars Petter
> It compiles but it crashes with exception 0xC00000FD even though I passed /STACK option.
Sorry for the confusion Igor, this is an option to the linker.
Then I guess I am at a slight advantage, because I read those manuals a long time ago.
By the way, I know that /STACK is a linker option. However, code compiled with ICC 10.1.021 still crashes with said exception. Any ideas?
IgorLevicki: As for avoiding the destructor, what if you need to derive a class which also has to do some house-keeping of its own? What if you need a virtual destructor? Sorry, removing it is not a solution, just a workaround and a lousy one at best because it might break in the future.
My experience is that advanced C++ and high performance are not always good friends. For the sake of software structure you may use advanced language features like inheritance, virtual functions and templates, but if you are not careful, this may also lead to poor performance - it is a common misconception among C++ programmers that C++ is actually so useful in supercomputing. They have all been misled by Bjarne Stroustrup's The C++ Programming Language (2007). He states that "Consequently, the standard library provides a vector - called valarray - designed specifically for speed of the usual numeric vector operations." Following his code (pp. 662-674):
#include <cstdio>
#include <cstdlib>
#include <ctime>
#include <valarray>
#define SIZE 4000

template<class T> class Slice_iter {
    std::valarray<T>* v;
    std::slice s;
    size_t curr;
    T& ref(size_t i) const { return (*v)[s.start() + i * s.stride()]; }
public:
    Slice_iter(std::valarray<T>* vv, std::slice ss) : v(vv), s(ss), curr(0) {}
    T& operator[](size_t i) { return ref(i); }
    T& operator[](size_t i) const { return ref(i); }
};

template<class T> class Cslice_iter {
    std::valarray<T>* v;
    std::slice s;
    size_t curr;
    T& ref(size_t i) const { return (*v)[s.start() + i * s.stride()]; }
public:
    Cslice_iter(std::valarray<T>* vv, std::slice ss) : v(vv), s(ss), curr(0) {}
    T& operator[](size_t i) const { return ref(i); }
};

class Matrix {
    std::valarray<float>* v;
    size_t r, c;
public:
    Matrix(size_t x, size_t y) { r = x; c = y; v = new std::valarray<float>(0.f, x * y); }
    Slice_iter<float> row(size_t i) { return Slice_iter<float>(v, std::slice(i * c, c, 1)); }
    Cslice_iter<float> row(size_t i) const { return Cslice_iter<float>(v, std::slice(i * c, c, 1)); }
    Slice_iter<float> operator[](size_t i) { return row(i); }
    Cslice_iter<float> operator[](size_t i) const { return row(i); }
};

int main()
{
    int i, j, k;
    Matrix a(SIZE, SIZE), b(SIZE, SIZE), c(SIZE, SIZE);
    clock_t start, finish;
    for (i = 0; i < SIZE; i++) {
        for (j = 0; j < SIZE; j++) {
            b[i][j] = a[i][j] = (float)rand() / RAND_MAX;
            c[i][j] = 0.0f;
        }
    }
    start = clock();
    for (i = 0; i < SIZE; i++)
        for (j = 0; j < SIZE; j++)
            for (k = 0; k < SIZE; k++)
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
    finish = clock();
    printf("%f %f\n", (float)(finish - start) / CLOCKS_PER_SEC, c[0][0]);
}
extends the 4000x4000 float matrix multiplication simulation time to 620.2 seconds, even with the latest Intel C++ 10.0 compiler. I have seen developers arguing that the Intel C++ Compiler is no better than other compilers when the actual problem is that the software they have written is completely resistant to any compiler optimizations, exactly like the Matrix class above suggested by Bjarne Stroustrup.
Best Regards,
Lars Petter Endresen.
Sorry for the confusion, Igor - you need to set a global environment variable KMP_STACKSIZE to 512m. When you have an executable that runs, please report back the fastest of 5 successive runs, because the program runs better when the caches are warm.
To help you out, Igor, I have posted a showstopper to Premier Support: "Cannot align class with destructor", Issue Number 481167. My experience with Premier Support is that they usually resolve issues quickly, in particular if I manage to convince them that the issue is of importance for the general public. I never write any code with destructors; that's why I never saw your crash, Igor (I like classes and the many other features of C++, like copy constructors and in particular the composition paradigm, which is good for performance).
He states that "Consequently, the standard library provides a vector - called valarray - designed specifically for speed of the usual numeric vector operations."
Well, I haven't tested those and I do not intend to use them. I have noticed, however, that ICC generally has poorer performance than even MSVC when working with templates and some of the more bizarre aspects of C++, such as that array type you are mentioning.
I have seen developers arguing that the Intel C++ Compiler is no better than other compilers when the actual problem is that the software they have written is completely resistant to any compiler optimizations, exactly like the Matrix above suggested by Bjarne Stroustrup.
And they are most likely right: not only is it not better, it seems to be worse.
Sorry for the confusion Igor - you need to set a global environment variable KMP_STACKSIZE to 512m.
I solved that by making the arrays global.
To help you out Igor, I have posted a showstopper to premier support "Cannot align class with destructor Issue Number 481167".
It annoys me to no end when someone skims over my posts when I take so much time to write them as precisely as possible... Lars, why haven't you read my first post? I already reported that alignment problem as a showstopper three months ago, and now they have two reports, which may actually delay the fix until they figure out it is a duplicate.
As for your code sample, MSVC takes 61.67 seconds; ICC takes 7.55 seconds on a Core 2 Duo E8200 (2.66 GHz).
A simple change (which took me less than 2 minutes) in the assembler code generated by MSVC, from:
    lea    edi, DWORD PTR [edx*4]
    npad   3
$LL20@mat_mul_c:
; Line 17
    movss  xmm0, DWORD PTR [ecx]
    mulss  xmm0, DWORD PTR [esi]
    addss  xmm0, DWORD PTR [eax]
    movss  DWORD PTR [eax], xmm0
    movss  xmm0, DWORD PTR [ecx+4]
    mulss  xmm0, DWORD PTR [esi]
    addss  xmm0, DWORD PTR [eax+4]
    movss  DWORD PTR [eax+4], xmm0
    movss  xmm0, DWORD PTR [ecx+8]
    mulss  xmm0, DWORD PTR [esi]
    addss  xmm0, DWORD PTR [eax+8]
    movss  DWORD PTR [eax+8], xmm0
    movss  xmm0, DWORD PTR [ecx+12]
    mulss  xmm0, DWORD PTR [esi]
    addss  xmm0, DWORD PTR [eax+12]
    movss  DWORD PTR [eax+12], xmm0
To:
    lea    edi, DWORD PTR [edx*4]
    movss  xmm1, DWORD PTR [esi]
    shufps xmm1, xmm1, 0
$LL20@mat_mul_c:
; Line 17
    movaps xmm0, XMMWORD PTR [ecx]
    mulps  xmm0, xmm1
    addps  xmm0, XMMWORD PTR [eax]
    movaps XMMWORD PTR [eax], xmm0
This brings the time down from 61.67 to 42.64 seconds, and it is still single-threaded and not cache optimized. Should I continue?
IgorLevicki: And they are most likely right: not only is it not better, it seems to be worse.
Let us put ourselves in the shoes of the Intel compiler team - should they honor stupid people writing stupid software, or smart people writing smart software? I mean, they have limited resources and must make some choices: what is the most important feature of our compiler? A couple of years ago they "merged" the advanced optimizations of Compaq Visual Fortran, first into Intel Visual Fortran and then into Intel C++, meaning that similar semantics in C++ and Fortran give almost exactly the same assembly code and performance (try it yourselves!), in particular if you write Fortran-style C code using features in C99 like restrict. But this is not the approach of most C++ programmers, who have been misled by many C++ books like "Effective C++" and "More Effective C++".
To summarize, depending on the problem at hand, use:
- Intel C++ without intrinsics,
- Intel C++ with intrinsics,
- Microsoft C++ with/without intrinsics.
In large software projects I prefer to use Microsoft C++ for the code that is not performance critical, because it compiles so quickly, but I try to write as much of the performance critical code as possible without using intrinsics. My experience is also that many well-written codes written for other platforms or systems give excellent performance "out of the box" with Intel C++, because many developers have known for a long time what is meant by "compiler friendly".
You do not need to improve the matrix multiplication any more; my point is that in many cases it is quite difficult to match the compiler with your own code (implementing unroll-and-jam may also be a laborious process...).
Best Regards,
Lars Petter Endresen
Igor,
Take a look at an independent Fortran compiler benchmark here; you'll easily figure out which compiler is the preferred choice. If someone took the effort to translate all these benchmarks to C99 using restrict, the performance advantage of Intel C++ would be the same - I do not think Microsoft C++ would be able to compete in this "formula one league" of supercomputing compilers.
Best Regards,
Lars Petter Endresen
Times in seconds (lower is better):

| Benchmark | Absoft 10.0.8 | ftn95 5.20.0 | g95 0.91 | gfortran 4.3.0 | intel 10.1.011 | Lahey 7.10.0 | Nag 5.0 | pgi 7.1-6 |
| AC | 27.57 | 29.86 | 31.76 | 19.61 | 11.78 | 33.84 | 38.49 | 34.22 |
| AERMOD | 39.06 | 78.82 | 96.88 | 52.32 | 35.54 | 60.81 | 87.58 | 41.99 |
| AIR | 13.93 | 29.48 | 18.12 | 15.89 | 10.61 | 19.72 | 16.13 | 15.00 |
| CAPACITA | 69.47 | 121.63 | 84.06 | 61.80 | 65.88 | 95.49 | 88.95 | 68.31 |
| CHANNEL | 6.69 | 11.50 | 22.66 | 3.27 | 3.72 | 7.73 | 6.63 | 3.78 |
| DODUC | 74.19 | 127.84 | 86.29 | 74.35 | 51.78 | 89.10 | 113.25 | 72.95 |
| FATIGUE | 13.34 | 37.07 | 98.65 | 21.21 | 13.93 | 27.05 | 31.93 | 18.40 |
| GAS_DYN | 7.87 | 69.73 | 69.04 | 14.33 | 4.93 | 23.70 | 71.43 | 53.50 |
| INDUCT | 95.03 | 182.72 | 111.08 | 92.63 | 84.99 | 170.05 | 160.80 | 83.64 |
| LINPK | 25.34 | 25.78 | 26.33 | 25.41 | 25.31 | 25.81 | 25.60 | 25.94 |
| MDBX | 24.83 | 55.41 | 24.09 | 21.81 | 23.31 | 38.84 | 26.99 | 23.34 |
| NF | 33.71 | 57.03 | 59.16 | 37.28 | 28.96 | 45.31 | 36.42 | 32.34 |
| PROTEIN | 58.98 | 116.69 | 106.97 | 62.35 | 58.04 | 96.71 | 85.88 | 73.88 |
| RNFLOW | 44.42 | 54.12 | 51.04 | 40.52 | 50.44 | 45.31 | 55.05 | 52.43 |
| TEST_FPU | 19.81 | 30.70 | 33.91 | 17.36 | 14.79 | 20.99 | 22.18 | 17.63 |
| TFFT | 3.95 | 5.50 | 4.27 | 3.67 | 3.56 | 3.96 | 4.34 | 4.05 |
| Geometric Mean | 24.94 | 46.68 | 44.11 | 24.98 | 20.35 | 34.89 | 37.49 | 28.42 |

Compiler switches used:
- Absoft: f95 -V -m32 -Ofast -speed_math=9 -WOPT:if_conv=off -LNO:fu=9:full_unroll_size=7000 -march=em64t -H60 -xINTEGER -stack:0x8000000
- FTN95: ftn95 /p6 /optimise (slink was used to increase the stack size)
- g95: g95 -march=nocona -ffast-math -funroll-loops -O3
- gfortran: gfortran -march=native -funroll-loops -O3
- Intel: ifort /O3 /Qipo /QxT /Qprec-div- /link /stack:64000000
- Lahey: lf95 -inline (35) -o1 -sse2 -nstchk -tp4 -ntrace -unroll (6) -zfm
- NAG: f95 -O4 -V
- PGI: pgf90 -Bstatic -V -fastsse -Munroll=n:4 -Mipa=fast,inline -tp core2