I compiled a C++ function with and without MMX enabled. When I call the function on a XEON processor at 2000 MHz, I find the MMX version takes about 2700 clock cycles while the same function without MMX takes only 420 cycles. On a Pentium III at 1000 MHz it takes about 370 cycles for the non-MMX version and 200 cycles for the MMX version. The behavior is nearly the same for a Celeron at 333 MHz or a Pentium IV at 1700 MHz. My question is: why does the MMX version perform so poorly on the XEON processor?
The operating system on all machines is Windows XP Professional. The machine with the XEON has two processors. Could this be the reason?
The code used to count the cycles was the following:
inline unsigned __int64 ReadTSC()
{
    unsigned __int64 value;
    _asm {
        rdtsc
        mov dword ptr value, eax
        mov dword ptr value + 4, edx
    }
    return value;
}

void CppOpt(char* i1, char* i2, char* res)
{
    #pragma ivdep
    #pragma vector aligned
    for(int i = 0; i < 64; ++i)
    {
        if(i2[i] != 0)
            res[i] = i2[i];
        else
            res[i] = i1[i];
    }
}

// buffer declarations were missing from the original post
__declspec(align(16)) char i1[64], i2[64], res[64];

void main()
{
    unsigned __int64 start = ReadTSC();
    CppOpt(i1, i2, res);
    unsigned __int64 end = ReadTSC();
    end -= start;
}
I'm using version 6 of the Intel C++ Compiler but I tried coding the function in assembly and the results were the same.
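For reference, the per-byte if/else in CppOpt is exactly the pattern a vectorizer turns into a compare-and-select. A scalar sketch of that branchless form (a hypothetical helper in modern C++, not the poster's code or the 2003 toolchain above) shows the idea that PCMPEQB/PAND/PANDN express in MMX:

```cpp
#include <cassert>
#include <cstdint>

// Branchless equivalent of: res[i] = (i2[i] != 0) ? i2[i] : i1[i].
// A compare produces an all-ones or all-zeros byte mask, which then
// blends the two inputs without any branch in the loop body.
void SelectNonZero(const uint8_t* i1, const uint8_t* i2, uint8_t* res, int n)
{
    for (int i = 0; i < n; ++i) {
        uint8_t mask = (i2[i] != 0) ? 0xFF : 0x00;  // compare -> byte mask
        res[i] = (i2[i] & mask) | (i1[i] & ~mask);  // mask-based blend
    }
}
```

Applied 8 or 16 bytes at a time with SIMD registers, this is what the vectorized code paths discussed below generate.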
9 Replies
IMO your benchmark methodology is the likely explanation. You have to:
- initialize the arrays with interesting patterns
- run the test for a sufficiently long period
Out of curiosity I've tested your example (slightly modified, see below) and I get 291-303 clocks on a P4 2800 with the MMX code path (compiled with the latest icl 7.1).
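The two points above can be sketched as a small harness (a generic sketch, not the attached code; std::chrono::steady_clock stands in for RDTSC so it stays portable, and the names are hypothetical):

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>

// Time a single call to fn(), in nanoseconds.
template <typename F>
int64_t TimeOnce(F fn)
{
    auto t0 = std::chrono::steady_clock::now();
    fn();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
}

// Run the test for a sufficiently long period and keep the minimum:
// the minimum over many runs filters out interrupts, context switches,
// and cold caches, which a single timestamp pair cannot.
template <typename F>
int64_t BestOf(F fn, int runs = 100000)
{
    int64_t best = TimeOnce(fn);            // first call doubles as warm-up
    for (int i = 1; i < runs; ++i)
        best = std::min(best, TimeOnce(fn));
    return best;
}
```

Initializing the input arrays with varied ("interesting") patterns matters too, since all-zero inputs can make the branchy version look unrealistically predictable.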
After more experiments with your code, I've found a source-level reformulation that leads to some speedups; see the attached code. I now get 228-237 clocks with the MMX version.
Worth noting: the SSE2 code path (-QxW) provides no significant speedup.
> after more experiments with your code, I've found a
> source-level reformulation leading to some speedups,
> see attached code. I now get 228-237 clock with the
> MMX version.
>
> Worth noting : the SSE2 code path (-QxW) provides no
> significant speedup
I tried your code on my Pentium IV 2.0 GHz, W2k SP3.
Results:
The MMX version (-QxM) is faster than the SSE (-QxK) one
(165-171 clocks vs. 202-203).
The SSE2 version (-QxW) crashes with the exception "Privileged instruction (0xc0000096)" at 0x0040100f:
movdqa xmm4, XMMWORD PTR [eax] ;23.14
After a more careful analysis of the ASM I have observed that the actual test function was not called from the test loop; another *less efficient* version was inlined instead (!)
the improved program spits out more interesting scores :
Test platform :
- P4 2800/i850/2xPC800
- Windows XP Pro
- ICL 7.1 (Build 20030424Z)
- Common compile flags : -O2 -G7
Scores :
base : 538-548
-Qxi : 539-547
-QxM : 64-65
-QxK : 64-66
-QxW : 41
> SSE2 version (-QxW) crashes with The exception
probably a lack of alignment to 16-byte boundaries; fixed in the new code
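On the alignment point: movdqa faults on any memory operand that is not 16-byte aligned, so the buffers themselves must be aligned, not just annotated with "#pragma vector aligned". A minimal illustration in modern C++ (alignas is C++11; the era-appropriate spelling with ICL 7.1 was __declspec(align(16)); the helper name is made up):

```cpp
#include <cassert>
#include <cstdint>

// 16-byte-aligned buffers, as movdqa and "#pragma vector aligned" require.
alignas(16) static char i1[64];
alignas(16) static char i2[64];
alignas(16) static char res[64];

// Check that a pointer sits on a 16-byte boundary.
bool IsAligned16(const void* p)
{
    return (reinterpret_cast<uintptr_t>(p) % 16) == 0;
}
```

An unaligned operand with movdqa typically surfaces as a general-protection fault, which Windows can report under a misleading exception name, consistent with the "Privileged instruction" crash quoted above.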
Really very interesting test.
I played a little with the CppOpt function.
The results are rather strange.
When this function is a free one (not a class member) or
a virtual member function, the loop is vectorized
(corresponding times are 40-60).
But when the same function is a non-virtual or static member of a class, there is no vectorization at all
(times ~ 470).
What is the reason for such a difference?
> Really very interesting test.
Indeed, more than a 10x speedup from compile flags alone is quite impressive; it's an ideal case for branch elimination and SIMD computations.
> When this function is free one (not a member of
> class) or
> virtual member function the loop is vectorized
In my tests with icl 7.1, whenever the function is free (declared static or not, defined before main() or forward-declared and defined after main(), with or without an explicit "inline"), it is always inlined into the test loop with non-vectorized code. When it is visible from the outside (i.e. non-static), another version is generated (this one is vectorized for MMX targets and up) but never called from the main program.
Moreover, when it was a regular member function of some class, or even a virtual function, it was always inlined without vectorization; only the trick with the instance ptr + virtual method enforces an actual call to the vectorized code.
> What is the reason of a such difference?
At this stage of the analysis it looks very much like an issue with the optimizer in ICL 7.1: the inlining "optimization" somewhat cancels out the vectorization optimization.
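One portable way to get the same effect as the "instance ptr + virtual method" trick, without a class, is to call the kernel through a volatile function pointer: the optimizer cannot inline through it, so the out-of-line (vectorizable) body is what actually runs in the timing loop. A sketch under that assumption (scalar body shown; the names are hypothetical, not the attached code):

```cpp
#include <cassert>

// The standalone version of the kernel, which the compiler may vectorize.
void CppOpt(char* i1, char* i2, char* res)
{
    for (int i = 0; i < 64; ++i)
        res[i] = (i2[i] != 0) ? i2[i] : i1[i];
}

// A volatile function pointer forces a real indirect call at every use,
// so inlining cannot substitute a different (non-vectorized) body.
void (*volatile kernel)(char*, char*, char*) = CppOpt;
```

Calling kernel(i1, i2, res) in the benchmark loop then measures the out-of-line code rather than whatever the inliner produced.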
Thank you for your help, but I still can't understand why my code runs 10 times slower on a XEON processor than on a Pentium III or Pentium IV. Here are the benchmarks I got:
Processor Cycles
PentiumII 400 MHz --------- 261
PentiumIII 1100 MHz ------- 169
PentiumIV 1700 MHz -------- 308
XEON 2000 MHz ------------ 2724
To get these results, I ran the code about 1,000,000 times and calculated the mean cycle count.
Have you tested with my latest example above (v0.3)?
Which compiler + flags do you use?
brisemec, look at the end of this thread for a possible explanation :
RDTSC on SMP rigs
Using calls like QueryPerformanceCounter may fix the problem; I'll be interested to hear about your findings.
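The SMP failure mode is easy to test for: with unsynchronized per-CPU TSCs, a thread that migrates between processors can see a later reading come back smaller than an earlier one, so a basic sanity check on any timestamp source is that consecutive readings never go backwards. QueryPerformanceCounter on Windows, or portably std::chrono::steady_clock in modern C++, guarantees that property by contract. A sketch of the check (hypothetical helper name):

```cpp
#include <chrono>

// Sanity-check a timestamp source: consecutive readings must never
// decrease. Raw RDTSC can fail this on SMP boxes when the thread
// migrates between CPUs; steady_clock is monotonic by contract.
bool IsMonotonic(int samples = 10000)
{
    auto prev = std::chrono::steady_clock::now();
    for (int i = 0; i < samples; ++i) {
        auto cur = std::chrono::steady_clock::now();
        if (cur < prev) return false;
        prev = cur;
    }
    return true;
}
```

A Xeon-specific wrinkle is that the mean over many runs is skewed badly by a few huge (or negative, wrapped) cross-CPU deltas, which would explain the 2724-cycle figure above; pinning the benchmark thread to one CPU (SetThreadAffinityMask on Windows) removes the migration entirely.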