I compiled a C++ function with and without MMX enabled. When I call the function on a XEON processor at 2000 MHz, I find the MMX version takes about 2700 clock cycles while the same function without MMX takes only 420 cycles. On a Pentium III at 1000 MHz it takes about 370 cycles for the non-MMX version and 200 cycles for the MMX version. The behavior is nearly the same for a Celeron at 333 MHz or a Pentium IV at 1700 MHz. My question is: why does the MMX version perform so poorly on the XEON processor?
The operating system on all machines is Windows XP Professional. The machine with the XEON has two processors. Could this be the reason?
The code used to count the cycles was the following:
inline unsigned __int64 ReadTSC()
{
    unsigned __int64 value;
    _asm {
        rdtsc
        mov dword ptr value, eax
        mov dword ptr value + 4, edx
    }
    return value;
}

void CppOpt(char* i1, char* i2, char* res)
{
    #pragma ivdep
    #pragma vector aligned
    for(int i = 0; i < 64; ++i)
    {
        if(i2[i] != 0)
            res[i] = i2[i];
        else
            res[i] = i1[i];
    }
}

// buffer declarations were missing from the original post
__declspec(align(16)) char i1[64], i2[64], res[64];

void main()
{
    unsigned __int64 start = ReadTSC();
    CppOpt(i1, i2, res);
    unsigned __int64 end = ReadTSC();
    end -= start;
}
I'm using version 6 of the Intel C++ Compiler but I tried coding the function in assembly and the results were the same.
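For reference, the per-byte if/else in CppOpt is exactly the pattern a vectorizer turns into a compare-and-select. A scalar sketch of that branchless form (a hypothetical helper in modern C++, not the poster's code or the 2003 toolchain above) shows the idea that PCMPEQB/PAND/PANDN express in MMX:

```cpp
#include <cassert>
#include <cstdint>

// Branchless equivalent of: res[i] = (i2[i] != 0) ? i2[i] : i1[i].
// A compare produces an all-ones or all-zeros byte mask, which then
// blends the two inputs without any branch in the loop body.
void SelectNonZero(const uint8_t* i1, const uint8_t* i2, uint8_t* res, int n)
{
    for (int i = 0; i < n; ++i) {
        uint8_t mask = (i2[i] != 0) ? 0xFF : 0x00;  // compare -> byte mask
        res[i] = (i2[i] & mask) | (i1[i] & ~mask);  // mask-based blend
    }
}
```

Applied 8 or 16 bytes at a time with SIMD registers, this is what the vectorized code paths discussed below generate.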
9 Replies
IMO your benchmark methodology is the likely explanation. You have to:
- initialize the arrays with interesting patterns
- run the test for a sufficiently long period
Out of curiosity I've tested your example (slightly modified, see below) and I get 291-303 clocks on a P4 2800 with the MMX code path (compiled with the latest icl 7.1).
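The two points above can be sketched as a small harness (a generic sketch, not the attached code; std::chrono::steady_clock stands in for RDTSC so it stays portable, and the names are hypothetical):

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>

// Time a single call to fn(), in nanoseconds.
template <typename F>
int64_t TimeOnce(F fn)
{
    auto t0 = std::chrono::steady_clock::now();
    fn();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
}

// Run the test for a sufficiently long period and keep the minimum:
// the minimum over many runs filters out interrupts, context switches,
// and cold caches, which a single timestamp pair cannot.
template <typename F>
int64_t BestOf(F fn, int runs = 100000)
{
    int64_t best = TimeOnce(fn);            // first call doubles as warm-up
    for (int i = 1; i < runs; ++i)
        best = std::min(best, TimeOnce(fn));
    return best;
}
```

Initializing the input arrays with varied ("interesting") patterns matters too, since all-zero inputs can make the branchy version look unrealistically predictable.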
After more experiments with your code, I've found a source-level reformulation that leads to some speedups; see the attached code. I now get 228-237 clocks with the MMX version.
Worth noting: the SSE2 code path (-QxW) provides no significant speedup.
> after more experiments with your code, I've found a
> source-level reformulation leading to some speedups,
> see attached code. I now get 228-237 clock with the
> MMX version.
>
> Worth noting : the SSE2 code path (-QxW) provides no
> significant speedup
I tried your code on my Pentium IV 2.0 GHz, W2k SP3.
Results:
The MMX version (-QxM) is faster than the SSE (-QxK) one
(165-171 clocks vs. 202-203).
The SSE2 version (-QxW) crashes with the exception "Privileged instruction (0xc0000096)" at 0x0040100f:
movdqa xmm4, XMMWORD PTR [eax] ;23.14
After a more careful analysis of the ASM I have observed that the actual test function was not called from the test loop; another *less efficient* version was inlined instead (!)
the improved program spits out more interesting scores :
Test platform :
- P4 2800/i850/2xPC800
- Windows XP Pro
- ICL 7.1 (Build 20030424Z)
- Common compile flags : -O2 -G7
Scores :
base : 538-548
-Qxi : 539-547
-QxM : 64-65
-QxK : 64-66
-QxW : 41
> SSE2 version (-QxW) crashes with The exception
probably a lack of alignment to 16-byte boundaries; fixed in the new code
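On the alignment point: movdqa faults on any memory operand that is not 16-byte aligned, so the buffers themselves must be aligned, not just annotated with "#pragma vector aligned". A minimal illustration in modern C++ (alignas is C++11; the era-appropriate spelling with ICL 7.1 was __declspec(align(16)); the helper name is made up):

```cpp
#include <cassert>
#include <cstdint>

// 16-byte-aligned buffers, as movdqa and "#pragma vector aligned" require.
alignas(16) static char i1[64];
alignas(16) static char i2[64];
alignas(16) static char res[64];

// Check that a pointer sits on a 16-byte boundary.
bool IsAligned16(const void* p)
{
    return (reinterpret_cast<uintptr_t>(p) % 16) == 0;
}
```

An unaligned operand with movdqa typically surfaces as a general-protection fault, which Windows can report under a misleading exception name, consistent with the "Privileged instruction" crash quoted above.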
Really very interesting test.
I played a little with the CppOpt function.
The results are rather strange.
When this function is a free one (not a class member) or
a virtual member function, the loop is vectorized
(corresponding times are 40-60).
But when the same function is a non-virtual or static member of a class, there is no vectorization at all
(times ~ 470).
What is the reason for such a difference?
> Really very interesting test.
Indeed, more than a 10x speedup from compile flags alone is quite impressive; it's an ideal case for branch elimination and SIMD computations.
> When this function is free one (not a member of
> class) or
> virtual member function the loop is vectorized
In my tests with icl 7.1, whenever the function is free (declared static or not, defined before main() or forward-declared and defined after main(), with or without an explicit "inline"), it is always inlined into the test loop with non-vectorized code. When it is visible from the outside (i.e. non-static), another version is generated (this one is vectorized for MMX targets and up) but never called from the main program.
Moreover, when it was a regular member function of some class, or even a virtual function, it was always inlined without vectorization; only the trick with the instance ptr + virtual method enforces an actual call to the vectorized code.
> What is the reason of a such difference?
At this stage of the analysis it looks very much like an issue with the optimizer in ICL 7.1: the inlining "optimization" somewhat cancels out the vectorization optimization.
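One portable way to get the same effect as the "instance ptr + virtual method" trick, without a class, is to call the kernel through a volatile function pointer: the optimizer cannot inline through it, so the out-of-line (vectorizable) body is what actually runs in the timing loop. A sketch under that assumption (scalar body shown; the names are hypothetical, not the attached code):

```cpp
#include <cassert>

// The standalone version of the kernel, which the compiler may vectorize.
void CppOpt(char* i1, char* i2, char* res)
{
    for (int i = 0; i < 64; ++i)
        res[i] = (i2[i] != 0) ? i2[i] : i1[i];
}

// A volatile function pointer forces a real indirect call at every use,
// so inlining cannot substitute a different (non-vectorized) body.
void (*volatile kernel)(char*, char*, char*) = CppOpt;
```

Calling kernel(i1, i2, res) in the benchmark loop then measures the out-of-line code rather than whatever the inliner produced.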
Thank you for your help, but I still can't understand why my code runs 10 times slower on a XEON processor than on a Pentium III or Pentium IV. Here are the benchmarks I got:
Processor Cycles
PentiumII 400 MHz --------- 261
PentiumIII 1100 MHz ------- 169
PentiumIV 1700 MHz -------- 308
XEON 2000 MHz ------------ 2724
To get these results, I ran the code about 1,000,000 times and calculated the mean cycle count.
Have you tested with my latest example above (v0.3)?
Which compiler + flags do you use?
brisemec, look at the end of this thread for a possible explanation :
RDTSC on SMP rigs
Using calls like QueryPerformanceCounter may fix the problem; I'll be interested to hear about your findings.
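The SMP failure mode is easy to test for: with unsynchronized per-CPU TSCs, a thread that migrates between processors can see a later reading come back smaller than an earlier one, so a basic sanity check on any timestamp source is that consecutive readings never go backwards. QueryPerformanceCounter on Windows, or portably std::chrono::steady_clock in modern C++, guarantees that property by contract. A sketch of the check (hypothetical helper name):

```cpp
#include <chrono>

// Sanity-check a timestamp source: consecutive readings must never
// decrease. Raw RDTSC can fail this on SMP boxes when the thread
// migrates between CPUs; steady_clock is monotonic by contract.
bool IsMonotonic(int samples = 10000)
{
    auto prev = std::chrono::steady_clock::now();
    for (int i = 0; i < samples; ++i) {
        auto cur = std::chrono::steady_clock::now();
        if (cur < prev) return false;
        prev = cur;
    }
    return true;
}
```

A Xeon-specific wrinkle is that the mean over many runs is skewed badly by a few huge (or negative, wrapped) cross-CPU deltas, which would explain the 2724-cycle figure above; pinning the benchmark thread to one CPU (SetThreadAffinityMask on Windows) removes the migration entirely.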