64- vs 32-bit performance -- don't know where else to turn

xorpd · ‎04-27-2007

I have tried to get a measure for the performance ratio between 64- and 32-bit code, but the forum I posted to was inhabited by denizens who apparently had such clunkers that they couldn't run the benchmark. All it requires is a machine running 64-bit Windows. There are 10 benchmarks: 2 32-bit ones and 8 64-bit ones. All of them are brief, taking only a couple of seconds on a fast machine. Of the 64-bit benchmarks, it's only necessary to run the KMB_V0.57_2T_X?.exe ones if your machine has no more than 2 cores (counting hyperthreading), or the KMB_V0.57_MT_X?.exe ones if you have more than 2 cores.

Useful results look like this:

processor: Core 2 Duo E6700
speed: 2.66 GHz
sockets * cores * hyperthreads: 1*2*1
KMB_V0.53_MT_FPU.exe: 336.321 million iterations/sec
KMB_V0.53_MT_SSE2.exe: 736.472 million iterations/sec
KMB_V0.57_2T_X1.exe: 817.711 million iterations/sec
KMB_V0.57_2T_X2.exe: 1207.928 million iterations/sec
KMB_V0.57_2T_X3.exe: 1678.401 million iterations/sec
KMB_V0.57_2T_X4.exe: 1967.415 million iterations/sec
KMB_V0.57_MT_X1.exe: 807.112 million iterations/sec
KMB_V0.57_MT_X2.exe: 1186.070 million iterations/sec
KMB_V0.57_MT_X3.exe: 1645.114 million iterations/sec
KMB_V0.57_MT_X4.exe: 1910.082 million iterations/sec

There is already a table available for the 32-bit form of the benchmark, but I think that some of the users of this forum have much better machines and I hope to see better results, especially with the 64-bit code!
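For anyone who wants to turn figures like the sample above into the 64- vs 32-bit ratio being asked about, here is a minimal sketch. It assumes the fastest 32-bit run (the SSE2 one) is the fair baseline, which is my choice and not something the benchmark prescribes, and it only uses the 2T numbers from the sample run.

```python
# Sample results from the post, in million iterations/sec.
# Treating the 32-bit SSE2 run as the baseline is an assumption of this sketch.
BASELINE_32BIT_SSE2 = 736.472

RESULTS_64BIT_2T = {
    "KMB_V0.57_2T_X1.exe": 817.711,
    "KMB_V0.57_2T_X2.exe": 1207.928,
    "KMB_V0.57_2T_X3.exe": 1678.401,
    "KMB_V0.57_2T_X4.exe": 1967.415,
}

def speedup(rate_64, rate_32=BASELINE_32BIT_SSE2):
    """Throughput ratio; values above 1.0 mean the 64-bit run was faster."""
    return rate_64 / rate_32

if __name__ == "__main__":
    for name, rate in RESULTS_64BIT_2T.items():
        print(f"{name}: {speedup(rate):.2f}x the 32-bit SSE2 throughput")
```

On these sample numbers the ratios run from roughly 1.11x to roughly 2.67x.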

xcingix · ‎04-27-2007

I'm looking for software that will allow me to run benchmarks on Vista 64 bit.

TimP · ‎04-28-2007

I don't think many people are going to run a binary DirectX benchmark of unknown origin and purpose.

I don't see how the first reply relates to the original post, nor does it stand on its own.

xorpd · ‎04-30-2007

Tim,

I am going to interpret your response as a question and attempt to provide some answers.

As to the purpose of the benchmark, it attempts to demonstrate the throughput of the CPU by calculating the Mandelbrot set over a few regions. It seems to have OK speed compared to other benchmarks such as quickman or the shootout, because these others don't have as many instruction streams and aren't threaded.
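To make the "million iterations/sec" figure concrete, here is a minimal sketch of what such a benchmark counts: the number of z = z*z + c steps performed over a region, divided by elapsed time. The region, grid size and iteration cap below are assumptions for illustration, not the values the KMB binaries use, and plain Python will of course be far slower than the assembler.

```python
import time

def escape_count(cr, ci, max_iter=256):
    """Number of z = z*z + c steps (starting at z = 0) taken while |z|^2 <= 4."""
    zr = zi = 0.0
    n = 0
    while n < max_iter:
        zr2, zi2 = zr * zr, zi * zi
        if zr2 + zi2 > 4.0:          # point has escaped
            break
        zi = 2.0 * zr * zi + ci      # uses the old zr, so update zi first
        zr = zr2 - zi2 + cr
        n += 1
    return n

def iterations_per_second(width=64, height=48, max_iter=256):
    """Count all inner-loop iterations over a (hypothetical) region and time them."""
    total = 0
    start = time.perf_counter()
    for y in range(height):
        ci = -1.2 + 2.4 * y / (height - 1)
        for x in range(width):
            cr = -2.0 + 3.0 * x / (width - 1)
            total += escape_count(cr, ci, max_iter)
    elapsed = time.perf_counter() - start
    return total, total / elapsed

if __name__ == "__main__":
    total, rate = iterations_per_second()
    print(f"{total} iterations, {rate / 1e6:.3f} million iterations/sec")
```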

Addressing the origin question, the inner loops of the 32-bit code seem to be due to Peter Kankowski, and the authors seem to have also had some exchange with Intel regarding load-balancing among the threads. AMD also has a version where they talk about increasing the number of instruction streams. Paul Hsieh offers a few suggestions, including one he forwards about only checking the loop exit condition every few iterations and backtracking on loop exit. The 32-bit code was my starting point; I translated it into 64-bit code and restructured the inner loops somewhat to get better latency or throughput or to increase the number of instruction streams. I also added more exit points than the original to avoid redundant calculations.
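The "check the exit condition only every few iterations, then backtrack" idea mentioned above can be sketched roughly as follows. This is my own illustration of the general technique, not the code from the benchmark or from Paul Hsieh's page; the block size of 4 and the function names are assumptions. It relies on the fact that an orbit starting from z = 0 does not come back inside |z| <= 2 once it has left.

```python
def escape_count_simple(cr, ci, max_iter=256):
    """Reference version: test the exit condition on every iteration."""
    zr = zi = 0.0
    for n in range(max_iter):
        if zr * zr + zi * zi > 4.0:
            return n
        zr, zi = zr * zr - zi * zi + cr, 2.0 * zr * zi + ci
    return max_iter

def escape_count_blocked(cr, ci, max_iter=256, block=4):
    """Test the exit condition once per block; on exit, replay the block to get the exact count."""
    zr = zi = 0.0
    n = 0
    while n < max_iter:
        steps = min(block, max_iter - n)
        saved_zr, saved_zi = zr, zi          # state to backtrack to
        for _ in range(steps):               # unchecked iterations
            zr, zi = zr * zr - zi * zi + cr, 2.0 * zr * zi + ci
        # "not (<=)" also catches values that overflowed to inf/NaN
        if not (zr * zr + zi * zi <= 4.0):
            zr, zi = saved_zr, saved_zi      # backtrack and replay with checks
            for _ in range(steps):
                if zr * zr + zi * zi > 4.0:
                    return n
                zr, zi = zr * zr - zi * zi + cr, 2.0 * zr * zi + ci
                n += 1
            return n
        n += steps
    return n
```

The replay costs a few extra iterations per escaping point, but the common path has one compare-and-branch per block instead of one per iteration.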

It's only a binary benchmark because, although it only needs the freeware fasm to assemble, linking it requires some Microsoft DLLs that only seem to come as part of rather large downloads. Instructions for building the benchmark are included in the README.TXT file if that is the user's preference, however.

The DirectX is there to enable fast drawing of the Mandelbrot set images. This was present in the 32-bit version that I adapted, and it seems to work OK except that it doesn't draw the images correctly in portrait mode for some reason. I don't know whether that is a bug in the code (it also happens in the 32-bit version) or if there are some software layers involved that aren't communicating with each other properly. When running the benchmarks from the slowest to the fastest you can see the difference in rendering speeds -- the faster ones seem to definitely have places to go and people to see, at least on my machine.

Thank you for taking the time to comment on my benchmark and please don't hesitate to ask for any further clarifications. -- Xorpd!

levicki · ‎05-10-2007

I can say from my personal experience that 64-bit code is always a bit slower unless it directly benefits from the larger address space, from 64-bit integer math or, on Netburst CPUs, from the MSVC2005 compiler using SSE/SSE2 instead of FPU math.

Again, from my experience, the same code tends to perform anywhere between 5% and 15% slower. This is mostly due to reduced instruction decoding bandwidth because of the additional instruction prefixes needed to access the 8 additional GPR and XMM registers.
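For readers unfamiliar with the prefixes in question: in 64-bit mode, touching r8-r15 or xmm8-xmm15, or using a 64-bit operand size, needs a REX prefix byte of the form 0100WRXB in front of the opcode. The sketch below only illustrates the extra byte for a register-to-register ADD; the helper names are made up for this example, and it says nothing by itself about how much decode bandwidth that byte costs on a given CPU.

```python
def rex_prefix(w=0, r=0, x=0, b=0):
    """Build a REX prefix byte: 0100 W R X B."""
    return 0x40 | (w << 3) | (r << 2) | (x << 1) | b

def modrm(mod, reg, rm):
    """Build a ModRM byte from the low 3 bits of each register number."""
    return (mod << 6) | ((reg & 7) << 3) | (rm & 7)

def encode_add_r32(dst, src):
    """Encode ADD dst, src for 32-bit registers 0..15 (opcode 01 /r, register form)."""
    body = bytes([0x01, modrm(3, src, dst)])
    r, b = src >> 3, dst >> 3            # high bit of each register number goes into REX
    if r or b:
        return bytes([rex_prefix(r=r, b=b)]) + body
    return body

if __name__ == "__main__":
    print(encode_add_r32(0, 3).hex())    # add eax, ebx  -> 2 bytes, no prefix
    print(encode_add_r32(8, 9).hex())    # add r8d, r9d  -> 3 bytes, REX prefix first
```

The same pattern applies to SSE2 instructions that use xmm8-xmm15: one more byte per instruction compared with the xmm0-xmm7 form.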

So unless your code belongs to one of the above-mentioned categories that actually benefit from going to x64, you can consider yourself lucky if your 64-bit slowdown is only a single-digit percentage.

My personal results:

6.17 clocks per element for 32-bit assembler
6.57 clocks per element for 64-bit assembler
6.71 clocks per element for 32-bit C compiled with Intel Compiler
7.28 clocks per element for 64-bit C compiled with Intel Compiler

As you can see, even the Intel Compiler can't avoid the slowdown I am mentioning. Of course, this will become visible to you only when you already have a highly optimized piece of code like the one I have here.
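For reference, the clocks-per-element figures above translate into percentages as in the small sketch below. The only inputs are the four numbers quoted in this post; the function name is just for illustration.

```python
def slowdown_percent(clocks_32, clocks_64):
    """How much slower the 64-bit build is, in percent (more clocks = slower)."""
    return (clocks_64 / clocks_32 - 1.0) * 100.0

# (32-bit, 64-bit) clocks per element, as quoted in the post
RESULTS = {
    "assembler": (6.17, 6.57),
    "C, Intel Compiler": (6.71, 7.28),
}

if __name__ == "__main__":
    for name, (c32, c64) in RESULTS.items():
        print(f"{name}: {slowdown_percent(c32, c64):.1f}% slower in 64-bit")
```

That works out to roughly 6.5% for the assembler version and roughly 8.5% for the compiled C version, both inside the 5% to 15% range mentioned earlier.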