Strange performance variations on Xeon with Hyper-Threading (MMX/SSE2)
I have an optimized implementation of a string alignment algorithm. It uses SSE2 instructions heavily, which gives me about a 7-fold speedup on average. A read-only buffer of about 32 KB is shared among all threads. Each thread also needs its own read/write buffer of at most 128 KB and its own read-only buffer of about 64 KB.
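To put the buffer sizes in one place, here is a minimal sketch of the combined working set for a given number of threads. The helper names `working_set_kb` and `fits_in_cache` are made up for illustration and are not part of my actual code. The 128 KB figure is an upper bound, so the result is a worst case.

```c
#include <assert.h>

/* Buffer sizes from the description above, in KB. */
enum {
    SHARED_RO_KB     = 32,   /* read-only buffer shared by all threads */
    PER_THREAD_RW_KB = 128,  /* upper bound on each thread's read/write buffer */
    PER_THREAD_RO_KB = 64    /* each thread's own read-only buffer (approximate) */
};

/* Hypothetical helper: worst-case combined working set of `threads` threads, in KB. */
static unsigned working_set_kb(unsigned threads)
{
    assert(threads > 0);
    return SHARED_RO_KB + threads * (PER_THREAD_RW_KB + PER_THREAD_RO_KB);
}

/* Hypothetical helper: nonzero if that working set fits in a cache of cache_kb KB.
 * The caller supplies the cache size; nothing here knows the real cache sizes. */
static int fits_in_cache(unsigned threads, unsigned cache_kb)
{
    return working_set_kb(threads) <= cache_kb;
}
```

This gives 224 KB for one thread, 416 KB for two and 800 KB for four. Two HT threads on the same physical core share that core's cache, so in that case the 416 KB has to fit in a single cache. Whether it does depends on the actual cache size of each CPU, which I only know for certain on the P4 (2 MB).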
Here are the numbers I get.
Higher is better; the score scales linearly with speed.
Single Xeon 2.8 box with HT enabled:
single thread performance: 312
two thread performance: 604
Dual Xeon 2.8 box with HT enabled:
single thread performance: 295
dual thread performance: 500
three threads: 445
four threads: 390
Single P4 3.0 box (2 MB cache) with HT:
single thread performance: 295
two thread performance: 500
four thread performance: 460
My 1.7 Centrino laptop:
single thread: 439
two threads: 425
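To compare the machines more easily, the scores can be turned into a speedup over the single-thread run on the same box. This is only a sketch of the arithmetic. `speedup` and `efficiency` are hypothetical helper names, and it assumes the score really is proportional to throughput.

```c
#include <assert.h>

/* Hypothetical helper: speedup of a multi-thread run over the
 * single-thread run on the same machine. */
static double speedup(double score_n, double score_1)
{
    assert(score_1 > 0.0);
    return score_n / score_1;
}

/* Hypothetical helper: parallel efficiency, i.e. speedup divided by the
 * number of threads. 1.0 would be perfect linear scaling. */
static double efficiency(double score_n, double score_1, unsigned threads)
{
    assert(threads > 0);
    return speedup(score_n, score_1) / threads;
}
```

With the numbers above, the single Xeon gets about 1.94x from two threads (604 / 312). The dual Xeon gets about 1.69x from two threads (500 / 295) but only about 1.32x from four (390 / 295). The Centrino gets about 0.97x (425 / 439), so it is slightly slower with two threads than with one.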
Is there a good explanation?
I am baffled. Does it have something to do with the cache? Please help me understand it.