I have been running a number of sparse matrix-vector benchmarks on different machines, and the results are confusing me. I was wondering if anyone could shed a little light on the situation. My code simply reads a matrix and a vector into main memory and performs the matrix-vector product on them 500 times. The code is compiled with the Intel compiler.
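A minimal sketch of this kind of benchmark kernel, for readers who want something concrete. It assumes CSR (compressed sparse row) storage and double precision, neither of which is stated in the post, and the names spmv_csr and run_spmv are illustrative rather than taken from the actual code.

```c
#include <stddef.h>

/* y = A * x for a matrix in CSR form (assumed format, not from the post).
   row_ptr has n_rows + 1 entries; col_idx and val hold the nonzeros. */
void spmv_csr(int n_rows, const int *row_ptr, const int *col_idx,
              const double *val, const double *x, double *y)
{
    for (int i = 0; i < n_rows; ++i) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            sum += val[k] * x[col_idx[k]];
        y[i] = sum;
    }
}

/* Repeat the product reps times (the post uses 500 repetitions). */
void run_spmv(int n_rows, const int *row_ptr, const int *col_idx,
              const double *val, const double *x, double *y, int reps)
{
    for (int r = 0; r < reps; ++r)
        spmv_csr(n_rows, row_ptr, col_idx, val, x, y);
}
```

Each nonzero is read once per product and used in a single multiply-add, which is why a kernel like this tends to be bound by memory bandwidth rather than by arithmetic.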
I ran this code on a Pentium 4 processor with an 800MHz front side bus, and it managed to outperform an Intel Woodcrest Core 2 processor, which claims to have a 1333MHz front side bus. These calculations are usually limited by front side bus speed, so the faster front side bus should perform better.
The Woodcrest I was using has 4 banks of DDR2 memory, as I read that to achieve the full memory bandwidth in these processors you need at least 4 banks of matched DDR. The performance I am getting from the Woodcrest is about what I would expect from a processor with a front side bus of 667MHz. The program I am running is threaded; each thread performs the same number of memory accesses and works on independent arrays of data. The threaded version achieves the same memory bandwidth as the non-threaded version.
I also ran the code on another Core 2 processor (Conroe, I think) with a 1066MHz FSB, and it achieved the full memory bandwidth I would expect from a 1066MHz FSB.
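For reference, the expected figures for each bus speed can be worked out as in the sketch below. It assumes a 64-bit (8-byte) wide front side bus and treats the quoted "MHz" value as the effective transfer rate in millions of transfers per second. The function names are illustrative.

```c
/* Theoretical peak front side bus bandwidth in MB/s.
   Assumption: 8-byte wide data bus, rate given in millions of transfers/s. */
double fsb_peak_mb_per_s(double mega_transfers_per_s)
{
    const double bus_width_bytes = 8.0;
    return mega_transfers_per_s * bus_width_bytes;
}

/* Achieved bandwidth in MB/s from the bytes a benchmark moved
   and the wall-clock time it took. */
double measured_mb_per_s(double bytes_moved, double seconds)
{
    return bytes_moved / 1e6 / seconds;
}
```

Under those assumptions the peaks are 6400 MB/s at 800MHz, 5336 MB/s at 667MHz, 8528 MB/s at 1066MHz and 10664 MB/s at 1333MHz. Measured bandwidth normally falls somewhat short of these peaks.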
Any ideas on what is happening would be greatly appreciated. I am not at this time looking for optimal performance. I just want to know why my Pentium 4 can beat the Woodcrest server. If you want any other information, don't hesitate to ask. Thank you so much for your time.
This sounds like a memory placement issue: cache interference due to sharing of identifiers.
Allocate an array of the vectors you use to multiply with the matrix. Allocate enough of these vectors that the memory consumed by the array of vectors exceeds the size of the cache. Populate each vector in the array with the same initial values. Then run the timing test for each vector in the array, performing the product with the matrix. See whether the performance measurements are flat or saw-toothed.
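The test above can be sketched roughly as follows. It assumes the matrix is in CSR form with double values, which the thread does not specify. The name time_per_vector, its parameters, and the use of clock() for timing are illustrative choices. The checksum exists only so the compiler cannot discard the products.

```c
#include <stdlib.h>
#include <time.h>

/* y = A * x with A in CSR form (assumed format). */
void spmv_csr(int n_rows, const int *row_ptr, const int *col_idx,
              const double *val, const double *x, double *y)
{
    for (int i = 0; i < n_rows; ++i) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            sum += val[k] * x[col_idx[k]];
        y[i] = sum;
    }
}

/* Times reps products for each of n_vecs identically filled vectors.
   seconds_out must have room for n_vecs entries.
   Returns 0 on success, -1 if an allocation fails. */
int time_per_vector(int n_rows, int n_cols, const int *row_ptr,
                    const int *col_idx, const double *val,
                    int n_vecs, int reps,
                    double *seconds_out, double *checksum_out)
{
    double **x = calloc((size_t)n_vecs, sizeof *x); /* all NULL to start */
    double *y = malloc((size_t)n_rows * sizeof *y);
    int ok = (x != NULL && y != NULL);

    /* Each vector is a separate allocation, so each one lands at a
       different address; all are filled with the same values. */
    for (int v = 0; ok && v < n_vecs; ++v) {
        x[v] = malloc((size_t)n_cols * sizeof **x);
        if (x[v] == NULL) { ok = 0; break; }
        for (int j = 0; j < n_cols; ++j)
            x[v][j] = 1.0;
    }

    double checksum = 0.0;
    for (int v = 0; ok && v < n_vecs; ++v) {
        clock_t t0 = clock();
        for (int r = 0; r < reps; ++r)
            spmv_csr(n_rows, row_ptr, col_idx, val, x[v], y);
        seconds_out[v] = (double)(clock() - t0) / CLOCKS_PER_SEC;
        for (int i = 0; i < n_rows; ++i)
            checksum += y[i]; /* keep the result live */
    }
    if (ok)
        *checksum_out = checksum;

    if (x != NULL)
        for (int v = 0; v < n_vecs; ++v)
            free(x[v]); /* free(NULL) is a no-op */
    free(x);
    free(y);
    return ok ? 0 : -1;
}
```

Choose n_vecs so that n_vecs * n_cols * sizeof(double) is larger than the last-level cache, then print or plot seconds_out against the vector index to see whether the timings are flat or saw-toothed.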