I have been running a number of sparse matrix-vector benchmarks on different machines and the results are confusing me; I was wondering if anyone could shed a little light on the situation. My code simply reads a matrix and a vector into main memory and performs the matrix-vector product on them 500 times. The code is compiled with the Intel compiler.
I ran this code on a Pentium 4 processor with an 800 MHz front-side bus and it managed to outperform an Intel Woodcrest Core 2 processor, which claims to have a 1333 MHz front-side bus. These calculations are usually limited by front-side bus speed, so the faster front-side bus should perform better.
The Woodcrest I was using has 4 banks of DDR2 memory, as I read that to achieve the full memory bandwidth on these processors you need at least 4 banks of matched DDR. The performance I am getting from the Woodcrest is about what I would expect from a processor with a 667 MHz front-side bus. The program I am running is threaded; each thread performs the same number of memory accesses and works on independent arrays of data. The threaded version achieves the same memory bandwidth as the non-threaded version.
I also ran the code on another Core 2 processor (Conroe, I think) with a 1066 MHz FSB, and it achieved the full memory bandwidth I would expect from a 1066 MHz FSB.
Any ideas on what is happening would be greatly appreciated. I am not looking for optimal performance at this time; I just want to know why my Pentium 4 can beat the Woodcrest server. If you want any other information, don't hesitate to ask. Thank you so much for your time.
This sounds like a memory placement issue: cache interference caused by different allocations mapping to the same cache sets.
Allocate an array of the vectors you use to multiply with the matrix. Allocate enough of these vectors that the memory they consume exceeds the size of the cache. Populate each vector in the array with the same initial values. Then run the timing test for each vector in the array, performing the product with the matrix, and see whether the performance measurements are flat or saw-toothed.
If you use Linux affinity tools such as taskset, or direct threading affinity calls, you must observe the usual scrambled numbering, where cores 0 and 2 would be on one socket and 1 and 3 on the other, as opposed to the straight order you would see on an AMD or IA-64 system.
With KMP_AFFINITY, you can deal with the numbers directly, or use the compact or scatter setting.
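As a concrete illustration of the two approaches above (the core numbering and the benchmark binary name are hypothetical; check your own topology before pinning):

```shell
# Pin a 2-thread run to the two cores of one socket. With the
# scrambled numbering described above, one socket is CPUs 0 and 2,
# not 0 and 1.
taskset -c 0,2 ./spmv_bench

# Or let the Intel OpenMP runtime place the threads:
export KMP_AFFINITY=compact   # pack threads onto one socket first
./spmv_bench

export KMP_AFFINITY=scatter   # spread threads across sockets
./spmv_bench
```

For a bandwidth-bound kernel like this, compact vs. scatter can matter: scatter gives each thread its own socket's bus, while compact keeps threads sharing one front-side bus.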
In our experience, problems with poor affinity are more severe on Windows, and KMP_AFFINITY doesn't entirely solve them.
If you don't control thread placement, the snoop filter (on "Greencreek" chipsets which have it) could present a partial solution to the problem. It absorbs some bus cycles in return for avoiding repeated bus accesses when the same cache line is needed on both sockets.
Thank you so much for the replies, and sorry about my delay getting back to you. I will implement some of those ideas now and let you know how it goes.
Just to clear up a few points: my machine only has one processor package in it (one dual-core 3 GHz Xeon). It has a second socket, but there is no processor in it. Could this cause any problems? Do you need both sockets populated to achieve maximum memory bandwidth?
Thank you again for all your help,