Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Woodcrest performance

mcgettrs
Beginner

Hi,

I have been running a number of sparse matrix by vector benchmarks on different machines, and the results are confusing me. I was wondering if anyone could shed a little light on the situation. My code simply reads a matrix and a vector into main memory and performs the matrix by vector calculation on them 500 times. The code is compiled with the Intel compiler.
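For readers who want something concrete, the kernel described here can be sketched as follows. The post does not say which storage format is used, so compressed sparse row (CSR) is an assumption, and the names `spmv_csr` and `spmv_repeat` are hypothetical.

```c
#include <stddef.h>

/* y = A * x for a sparse matrix A in CSR form (assumed layout, not
 * confirmed by the post). row_ptr has n_rows + 1 entries; col_idx and
 * val hold the column index and value of each nonzero. */
void spmv_csr(size_t n_rows, const size_t *row_ptr, const size_t *col_idx,
              const double *val, const double *x, double *y)
{
    for (size_t i = 0; i < n_rows; ++i) {
        double sum = 0.0;
        for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            sum += val[k] * x[col_idx[k]];
        y[i] = sum;
    }
}

/* Repeat the product reps times; the benchmark in the post uses 500. */
void spmv_repeat(size_t reps, size_t n_rows, const size_t *row_ptr,
                 const size_t *col_idx, const double *val,
                 const double *x, double *y)
{
    for (size_t r = 0; r < reps; ++r)
        spmv_csr(n_rows, row_ptr, col_idx, val, x, y);
}
```

Each nonzero is read once per product and not reused, so when the matrix is larger than the cache, the run time of a kernel like this is typically set by memory bandwidth rather than by arithmetic.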

I ran this code on a Pentium 4 processor with an 800MHz front side bus, and it managed to outperform an Intel Woodcrest Core 2 processor, which claims to have a 1333MHz front side bus. These calculations are usually limited by the front side bus speed, so the faster front side bus should perform better.

The Woodcrest I was using has 4 banks of DDR2 memory, as I read that to achieve the full memory bandwidth in these processors you need at least 4 banks of matched DDR. The performance I am getting from the Woodcrest is about what I would expect from a processor with a front side bus of 667MHz. The program I am running is threaded; each thread has the same number of memory accesses and works on independent arrays of data. The threaded version achieves the same memory bandwidth as the non-threaded version.
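To put rough numbers on the bus speeds being compared, here is a small sketch of the theoretical peak calculation. It assumes a 64-bit (8-byte) wide front side bus and treats the quoted "MHz" ratings as millions of transfers per second; the function name `fsb_peak_gbs` is hypothetical, and sustained bandwidth is normally well below these peaks.

```c
/* Theoretical peak bandwidth in GB/s (10^9 bytes/s) of a front side bus,
 * assuming an 8-byte-wide data path and a rating given in millions of
 * transfers per second (the "MHz" figure quoted in the post). */
double fsb_peak_gbs(double mega_transfers_per_s)
{
    const double bytes_per_transfer = 8.0;  /* assumed 64-bit bus */
    return mega_transfers_per_s * 1e6 * bytes_per_transfer / 1e9;
}
```

Under these assumptions the four ratings mentioned work out to about 5.3 GB/s (667), 6.4 GB/s (800), 8.5 GB/s (1066) and 10.7 GB/s (1333).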

I also ran the code on another Core 2 processor (Conroe, I think) with a 1066MHz FSB, and it achieved the full memory bandwidth I would expect from a 1066MHz FSB.

Any ideas on what is happening would be greatly appreciated. I am not at this time looking for optimal performance. I just want to know why my Pentium 4 can beat the Woodcrest server. If you want any other information, don't hesitate to ask. Thank you so much for your time.

Seamas

jimdempseyatthecove
Honored Contributor III

This sounds like a memory placement issue. Cache interference due to sharing of identifiers.

Try this.

Allocate an array of the vectors you use to multiply with the matrix. Allocate enough of these vectors that the memory consumed by the array of vectors exceeds the size of the cache. Populate each vector in the array with the same initial values. Then run the timing test for each vector in the array, performing the product with the matrix. See whether the performance measurements are flat or sawtoothed.
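One possible way to set up this experiment is sketched below. All names are hypothetical, the kernel is passed in as a callback so any matrix-vector routine can be plugged in, and `clock()` is used only because it is portable; a finer-grained timer would be preferable for short runs. The cache size is supplied by the caller (for example 4 MB, an assumed figure for the shared L2), and `vec_len` must be nonzero.

```c
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Smallest number of copies whose combined size exceeds cache_bytes.
 * Assumes vec_len > 0. */
size_t copies_to_exceed_cache(size_t vec_len, size_t cache_bytes)
{
    size_t vec_bytes = vec_len * sizeof(double);
    return cache_bytes / vec_bytes + 1;
}

/* Allocate n_copies identical vectors back to back, each initialised
 * from x. Returns NULL on allocation failure; the caller frees it. */
double *make_vector_copies(const double *x, size_t vec_len, size_t n_copies)
{
    double *buf = malloc(n_copies * vec_len * sizeof *buf);
    if (buf == NULL)
        return NULL;
    for (size_t c = 0; c < n_copies; ++c)
        memcpy(buf + c * vec_len, x, vec_len * sizeof *buf);
    return buf;
}

/* Hypothetical callback type: run the matrix product with vector x. */
typedef void (*kernel_fn)(const double *x, void *ctx);

/* Time one kernel call per copy; seconds[c] receives the CPU time
 * measured for copy c, so the series can be checked for a flat or
 * sawtooth pattern. */
void time_each_copy(kernel_fn kernel, void *ctx, const double *copies,
                    size_t vec_len, size_t n_copies, double *seconds)
{
    for (size_t c = 0; c < n_copies; ++c) {
        clock_t t0 = clock();
        kernel(copies + c * vec_len, ctx);
        seconds[c] = (double)(clock() - t0) / CLOCKS_PER_SEC;
    }
}
```

Because every copy holds the same values, any variation in `seconds` across copies comes from where each vector sits in memory, not from the arithmetic.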

Jim Dempsey

TimP
Honored Contributor III
As Jim indicated, you should try to avoid cache interference between the 2 sockets on Woodcrest. In part, placing threads which use adjacent memory regions on the same cache (socket) should help. If you are using Intel compiler OpenMP, you should try the KMP_AFFINITY settings. If your compiler is over a year old and doesn't support this, upgrade.
If you use Linux affinity tools such as taskset, or direct threading affinity calls, you must observe the usual scrambled numbering, where cores 0 and 2 would be on one socket, and 1 and 3 on the other, as opposed to the straight order you would see on an AMD or IA-64.
With KMP_AFFINITY, you can deal with the numbers directly, or use the compact or scatter setting.
In our experience, problems with poor affinity are more severe on Windows, and KMP_AFFINITY doesn't entirely solve them.
If you don't control thread placement, the snoop filter (on "Greencreek" models which have it) could present a partial solution to the problem. It absorbs some of the bus cycles in return for avoiding repeated bus access when the same cache line is needed on both sockets.
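As an illustration of the interleaved numbering described above, the sketch below maps thread indices to logical CPU numbers so that neighbouring threads land on the same socket. It is not how KMP_AFFINITY is implemented; the function names are hypothetical, and the numbering (CPU number modulo the socket count gives the socket) is an assumption that should be checked on the actual machine, for example in /proc/cpuinfo on Linux.

```c
/* Socket that owns a logical CPU under the assumed interleaved
 * numbering: with 2 sockets, CPUs 0 and 2 map to socket 0 and
 * CPUs 1 and 3 map to socket 1. */
int socket_of_cpu(int cpu, int n_sockets)
{
    return cpu % n_sockets;
}

/* Logical CPU for a thread when consecutive threads should fill one
 * socket before moving to the next. With 2 sockets of 2 cores each,
 * threads 0..3 get CPUs 0, 2, 1, 3. */
int cpu_for_thread_same_socket(int thread, int n_sockets, int cores_per_socket)
{
    int socket = thread / cores_per_socket;
    int core = thread % cores_per_socket;
    return core * n_sockets + socket;
}
```

The returned number is what would be passed to a tool such as taskset or to an affinity call when pinning that thread.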
mcgettrs
Beginner
Hey Jim & Tim,

Thank you so much for the replies. Sorry about my delay in getting back to you. I will implement some of those ideas now and let you know how it goes.
Just to clear up a few points: my machine only has one processor package in it (one dual-core 3GHz Xeon processor). It has a second socket, but there is no processor in it. Could this cause any problems? Do you need both sockets populated to achieve maximum memory bandwidth?

Thank you again for all your help,
Seamas
TimP
Honored Contributor III
Total memory bandwidth certainly increases with both sockets in use, in view of the dual bus architecture. With one socket, it's doubtful whether you have any better bandwidth than a Core 2 Duo.