I am running some experiments to compare a Xeon with an AMD machine, using perf on both. My concern is that the event counts on the Xeon are up to a thousand times higher than those on the AMD, even though the runtime on the Xeon is much better. I am measuring cache events, instructions, and cpu-clock on both machines.
My application is a 1000x1000 matrix multiplication, and I am running it sequentially, not in parallel yet.
Can you explain why there are such large differences (up to a thousand times) between the machines? For example, cache-references is 713,432 on the AMD but 127,708,365 on the Xeon.
This is the Xeon configuration:
This is the AMD configuration:
AMD Opteron 2427
Take a look at this example:
=== results from AMD ======
perf stat -e cache-references,cache-misses,branch-instructions,cpu-clock bpsh 15 ./mm1 1000 1
Program runs in 17.52 seconds
Performance counter stats for 'bpsh 15 ./mm1 1000 1':
35,538 cache-misses # 4.981 % of all cache refs
2.701875 cpu-clock (msec)
17.560428347 seconds time elapsed
=== results from Xeon ======
Now, I compiled mm1 on the Xeon as offload, but there is no #pragma offload directive, so the code runs entirely on the Xeon (the processor).
perf stat -e cache-references,cache-misses,branch-instructions,cpu-clock ./mm1 1000 1
Program runs in 2.69 seconds
Performance counter stats for './mm1 1000 1':
477,245 cache-misses # 0.374 % of all cache refs
2701.183114 cpu-clock (msec)
2.701594439 seconds time elapsed
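To put numbers on the gap, here is a small Python sketch that computes the Xeon/AMD ratio for each counter and compares cpu-clock with the elapsed time on each machine. The values are copied from the perf output and the figures quoted above; the names `amd`, `xeon`, `ratios`, and `cpu_share` are only mine for illustration.

```python
# Counter values copied from the perf output and the figures quoted in this post.
amd = {
    "cache-references": 713_432,
    "cache-misses": 35_538,
    "cpu-clock (msec)": 2.701875,
    "elapsed (s)": 17.560428347,
}
xeon = {
    "cache-references": 127_708_365,
    "cache-misses": 477_245,
    "cpu-clock (msec)": 2701.183114,
    "elapsed (s)": 2.701594439,
}


def ratios(a, b):
    """Return b/a for every counter present in both dictionaries."""
    return {name: b[name] / a[name] for name in a if name in b}


def cpu_share(stats):
    """Fraction of the elapsed wall-clock time that cpu-clock accounts for."""
    return (stats["cpu-clock (msec)"] / 1000.0) / stats["elapsed (s)"]


xeon_over_amd = ratios(amd, xeon)
for name, value in xeon_over_amd.items():
    print(f"{name:20s} Xeon/AMD = {value:10.2f}")

print(f"AMD  cpu-clock / elapsed = {cpu_share(amd):.6f}")
print(f"Xeon cpu-clock / elapsed = {cpu_share(xeon):.6f}")
```

With these numbers, cache-references differs by a factor of about 179, cache-misses by about 13, and cpu-clock by almost exactly 1000. On the Xeon, cpu-clock matches the elapsed time almost exactly, while on the AMD it accounts for only a tiny fraction of the 17.5 seconds.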
Do you have any idea why the results are so different?