
Profiling events on Xeon using perf

Hi,

I am running some experiments to compare a Xeon machine with an AMD machine, using perf on both. My concern is that the event counts on the Xeon are up to a thousand times higher than those on the AMD, yet the runtime on the Xeon is much shorter. I am measuring cache events, branch instructions, and cpu-clock on both machines.

My application is a 1000x1000 matrix multiplication, and for now I am running it sequentially, not in parallel.
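
As a rough sanity check on what the counters should look like, the amount of work in this problem can be estimated. This is only a sketch: it assumes mm1 uses the naive triple-loop algorithm and 8-byte double elements, neither of which is stated above, and the function name is made up for illustration.

```python
def naive_matmul_work(n, bytes_per_element=8):
    """Estimate work for an n x n naive (triple-loop) matrix multiplication.

    Assumptions (not confirmed for mm1): naive algorithm, 8-byte doubles.
    """
    multiply_adds = n ** 3                         # one multiply-add per inner-loop step
    footprint_bytes = 3 * n * n * bytes_per_element  # matrices A, B and C
    return multiply_adds, footprint_bytes


ops, footprint = naive_matmul_work(1000)
print(f"{ops:,} multiply-adds")            # 1,000,000,000 multiply-adds
print(f"{footprint / 1e6:.1f} MB of matrix data")  # 24.0 MB of matrix data
```

Under those assumptions, a complete run should execute on the order of a billion inner-loop steps, so event counts in the hundreds of millions would not be surprising.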

 

Can you explain why there are such large differences between the machines? For example, cache-references is 713,432 on the AMD and 127,708,365 on the Xeon.

 

This is the Xeon configuration:

  • Model: Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
  • CPU MHz: 1200.000
  • CPU cores per Processor: 8
  • Host Physical Memory: 65933 MB
  • Architecture: x86_64
  • L1 dcache: 32K
  • L1 icache: 32K
  • L2 cache: 256K
  • L3 cache: 20480K
  • cache_alignment: 64

 

This is the AMD configuration:

AMD Opteron 2427

  • Instruction set: x86-64
  • Speed: 2.2 GHz
  • L1 instruction cache: 6 x 64 KB
  • L1 data cache: 6 x 64 KB
  • L2 cache: 6 x 512 KB
  • L3 cache: 6 MB

Take a look at this example:

=== results from AMD ======

perf stat -e cache-references,cache-misses,branch-instructions,cpu-clock bpsh 15 ./mm1 1000 1

Program runs in 17.52 seconds

 Performance counter stats for 'bpsh 15 ./mm1 1000 1':

           713,432      cache-references
            35,538      cache-misses              #    4.981 % of all cache refs
           411,916      branch-instructions
          2.701875      cpu-clock (msec)

      17.560428347 seconds time elapsed

 

=== results from Xeon ======

Next, I compiled mm1 on the Xeon as an offload build, but the code contains no #pragma offload directive, so it runs entirely on the Xeon host processor.

 

perf stat -e cache-references,cache-misses,branch-instructions,cpu-clock ./mm1 1000 1

Program runs in 2.69 seconds

 Performance counter stats for './mm1 1000 1':

       127,708,365      cache-references
           477,245      cache-misses              #    0.374 % of all cache refs
       507,201,088      branch-instructions
       2701.183114      cpu-clock (msec)

       2.701594439 seconds time elapsed
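
To make the comparison concrete, the two outputs can be reduced to a few ratios. The sketch below only does arithmetic on the numbers copied from the perf output above; the dictionary layout and helper names are made up for illustration.

```python
# Counter values copied from the two perf stat outputs above.
runs = {
    "AMD": {
        "cache_refs": 713_432,
        "cache_misses": 35_538,
        "branches": 411_916,
        "cpu_clock_msec": 2.701875,
        "elapsed_sec": 17.560428347,
    },
    "Xeon": {
        "cache_refs": 127_708_365,
        "cache_misses": 477_245,
        "branches": 507_201_088,
        "cpu_clock_msec": 2701.183114,
        "elapsed_sec": 2.701594439,
    },
}


def miss_rate(run):
    """Cache misses as a percentage of cache references."""
    return 100.0 * run["cache_misses"] / run["cache_refs"]


def cpu_utilization(run):
    """Fraction of the elapsed wall time covered by the cpu-clock counter."""
    return (run["cpu_clock_msec"] / 1000.0) / run["elapsed_sec"]


for name, run in runs.items():
    print(f"{name}: miss rate {miss_rate(run):.3f} %, "
          f"cpu-clock covers {100 * cpu_utilization(run):.2f} % of elapsed time")

for key in ("cache_refs", "branches", "cpu_clock_msec"):
    ratio = runs["Xeon"][key] / runs["AMD"][key]
    print(f"Xeon/AMD ratio for {key}: {ratio:.0f}x")
```

With these numbers, cpu-clock accounts for almost all of the elapsed time on the Xeon but only a tiny fraction of it on the AMD machine. One possible reading is that perf on the AMD side counted only the bpsh launcher and not mm1 itself, but that is a guess the output alone does not confirm.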

Do you have any idea why the results are so different?

Thanks,

 
