Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

Performance difference between i7 and Xeon CPU for Intel MKL

Yinggong_Z_
Beginner
730 Views

Hi, I think I may have run into a weird problem with Intel MKL.

I installed Visual Studio 2013 and Intel Composer XE 2013 SP1 on two machines: a PC running Windows 8.1 with an Intel Core i5-3470 at 3.2 GHz, and a server running Windows Server 2008 R2 SP1 with a Xeon E7330 at 2.4 GHz.

Then I ran the test program dgemm_threading_effect_example under C:\Program Files (x86)\Intel\Composer XE 2013 SP1\Samples\en_US\mkl\tutorials\mkl_mmx_c (I changed the matrix size to 2000*2000*1000).

With one thread, the running time is 1049 ms on the server and 349 ms on the PC. The gap remains as the number of threads is increased. I also compared a neural network toolkit with the same settings on both machines and got the same result. This really puzzles me, because I am trying to run some experiments with MKL on the server, which supports more threads.

So can anyone explain the performance difference?

Thanks a lot!

YINGGONG.

 

5 Replies
Zhang_Z_Intel
Employee

This shouldn't be a surprise. Intel Xeon E7330 is a very old product. Released in 2007, it was based on the "Core" architecture, even earlier than the "Nehalem" architecture. Intel Core i5-3470, on the other hand, is very recent. It was released in 2012. It is based on the "Ivy Bridge" architecture, which is 2-3 generations more advanced than the "Core" architecture. Besides, both CPUs have the same number of cores (4). The newer desktop CPU does have a significant performance advantage over the older server CPU.
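To put the single-thread timings from the question in perspective, here is a minimal sketch that converts them into achieved GFLOP/s. It assumes the reported times (349 ms and 1049 ms) are for one matrix multiplication with the dimensions 2000, 2000 and 1000 given in the question, and it uses the usual 2*m*n*k operation count for a matrix product. The function name `dgemm_gflops` is only illustrative.

```python
def dgemm_gflops(m, n, k, seconds):
    """Achieved GFLOP/s for one m-by-k times k-by-n matrix product."""
    flops = 2.0 * m * n * k  # one multiply and one add per inner-product term
    return flops / seconds / 1e9

# Dimensions and timings as quoted in the question (assumed to be per call).
pc_gflops = dgemm_gflops(2000, 2000, 1000, 0.349)      # Core i5-3470
server_gflops = dgemm_gflops(2000, 2000, 1000, 1.049)  # Xeon E7330

print(f"PC:     {pc_gflops:.1f} GFLOP/s")
print(f"Server: {server_gflops:.1f} GFLOP/s")
print(f"Ratio:  {pc_gflops / server_gflops:.2f}x")
```

Under those assumptions the PC reaches roughly 23 GFLOP/s and the server roughly 7.6 GFLOP/s on one thread, a ratio of about 3x.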

Yinggong_Z_
Beginner

Thanks for your help.

I tried my own program (not the MKL benchmark test) on another server, whose CPU is an E7-4850 with 8 processors. The performance is similar to that of the server with the E7330.

Threads    Running time
           i5-3470    Xeon E7330
1          32 s       113 s
2          22 s       90 s
3          18 s       66 s
15         NA         43 s

 

Even with many threads, the Xeon is still slower than the i5 with one thread.

My question is: which matters more for MKL performance, the clock frequency or the number of threads?

Thanks a lot!

YINGGONG.

Zhang_Z_Intel
Employee

It really depends on the MKL functions you call and the problem sizes they operate on. If you are calling BLAS level 3 functions on large matrices, then I'd say more threads would help performance more than a higher CPU frequency would. Also, don't forget other important factors, such as the instruction set (for example, AVX has twice the SIMD width of SSE), cache size, etc.
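As a rough way to see how well the program in the timing table scales with threads, the sketch below computes the speedup relative to one thread and the serial fraction implied by Amdahl's law, S(n) = 1 / (f + (1 - f) / n). Using Amdahl's law here is an assumption made for illustration, not something the posters proposed; it ignores memory bandwidth, NUMA effects and thread placement, so the numbers are only indicative. The column labels are taken from the table as posted.

```python
def speedup(t1, tn):
    """Speedup of an n-thread run relative to the 1-thread run."""
    return t1 / tn

def amdahl_serial_fraction(s, n):
    """Solve S = 1 / (f + (1 - f) / n) for the serial fraction f."""
    return (1.0 / s - 1.0 / n) / (1.0 - 1.0 / n)

# Running times in seconds, keyed by thread count, from the table in the thread.
timings = {
    "i5-3470": {1: 32.0, 2: 22.0, 3: 18.0},
    "Xeon E7330": {1: 113.0, 2: 90.0, 3: 66.0, 15: 43.0},
}

for cpu, runs in timings.items():
    t1 = runs[1]
    for n in sorted(runs):
        if n == 1:
            continue
        s = speedup(t1, runs[n])
        f = amdahl_serial_fraction(s, n)
        print(f"{cpu:11s} n={n:2d}  speedup={s:.2f}  implied serial fraction={f:.2f}")
```

If the implied serial fraction stays large as threads are added, a sizable part of the run time is not scaling with thread count, and adding cores will help less than the core count alone suggests.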

Bernard
Valued Contributor I

As was said, the Ivy Bridge Core i5 can easily outperform its older counterpart. One reason for the faster performance could be the use of the wider YMMn registers, which theoretically doubles vectorized throughput. Port 0 and Port 1 of Ivy Bridge can together issue one 256-bit fadd and one 256-bit fmul per cycle, which speeds up execution.
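The register-width argument can be turned into a back-of-the-envelope peak estimate. The sketch below assumes one SIMD add and one SIMD multiply issued per cycle on both CPUs, 256-bit vectors on the i5-3470 and 128-bit SSE vectors on the E7330, and the base clock speeds quoted in the question (no turbo). The 128-bit add-plus-multiply figure for the older Xeon is an assumption about the "Core" microarchitecture, not something stated in this thread.

```python
def peak_gflops_per_core(ghz, simd_bits, fp_bits=64, ops_per_cycle=2):
    """Theoretical peak per core: clock * SIMD lanes * SIMD instructions per cycle.

    ops_per_cycle=2 models one vector add plus one vector multiply each cycle.
    """
    lanes = simd_bits // fp_bits  # doubles per SIMD register
    return ghz * lanes * ops_per_cycle

ivy_bridge = peak_gflops_per_core(3.2, 256)  # Core i5-3470, AVX (YMM registers)
core_arch = peak_gflops_per_core(2.4, 128)   # Xeon E7330, SSE (XMM registers), assumed

print(f"i5-3470 peak:    {ivy_bridge:.1f} GFLOP/s per core")
print(f"Xeon E7330 peak: {core_arch:.1f} GFLOP/s per core")
print(f"Ratio:           {ivy_bridge / core_arch:.2f}x")
```

With these assumptions the per-core peak ratio comes out at about 2.7x, which is in the same range as the roughly 3x single-thread gap reported in the question (1049 ms versus 349 ms).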

TimP
Honored Contributor III

On the multi-CPU platform you would likely need affinity settings to see full performance. As well as trying

KMP_AFFINITY=compact,verbose

you might try

KMP_AFFINITY=scatter,verbose

or specific settings for certain numbers of threads, such as

KMP_AFFINITY=compact,3,verbose for 2, 4, 8 threads

compact,1 for 16 threads

I'm suggesting examining the verbose output because the CPU was off support before development and testing of your compiler version began. Some strange core numberings in the BIOS were used back then; possibly it's scrambled so that compact should give good performance up to 4 threads (and that may be all that odd platform was good for).
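One way to try these settings systematically is to launch the benchmark as a child process with the variables set in its environment, since the OpenMP runtime reads KMP_AFFINITY at startup. The following is a minimal sketch, not a tested recipe for this platform: the helper names `build_env` and `run_benchmark` are hypothetical, and it assumes the program honours OMP_NUM_THREADS (a program that sets the thread count itself may ignore it).

```python
import os
import subprocess

# Affinity strings quoted from the post above; the best one is machine-specific.
AFFINITY_SETTINGS = ["compact,verbose", "scatter,verbose"]

def build_env(affinity, num_threads, base_env=None):
    """Return a copy of the environment with the threading variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["KMP_AFFINITY"] = affinity
    env["OMP_NUM_THREADS"] = str(num_threads)
    return env

def run_benchmark(cmd, affinity, num_threads):
    """Run cmd (a list of command-line arguments) and return its exit code.

    The child process inherits the modified environment; with 'verbose' in
    the affinity string the runtime reports its thread-to-core mapping.
    """
    result = subprocess.run(cmd, env=build_env(affinity, num_threads))
    return result.returncode
```

For example, `run_benchmark(["my_benchmark.exe"], "scatter,verbose", 4)` would run a benchmark executable (the name here is a placeholder) with 4 threads spread across the machine, and comparing the timings across `AFFINITY_SETTINGS` shows which placement suits the platform.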
