Hi, I think I may have run into a weird problem with Intel MKL.
I installed Visual Studio 2013 and Intel Composer XE 2013 SP1 on two machines: a PC running Windows 8.1 with an Intel Core i5-3470 at 3.2 GHz, and a server running Windows Server 2008 R2 SP1 with a Xeon E7330 at 2.4 GHz.
Then I ran the test program dgemm_threading_effect_example from C:\Program Files (x86)\Intel\Composer XE 2013 SP1\Samples\en_US\mkl\tutorials\mkl_mmx_c (I changed the matrix size to 2000*2000*1000).
With one thread, the running time is 1049 ms on the server and 349 ms on the PC. The difference remains as the number of threads increases. I also compared a neural network toolkit with the same settings on both machines and got the same result. This really puzzles me, as I am trying to run experiments with MKL on the server, which supports more threads.
Can anyone explain the difference in performance?
Thanks a lot!
YINGGONG.
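To put the two timings on a common scale, they can be converted to achieved throughput. The sketch below assumes that each reported time covers a single dgemm call on the 2000*2000*1000 problem and that a matrix multiply costs 2*m*n*k floating-point operations; neither assumption is confirmed in the post.

```python
# Achieved dgemm throughput from the timings quoted in the post.
# Assumption: each reported time covers one dgemm call on 2000 x 2000 x 1000.
M, N, K = 2000, 2000, 1000

def dgemm_gflops(m, n, k, milliseconds):
    """Return GFLOP/s for C = A*B, counting 2*m*n*k floating-point operations."""
    flops = 2.0 * m * n * k
    return flops / (milliseconds * 1e-3) / 1e9

timings_ms = {"Xeon E7330": 1049.0, "Core i5-3470": 349.0}
rates = {cpu: dgemm_gflops(M, N, K, ms) for cpu, ms in timings_ms.items()}

for cpu, rate in rates.items():
    print(f"{cpu}: {rate:.1f} GFLOP/s")
print(f"ratio: {rates['Core i5-3470'] / rates['Xeon E7330']:.2f}x")
```

Under these assumptions the server reaches roughly 7.6 GFLOP/s and the PC roughly 22.9 GFLOP/s on one thread, a ratio of about 3.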
This shouldn't be a surprise. Intel Xeon E7330 is a very old product. Released in 2007, it was based on the "Core" architecture, even earlier than the "Nehalem" architecture. Intel Core i5-3470, on the other hand, is very recent. It was released in 2012. It is based on the "Ivy Bridge" architecture, which is 2-3 generations more advanced than the "Core" architecture. Besides, both CPUs have the same number of cores (4). The newer desktop CPU does have a significant performance advantage over the older server CPU.
Thanks for your help.
I tried my own program (not the MKL benchmark test) on another server, whose CPU is an E7-4850 with 8 processors. Its performance is similar to that of the server with the E7330.
Running time by number of threads:
#threads   I5-3470   XEON-E7330
1          32s       113s
2          22s       90s
3          18s       66s
…
15         NA        43s
Even with many threads, the Xeon is still slower than the i5 with a single thread.
My question is: which matters more for MKL performance, the clock frequency or the number of threads?
Thanks a lot!
YINGGONG.
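The scaling in the table can be made explicit by computing speedup over the single-thread run and parallel efficiency (speedup divided by thread count). This is only a sketch over the rows quoted above; the elided rows are left out rather than guessed.

```python
# Running times in seconds, copied from the table above (elided rows omitted).
times = {
    "i5-3470": {1: 32, 2: 22, 3: 18},
    "Xeon E7330": {1: 113, 2: 90, 3: 66, 15: 43},
}

def scaling(runs):
    """Map thread count -> (speedup vs. 1 thread, parallel efficiency)."""
    base = runs[1]
    return {n: (base / t, base / t / n) for n, t in runs.items()}

for cpu, runs in times.items():
    for n, (speedup, eff) in scaling(runs).items():
        print(f"{cpu:11s} {n:2d} threads: speedup {speedup:.2f}, efficiency {eff:.0%}")
```

On these numbers the Xeon gains about 2.6x from 15 threads, which is well below linear scaling and suggests the program is not dominated by well-parallelized work on that machine.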
It really depends on the MKL functions you call and the problem sizes they operate on. If you are calling BLAS level 3 functions on large matrices, then I'd say more threads would help performance more than CPU frequency would. Also, don't forget other important factors, such as the instruction set (for example, AVX has twice the SIMD width of SSE), cache size, etc.
As already said, the Ivy Bridge Core i5 can easily outperform its older counterpart. One reason for the faster performance could be the use of the wider YMMn registers, which in theory makes vectorized code 2x faster. Port 0 and Port 1 of Ivy Bridge can together issue one 256-bit fmul and one 256-bit fadd per cycle, which speeds up execution further.
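A back-of-the-envelope peak estimate shows how far register width and clock alone go toward the roughly 3x single-thread gap reported in the first post (1049 ms vs 349 ms). The sketch assumes both cores can issue one SIMD add and one SIMD multiply per cycle, that the Xeon E7330 uses 128-bit SSE while the i5-3470 uses 256-bit AVX, and that both run at the nominal clocks quoted in the thread (turbo is ignored).

```python
# Theoretical double-precision peak per core = clock (GHz) * flops per cycle.
# Assumptions (not stated in the thread): one SIMD add plus one SIMD multiply
# per cycle on both cores; 128-bit SSE holds 2 doubles, 256-bit AVX holds 4.
DOUBLE_BITS = 64

def peak_gflops_per_core(ghz, simd_bits, ops_per_cycle=2):
    """Return peak GFLOP/s for one core at the given clock and SIMD width."""
    lanes = simd_bits // DOUBLE_BITS
    return ghz * lanes * ops_per_cycle

xeon = peak_gflops_per_core(2.4, 128)   # Xeon E7330, SSE
i5 = peak_gflops_per_core(3.2, 256)     # Core i5-3470, AVX, nominal clock
print(f"Xeon E7330: {xeon:.1f}  i5-3470: {i5:.1f}  ratio: {i5 / xeon:.2f}")
```

Under these assumptions the per-core peak ratio is about 2.7, so most of the measured gap is accounted for before cache and memory differences are considered.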
On a multi-CPU platform you would likely need affinity settings to see full performance. As well as trying
KMP_AFFINITY=compact,verbose
you might try
KMP_AFFINITY=scatter,verbose
or specific settings for certain numbers of threads, such as
KMP_AFFINITY=compact,3,verbose for 2, 4, 8 threads
compact,1 for 16 threads
I suggest examining the verbose output because this CPU was already out of support before development and testing of your compiler version began. Some strange core numberings were used in BIOS back then; possibly the numbering is scrambled so that compact gives good performance up to 4 threads (and that may be all that odd platform was good for).
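One way to try these settings systematically is to build a separate environment for each combination and launch the benchmark once per environment. The sketch below is only illustrative: the affinity strings are copied verbatim from this post, `mkl_env` is a hypothetical helper, and the executable name in the comment is a placeholder. It assumes MKL is threaded through the Intel OpenMP runtime, so that `OMP_NUM_THREADS` and `KMP_AFFINITY` are honored.

```python
import os

def mkl_env(affinity, threads, base=None):
    """Return a copy of the environment with thread count and affinity set."""
    env = dict(os.environ if base is None else base)
    env["KMP_AFFINITY"] = affinity
    env["OMP_NUM_THREADS"] = str(threads)
    return env

# Affinity strings are copied verbatim from the post.
candidates = ["compact,verbose", "scatter,verbose"]
runs = [(a, n, mkl_env(a, n)) for a in candidates for n in (2, 4, 8)]

for affinity, n, env in runs:
    print(n, env["KMP_AFFINITY"])
    # To launch (hypothetical binary name):
    # subprocess.run(["dgemm_example.exe"], env=env)
```

Comparing the verbose output between the compact and scatter runs should show how threads were mapped to cores on the server.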
