Hi,
I have a strange problem where the same program runs slower on an i5-3320M than on a Xeon W3680. The main part of the code is FFT. Could anyone shine some light on me? Thanks!
I've spent a fair amount of time posting about why the new 2-core CPU might match performance of the old 6-core, but my posts all vanished.
1. An MKL optimized for AVX could achieve double the performance per thread of the older CPU.
2. The older CPU may be as dependent (or more so) on 32-byte data alignment, particularly as you are probably attempting more threads.
3. The 6-core Westmere is likely to be more dependent on optimizing NUM_THREADS and affinity. Although MKL should automatically attempt to use 1 thread per core in spite of HyperThreading, I haven't seen any software which deals automatically with the asymmetric arrangement of the Westmere 6-core. Under the usual BIOS arrangement, where cores 0 and 1 share cache access, likewise cores 2 and 3, while cores 4 and 5 don't share paths to cache, try something such as
set OMP_NUM_THREADS=4
set KMP_AFFINITY="proclist=[3,7,9,11],explicit,verbose"
(If you disabled HT, [1,3,4,5] would use the same cores as the above settings for HT enabled.)
The verbose option echoes confirmation of the affinity settings to the screen.
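To illustrate how the two proclists relate, here is a minimal Python sketch. It assumes the numbering implied by the settings above, where with HT enabled core k shows up as logical processors 2k and 2k+1; that is an assumption about the BIOS enumeration, and the function names are made up for illustration. The verbose output from KMP_AFFINITY remains the real check on any given machine.

```python
def proclist_for_cores(cores, ht_enabled=True, sibling=1):
    """Map physical core indices to logical processor IDs.

    Assumption (not guaranteed by any BIOS): with HT enabled, core k is
    exposed as logical processors 2*k and 2*k + 1; `sibling` picks which
    of the two to use. With HT disabled, logical processor k is core k.
    """
    if sibling not in (0, 1):
        raise ValueError("sibling must be 0 or 1")
    if not ht_enabled:
        return list(cores)
    return [2 * k + sibling for k in cores]


def kmp_affinity_value(procs):
    """Format a proclist in the style of the setting quoted above."""
    return "proclist=[{}],explicit,verbose".format(",".join(map(str, procs)))


# One core from each cache-sharing pair (0/1 and 2/3), plus cores 4 and 5.
cores = [1, 3, 4, 5]
print(kmp_affinity_value(proclist_for_cores(cores, ht_enabled=True)))
print(kmp_affinity_value(proclist_for_cores(cores, ht_enabled=False)))
```

Under that assumed numbering, the HT-enabled case reproduces the [3,7,9,11] list and the HT-disabled case gives [1,3,4,5].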
By tacking on 2 additional cores, WSM 6-core typically gains more than 20% performance over the equivalent 4-core CPU at the same clock speed, provided that the affinity requirements are observed. You would have to read the ads carefully to see that 50% gain isn't expected even for applications with good threaded scaling. Many of the customers who took the trouble to understand the situation but didn't want to deal with the special affinities chose to buy the 4-core model.
Among the advantages of the newer CPUs are less arcane affinity requirements (although you might not feel that way about Intel(r) Xeon Phi(tm)).
Perhaps you didn't find the web search references edifying, but you didn't even mention what you didn't understand. I noticed that Google doesn't return nearly as many references as Bing or Yahoo.
Take a look at the comparison of these 2 processors here: http://ark.intel.com/compare/47917,64896
The i5-3320M is a much newer architecture generation than the Xeon W3680. However, the Xeon W3680 has 6 CPU cores while the i5-3320M has only 2, and the cache on the Xeon W3680 is 4x the size of the cache on the i5-3320M. It shouldn't be a surprise that the old server CPU can outperform the newer mobile CPU for some workloads.
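As a rough back-of-the-envelope check, the core counts above can be combined with the up-to-2x per-thread gain attributed to AVX elsewhere in this thread. The sketch below is only that: it assumes perfect scaling across cores and ignores clock speed, cache size, and memory bandwidth, and the function name is hypothetical.

```python
def relative_per_clock_throughput(cores, per_thread_factor):
    """Aggregate throughput per clock, in units of one older-CPU thread.

    Assumes one thread per core and perfect scaling, which real FFT
    workloads will not reach.
    """
    return cores * per_thread_factor


# Xeon W3680: 6 cores, baseline per-thread throughput.
xeon = relative_per_clock_throughput(cores=6, per_thread_factor=1.0)
# i5-3320M: 2 cores, assumed best case of 2x per thread from AVX.
i5 = relative_per_clock_throughput(cores=2, per_thread_factor=2.0)

print(xeon / i5)
```

Even with the best-case AVX factor, the idealized 6-core total stays ahead of the 2-core total per clock, so the observed result is not unreasonable.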
>>>The main part of the code is FFT. Could anyone shine some light on me? Thanks!>>>
As @Zhang Z hinted, the older Xeon probably makes better use of its available cores (more execution units) and its larger cache.
By the way, @Bo Q, your thread's title states that the Xeon runs slower than the i5.
>>> An MKL optimized for AVX could achieve double the performance per thread of the older CPU.>>>
I forgot to mention this in my post.
Thanks, all the replies are very helpful! Also, I found that setting the /QxHost option seems to help as well.
On the Westmere, /QxHost was not always as good as /arch:SSE4.1, but these options will not influence MKL.