We have some dual Intel Xeon quad-core 5450 machines (8 cores total) running CentOS 4.5 for HPC. These machines have InfiniBand, so the network isn't a bottleneck. We ran several tests with a fluid dynamics package using 64 cores (8 machines x 8 cores, or 16 machines x 4 cores) and found that with 4 cores per machine the run time is better (2x faster) than with 8 cores per machine. Does anybody know why this happens?
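A minimal standalone sketch (hypothetical, independent of the CFD package) for checking where each MPI rank actually lands; note that sched_getcpu() needs a newer glibc than the one CentOS 4.5 ships with, so treat that call as an assumption:

```c
/* Hypothetical placement check, not part of either application: each rank
 * reports the host and core it is running on, so an 8x8 versus 16x4 layout
 * can be verified. sched_getcpu() requires a fairly recent glibc; on an
 * older system you may have to read /proc/self/stat instead. */
#define _GNU_SOURCE
#include <mpi.h>
#include <sched.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(host, &len);
    printf("rank %d on %s, core %d\n", rank, host, sched_getcpu());
    MPI_Finalize();
    return 0;
}
```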
Best regards
I agree with Jim that you should try to do some analysis of the code, but it's not as clear to me that the problem you're seeing is due to L2 saturation. How big a fluid volume are you using for your calculations? (Will it even fit in L2?) Is the volume shared across all eight or sixteen machines? If two cores per bus are close to saturation, assuming optimal scheduling of threads to cores, then running four cores per bus might well halve performance.
The Intel Core i7 processor offers an integrated memory controller per socket, the Intel QuickPath Interconnect, and faster memory, all of which might benefit a memory-bound application.
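As a quick sanity check on the "will it even fit in L2" question, here is a back-of-the-envelope sketch (the grid dimensions and field count are made-up placeholders, not anything from the actual solver):

```c
/* Rough working-set estimate for one MPI rank, assuming (hypothetically)
 * a structured nx*ny*nz local grid with `vars` double-precision fields per
 * cell. Compare the result against the 6 MB L2 that two cores share on a
 * Xeon 5450 die (roughly 3 MB per core when both cores are busy). */
#include <stdio.h>

int main(void)
{
    long nx = 100, ny = 100, nz = 100;   /* per-rank grid size (placeholder) */
    int  vars = 8;                       /* e.g. rho, p, u, v, w, ...        */
    double bytes = (double)nx * ny * nz * vars * sizeof(double);

    printf("per-rank working set: %.1f MB (L2 per die: 6 MB)\n",
           bytes / (1024.0 * 1024.0));
    return 0;
}
```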
Thanks for the answer, Mr. Reed.
I can't analyse the code because it is proprietary, and I don't have access to the input data either. I think the volume is shared across all the machines.
We have been running other tests with a second app, but this time the performance with 4 threads is the same as with 6 or 8 threads. Summarizing, we have 2 apps:
App 1 - Fluid Dynamics:
- we use HP-MPI 2.3
- the run with 4 threads is 2x faster than with 8 threads
App 2 - Liquid Migration:
- we use OpenMPI 1.2.6
- the run with 4 threads takes the same time as with 6 or 8 threads
We are now trying to learn how to use VTune to identify the problem. Could it be a saturation problem? A limitation of the quad-core Xeon 5450?
Thanks
A dual-socket Intel Xeon 5400-series machine has two quad-core processors, so a system with two 5450s has 8 cores in total. Run "less /proc/cpuinfo" for the details.
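/proc/cpuinfo will not tell you which cores share an L2 cache; on kernels that expose the sysfs cache topology (which may not be the case on the stock CentOS 4.5 kernel, so treat the paths as an assumption), a sketch along these lines prints the sharing map:

```c
/* Print which logical CPUs share each cache, using the sysfs cache
 * topology files. These files may be missing on older kernels such as
 * the 2.6.9 kernel shipped with CentOS 4.5. */
#include <stdio.h>

int main(void)
{
    char path[256], buf[256];
    int cpu, idx;

    for (cpu = 0; cpu < 8; cpu++) {
        for (idx = 0; idx < 4; idx++) {
            FILE *f;
            snprintf(path, sizeof(path),
                     "/sys/devices/system/cpu/cpu%d/cache/index%d/shared_cpu_map",
                     cpu, idx);
            f = fopen(path, "r");
            if (!f)
                continue;
            if (fgets(buf, sizeof(buf), f))
                printf("cpu%d cache index%d shared_cpu_map: %s", cpu, idx, buf);
            fclose(f);
        }
    }
    return 0;
}
```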
You qouted "i can't analyse the code because it is a proprietary code and i don't have access to the input data too. I think that the volume is shared across all the machines."
Intel VTune does perform analysis for binary code, please check the Getting_Starter Guide. This would give an impression about the flow. To analyze in terms of MPI, you have to install VTune for selective nodes (here from your question, it seems that you have 8 nodes (or m/c.). Atleast, you can check the behaviour for L2 cache and than conclude.
You qouted "could the memory access be the bottleneck", yes certainly old Intel Xeon Processor uses FSB concept but new processor (Nehalem) uses QPI which is better than old concepts of FSB. Apart from QPI, Nehalem uses SMT parallelism which is much better than Intel old processor which uses TLP & ILP but TLP cannot exploit ILP properly in old Intel Xeon processor, but with Nehalem SMT parallelism, TLP can exploit ILP efficiently, moreover Nehalem each core supports 2 threads per core. So, by having all these extra/new parameters, Nehalem should give better performance & scalability in terms of old Intel Xeon 5xxx processors. You can try comparing for better understanding.
You qouted "the performance with 4 threads has the same time that the performance with 6 or 8 threads", try analyzing thread & MPI behaviour using Intel Trace Analyzer & Collector. Intel Trace Analyzer & Collector only supports HP-MPI, Intel-MPI & MVAPICH but not OpenMPI.
You qouted "We are now trying to learn how to use vtune to identify the problem. It could be a saturation problem", do you mean BUS Saturation, try running EBS with "Advance Performance Tuning" events button. Try performing Bus Saturation if it's there with following EBS events -
- CPU_CLK_UNHALTED.CORE
- L2_LINES_IN.SELF.ANY
- INST_RETIRED.ANY
- MEM_LOAD_RETIRED.L2_LINE_MISS
- L2_LINES_IN.SELF.PREFETCH
- BUS_DRDY_CLOCKS.ALL_AGENTS
- BUS_TRANS_ANY.ALL_AGENTS
- CPU_CLK_UNHALTED.BUS
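For reference, the bus-related ratios can be reconstructed from the raw counts roughly as follows. This is a sketch using the commonly documented Core 2 ratio definitions, and the raw counts in it are made up:

```c
/* Approximate reconstruction of two VTune ratios from raw event counts
 * on the Core 2 / Xeon 5400 family:
 *   bus utilization      ~ BUS_TRANS_ANY.ALL_AGENTS * 2 / CPU_CLK_UNHALTED.BUS
 *   data bus utilization ~ BUS_DRDY_CLOCKS.ALL_AGENTS   / CPU_CLK_UNHALTED.BUS
 * Treat the formulas as approximate and the sample counts as hypothetical. */
#include <stdio.h>

int main(void)
{
    double bus_trans_any = 1.2e9;   /* BUS_TRANS_ANY.ALL_AGENTS   (example) */
    double bus_drdy      = 0.9e9;   /* BUS_DRDY_CLOCKS.ALL_AGENTS (example) */
    double clk_bus       = 4.0e9;   /* CPU_CLK_UNHALTED.BUS       (example) */

    printf("bus utilization:      %.1f%%\n", 100.0 * bus_trans_any * 2.0 / clk_bus);
    printf("data bus utilization: %.1f%%\n", 100.0 * bus_drdy / clk_bus);
    return 0;
}
```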
I think by now you should have a fairly clear picture of what you can try.
~BR
Mukkaysh Srivastav
The X5400 has two L2 caches: two of the cores share one L2, and the other two cores share the other. If each core can saturate the L2 cache to which it is attached, you would expect roughly the same speed (1x). However, you are seeing 2x. Two things could account for this: a) cache eviction, b) excessively long interlock contention (for spinlocks or other atomic operations).
For case a) (cache eviction), see if you can reduce the immediate working set (partition your data into smaller pieces). For this processor, with 4 cores running, the immediate data set should not exceed 3 MB (half of either of the two 6 MB L2 caches).
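Purely as an illustration (generic code, not the proprietary solver), a blocked traversal is one way to keep the data touched by each pass under that 3 MB budget:

```c
/* Generic cache-blocking sketch: process the grid in tiles small enough
 * that the data touched by one tile stays well within the ~3 MB per-core
 * share of the 6 MB L2 on a Xeon 5450 die. One 64x64 tile of doubles is
 * about 32 KB per array. */
#include <stddef.h>

#define N     512
#define TILE  64

void smooth(double (*a)[N], double (*b)[N])
{
    for (size_t ii = 1; ii + 1 < N; ii += TILE)
        for (size_t jj = 1; jj + 1 < N; jj += TILE)
            for (size_t i = ii; i < ii + TILE && i + 1 < N; i++)
                for (size_t j = jj; j < jj + TILE && j + 1 < N; j++)
                    b[i][j] = 0.25 * (a[i - 1][j] + a[i + 1][j] +
                                      a[i][j - 1] + a[i][j + 1]);
}
```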
For case b), see if you can reduce the number of interlocked operations (generally by partitioning the work and/or using additional temporaries as reduction variables).
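Again as a generic illustration (sketched with OpenMP, which may or may not be what the solver uses): accumulate into thread-private partials and combine once per thread, instead of issuing one atomic update per element:

```c
/* Illustrative reduction sketch. sum_atomic() performs one interlocked
 * update per element; sum_reduced() gives each thread a private partial
 * sum and combines once per thread. Build with -fopenmp (gcc). */
double sum_atomic(const double *x, int n)
{
    double s = 0.0;
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        #pragma omp atomic
        s += x[i];                 /* heavy interlock traffic */
    }
    return s;
}

double sum_reduced(const double *x, int n)
{
    double s = 0.0;
    #pragma omp parallel for reduction(+:s)
    for (int i = 0; i < n; i++)
        s += x[i];                 /* private partials, one combine per thread */
    return s;
}
```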
Jim Dempsey
Thanks in advance!
Quoting - jimdempseyatthecove
I suggest you examine the cause of the performance issue using TProf and/or VTune and/or PTU.
Check whether your application has an auto-tuning capability whereby it figures out how large the L2 cache is and then tunes itself for that cache size under the assumption of one core per L2 cache. That assumption held for older processor models (and holds again for some of the next-generation processors) but is not the case for your X5400 series, where two cores share each L2. This should be a relatively easy thing to check.
If there are hard-wired parameters, then, if possible, try halving a parameter that would reduce the immediate working set by a factor of 2.
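A sketch of what such an auto-tuning check might look like; the sysconf name is a glibc extension that may not be available or may return 0 on older systems, and the fallback value is simply the known 5450 figure:

```c
/* Sketch of the auto-tuning pitfall described above: an application that
 * sizes its blocks from the raw L2 size will over-commit the cache on a
 * Xeon 5450, where two cores share each 6 MB L2. _SC_LEVEL2_CACHE_SIZE is
 * a glibc extension; treat it as an assumption, not a guaranteed interface. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long l2_bytes = sysconf(_SC_LEVEL2_CACHE_SIZE);
    int  cores_sharing_l2 = 2;           /* Xeon 5400 series: 2 cores per L2 */

    if (l2_bytes <= 0)
        l2_bytes = 6 * 1024 * 1024;      /* fall back to the known 6 MB      */

    printf("block budget per core: %ld KB\n",
           l2_bytes / cores_sharing_l2 / 1024);
    return 0;
}
```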
If this were an FSB problem, my guess is that the application's performance would plateau. That is not what you reported: in your first post you said performance dropped by a factor of 2. This would seem to indicate cache eviction (which in turn would cause higher FSB activity).
While the new processor _may_ speed things along, it would not fix an underlying problem that, if identified and fixed now, would improve the 4-core run performance by a factor of 4 or so (giving you a net 2x over 2 cores). That performance gain, once attained, will carry forward to some degree on the newer architecture.
Jim Dempsey
With HP-MPI, of course, you should be using the -cpu_bind MAP_CPU table to unscramble the core numbering. If you let the affinity float, you won't get repeatable results under VTune, but you also won't see the best or the worst case. Intel MPI has built-in understanding of the core-numbering sequence on Intel CPUs. If OpenMPI expects you to use taskset explicitly, again you will need to specify the core-number table.
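If the launcher cannot do the binding for you, a rank can also pin itself. A minimal sketch follows; the core map in it is a placeholder that has to be filled in from /proc/cpuinfo for your particular boxes:

```c
/* Minimal self-pinning sketch for a rank or thread. The core_map below is
 * a placeholder: on Xeon 5400 boxes the logical-to-physical numbering
 * varies by BIOS, so fill it in from /proc/cpuinfo ("physical id" and
 * "core id"). sched_setaffinity() is the standard Linux call; HP-MPI's
 * -cpu_bind or taskset can achieve the same thing from the launcher side. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int pin_to_core(int rank)
{
    static const int core_map[8] = { 0, 4, 2, 6, 1, 5, 3, 7 };  /* placeholder */
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(core_map[rank % 8], &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return -1;
    }
    return 0;
}
```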
Do all those features (especially the integrated memory controller) really boost i7 performance notably over the previous Core generation? And can you tell me whether all those features are also present in the mobile i7 line? They sound like they could bring a bigger jump over the previous generation than we were used to in the past.
www.pcterritory.net
The Core 2 mobile CPU will continue as the top of that line for a while longer, pending development of 32nm CPUs.
http://www.reghardware.co.uk/2009/04/01/review_cpu_intel_xeon_5500/page6.html
Check this out: the dual W5580s are slower than a single one.
Does anybody have a rational explanation?
I read in the ExtremeTech review that the object of your benchmark is to predict graphics performance of future games on the platforms for which the benchmark is written, not, as you have done, to evaluate CPU performance on a platform introduced after it was written.
I don't know how you could relate the situation on Windows, where you presumably have drivers for the video card and DirectX, to linux, where there is no DirectX, and you are lucky if there is even VESA support for the fancy graphics card. Or, maybe you didn't have the right driver for your card.
If your point is that the Xeon 5500 series wasn't designed to compete in the gaming business with the Core i7, I guess you can define that as "a serious issue."
If you go looking for ways to write a benchmark which runs slower on a dual socket than a single socket machine, you'll surely find them.
Hi everybody,
I ran some tests on the fluid dynamics problem using VTune and got the results below:
Mac/threads | Process | Bus Utilization | Clocks per Instructions Retired - CPI | DTLB Miss Rate | Data Bus Utilization | ITLB Miss Rate | L1 Data Cache Miss Performance Impact | L1 Data Cache Miss Rate | L2 Cache Demand Miss Rate | L2 Cache Miss Rate | L2 Modified Lines Eviction Rate | Locked Operations Impact | TLB miss penalty | CPU_CLK_UNHALTED.CORE % | INST_RETIRED.ANY % | DTLB_MISSES.ANY % | ITLB_MISS_RETIRED % | PAGE_WALKS.CYCLES % | BUS_TRANS_ANY.ALL_AGENTS % | CPU_CLK_UNHALTED.BUS % | BUS_DRDY_CLOCKS.ALL_AGENTS % | L1D_REPL % | L2_LINES_IN.SELF.DEMAND % | L2_LINES_IN.SELF.ANY % | L2_M_LINES_OUT.SELF.ANY % | L1D_CACHE_LOCK.MESI % | L1D_CACHE_LOCK_DURATION % |
2Machines 6threads(3perM) | solver-hpmpi.exe | 45.56% | 0.832 | 0.000 | 19.64% | 0.000 | 13.83% | 0.014 | 0.001 | 0.006 | 0.002 | 0.11% | 0.55% | 96.86% | 99.11% | 93.50% | 73.84% | 92.57% | 39.06% | 96.89% | 41.40% | 99.49% | 97.18% | 99.43% | 99.56% | 63.92% | 42.37% |
2Mac 8threads(4perM) | solver-hpmpi.exe | 51.02% | 0.843 | 0.000 | 20.64% | 0.000 | 12.42% | 0.013 | 0.001 | 0.005 | 0.002 | 0.11% | 0.55% | 97.79% | 99.32% | 94.19% | 75.00% | 93.97% | 49.89% | 97.82% | 49.84% | 99.60% | 98.06% | 99.41% | 99.61% | 71.27% | 51.38% |
2Mac 12threads(6perM) | solver-hpmpi.exe | 61.96% | 0.876 | 0.000 | 23.36% | 0.000 | 8.95% | 0.010 | 0.001 | 0.004 | 0.002 | 0.10% | 0.63% | 98.38% | 99.28% | 92.44% | 39.29% | 94.37% | 75.10% | 98.73% | 74.83% | 99.64% | 99.54% | 99.58% | 99.67% | 65.74% | 61.45% |
2M 16threads(8perM) | solver-hpmpi.exe | 64.79% | 0.960 | 0.000 | 25.85% | 0.000 | 7.13% | 0.009 | 0.001 | 0.004 | 0.001 | 0.09% | 0.67% | 99.30% | 99.63% | 96.13% | 77.57% | 94.32% | 99.38% | 99.34% | 99.39% | 99.64% | 98.94% | 99.49% | 99.49% | 83.01% | 71.16% |
2M 24threads(12perM) | solver-hpmpi.exe | 2.81% | 1.591 | 0.004 | 1.95% | 0.003 | 0.93% | 0.002 | 0.000 | 0.000 | 0.000 | 1.94% | 4.69% | 99.51% | 99.38% | 99.77% | 99.97% | 99.80% | 95.37% | 99.55% | 97.25% | 97.25% | 87.96% | 95.43% | 97.46% | 99.22% | 94.78% |
1M 3threads | solver-hpmpi.exe | 49.46% | 0.829 | 0.000 | 21.01% | 0.000 | 13.73% | 0.014 | 0.001 | 0.006 | 0.002 | 0.10% | 0.59% | 96.97% | 99.16% | 93.98% | 70.59% | 93.31% | 38.99% | 97.06% | 41.48% | 99.50% | 98.21% | 99.44% | 99.63% | 63.35% | 43.33% |
1M 4threads | solver-hpmpi.exe | 56.72% | 0.910 | 0.000 | 22.77% | 0.000 | 12.77% | 0.015 | 0.001 | 0.006 | 0.002 | 0.06% | 0.60% | 97.85% | 99.32% | 93.46% | 75.33% | 95.44% | 49.86% | 49.86% | 97.79% | 99.60% | 97.85% | 99.50% | 99.46% | 48.03% | 42.09% |
1M 6threads | solver-hpmpi.exe | 65.39% | 0.976 | 0.000 | 26.33% | 0.000 | 8.59% | 0.010 | 0.001 | 0.005 | 0.002 | 0.06% | 0.66% | 98.96% | 99.54% | 96.77% | 75.82% | 95.74% | 75.19% | 98.93% | 75.09% | 99.63% | 99.08% | 99.60% | 99.59% | 59.28% | 64.92% |
1M 8threads | solver-hpmpi.exe | 75.74% | 1.254 | 0.000 | 29.86% | 0.000 | 7.65% | 0.012 | 0.001 | 0.005 | 0.002 | 0.33% | 0.79% | 99.44% | 99.57% | 97.07% | 82.87% | 95.98% | 99.56% | 99.46% | 99.58% | 99.71% | 99.32% | 99.71% | 99.64% | 97.13% | 93.32% |
1M 12threads | solver-hpmpi.exe | 20.77% | 1.628 | 0.005 | 1.19% | 0.003 | 2.65% | 0.005 | 0.000 | 0.003 | 0.001 | 3.96% | 3.51% | 99.60% | 99.40% | 99.79% | 99.95% | 99.54% | 99.27% | 99.58% | 96.75% | 56.73% | 27.64% | 99.16% | 99.40% | 95.11% | 77.19% |
Analysing the results, I think there is no cache eviction nor excessive interlock contention, and it seems the bus is not overloaded either.
I cannot figure out why the bus utilization decreases when I use more than 8 threads per machine.
Can anyone explain these numbers?
Thanks!