Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Is the Intel Xeon Quad Core a real quad-core processor?

tiagosazevedo
Beginner
Hi,

We have some dual Intel Xeon quad-core 5450 machines (8 cores total) running CentOS 4.5 for HPC. These machines have InfiniBand, so the network isn't a bottleneck. We ran several tests with fluid dynamics software using 64 cores (8 machines with 8 cores each, or 16 machines with 4 cores each) and found that with 4 cores per machine the run time is better (2x faster) than with 8 cores per machine. Does anybody know why this happens?
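For reference, here is a minimal MPI placement check (my sketch, not part of the CFD application) that prints which host each rank lands on, confirming whether a 64-rank job is really distributed 8x8 or 16x4:

```c
#include <stdio.h>
#include <mpi.h>

/* Each rank reports the host it runs on, so the 8x8 vs. 16x4
   distribution of a 64-rank job can be verified directly. */
int main(int argc, char **argv)
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);

    printf("rank %d of %d on %s\n", rank, size, host);

    MPI_Finalize();
    return 0;
}
```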

Best regards
14 Replies
jimdempseyatthecove
Honored Contributor III

The X5400 series has two L2 caches per processor: two of the cores share one L2, and the other two cores share the other. If each core could saturate the L2 cache to which it is attached, you would expect the same speed (1x). However, you are experiencing a 2x difference. Two things could account for this: a) cache eviction, or b) excessively long interlock contention (from spinlocks or other atomic operations).

For case a) (cache eviction), see if you can reduce the immediate working set (partition your data into smaller pieces). For this processor (with 4 cores running), the immediate working set should not exceed 3MB (half of either of the two 6MB L2 caches).
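As an illustration only (a minimal sketch; the array, block size, and process_element are hypothetical, not taken from the application in question): partition a large array into blocks that stay within the 3MB budget, and make all passes over a block while it is cache-resident.

```c
#include <stddef.h>

/* Hypothetical working-set partitioning: block size chosen so one
   block of doubles stays within a 3MB share of a 6MB shared L2. */
#define BLOCK_BYTES (3u * 1024 * 1024)
#define BLOCK_ELEMS (BLOCK_BYTES / sizeof(double))

static void process_element(double *x) { *x *= 2.0; } /* placeholder work */

void process_blocked(double *data, size_t n, int passes)
{
    for (size_t base = 0; base < n; base += BLOCK_ELEMS) {
        size_t end = base + BLOCK_ELEMS < n ? base + BLOCK_ELEMS : n;
        /* Run every pass over this block while it is cache-resident,
           instead of streaming the whole array once per pass. */
        for (int p = 0; p < passes; ++p)
            for (size_t i = base; i < end; ++i)
                process_element(&data[i]);
    }
}
```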

For case b), see if you can reduce the number of interlocked operations (generally by partitioning the work and/or using additional temporaries as reduction variables).
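A minimal sketch of that second point (OpenMP is used here purely for illustration; the application's actual threading model is unknown to me): replacing an atomic update in the inner loop with a per-thread reduction temporary.

```c
#include <omp.h>

/* Contended version: every iteration performs an interlocked add. */
double sum_atomic(const double *a, int n)
{
    double total = 0.0;
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        #pragma omp atomic
        total += a[i];
    }
    return total;
}

/* Reduction version: each thread accumulates into a private temporary,
   and the partial sums are combined once at the end. */
double sum_reduction(const double *a, int n)
{
    double total = 0.0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; ++i)
        total += a[i];
    return total;
}
```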

Jim Dempsey
tiagosazevedo
Beginner
Hello Jim, thanks for the answer. We'll try to reconfigure our job and launch it again. Meanwhile, I have some more questions: could memory access be the bottleneck? I have read that the motherboard has only two 1333MHz FSBs, one per processor; could this degrade performance? Could the new Intel Nehalem microarchitecture improve performance?

Thanks in advance!

jimdempseyatthecove
Honored Contributor III

I suggest you examine the cause of the performance issue using TProf and/or VTune and/or PTU.

Check to see whether your application has an auto-tuning capability whereby it determines how large the L2 cache is and then tunes itself for that cache size under the assumption of one core per L2 cache. That assumption held for older processor models (and holds for some of the next-generation processors as well), but it is not the case for your X5400 series. This should be a relatively easy thing to check, e.g. with a probe like the one sketched below.
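A hedged sketch of such a probe (Linux/glibc only; note that sysconf reports the L2 size but says nothing about how many cores share it, which is exactly the assumption an auto-tuner can get wrong):

```c
#include <stdio.h>
#include <unistd.h>

/* Report L2 cache size and online core count (Linux/glibc).
   The reported L2 size does not reveal that two cores share
   each 6MB cache on the X5400 series. */
int main(void)
{
    long l2_bytes = sysconf(_SC_LEVEL2_CACHE_SIZE);
    long cores    = sysconf(_SC_NPROCESSORS_ONLN);

    printf("L2 cache: %ld bytes, online cores: %ld\n", l2_bytes, cores);
    return 0;
}
```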

If there are hard-wired parameters then, if possible, try halving a parameter that would reduce the immediate working set by a factor of 2.

If this were an FSB problem, my guess is that performance would plateau. That is not what you reported: in your first post you said, in effect, that performance dropped by a factor of 2. This would seem to indicate cache eviction (which would in turn cause higher FSB activity).

While the new processor _may_ speed things along, it would not fix an underlying problem that, if identified and fixed now, would improve the 4-core run performance by a factor of 4 or so (giving you a net 2x over 2 cores). This performance gain, once attained, will carry forward to some degree on the newer architecture.

Jim Dempsey
robert-reed
Valued Contributor II
I agree with Jim that you should try to do some analysis of the code, but it's not as clear to me that the problem you're seeing is due to L2 saturation. How big a fluid volume are you using for your calculations? (Will it even fit in L2?) Is the volume shared across all eight or sixteen machines? If two cores per bus are close to saturation, assuming optimal scheduling of threads to cores, then running four cores per bus might well halve performance.

The Intel Core i7 processor offers an integrated memory controller per socket, the Intel QuickPath Interconnect, and faster memory, all of which might benefit a memory-bound application.

tiagosazevedo
Beginner
Thanks for the answer, Mr. Reed.

I can't analyse the code because it is proprietary, and I don't have access to the input data either. I think the volume is shared across all the machines.

We have been doing other tests with a second app, but this time the run time with 4 threads is the same as with 6 or 8 threads. Summarizing, we have 2 apps:


App 1 - Fluid Dynamics:

- we use HP-MPI 2.3
- the run with 4 threads is 2x faster than with 8 threads


App 2 - Liquid Migration:

- we use OpenMPI 1.2.6
- the run with 4 threads takes the same time as with 6 or 8 threads


We are now trying to learn how to use VTune to identify the problem. Could it be a saturation problem? A limitation of the Core 2-based Xeon 5450?


Thanks
TimP
Honored Contributor III
In line with the suggestion made earlier that your application is cache-hungry, VTune could verify that hypothesis by showing a lower cache hit rate as you increase the number of threads. You could call this a limitation of the Xeon 5400 series compared with the Xeon 5500 series, as the newer CPU supports 3 memory channels and can satisfy out-of-cache accesses with correspondingly greater speed.
With HP-MPI, of course, you should be using the -cpu_bind:MAPcpu table to unscramble the core numbering. If you let affinity float, you won't get repeatable results under VTune, though you also won't see either the best or the worst case. Intel MPI has built-in understanding of the core-numbering sequence for Intel CPUs. If OpenMPI expects you to use taskset explicitly, you will again need to specify the core-number table.
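For illustration (a hedged sketch of what such binding does underneath, not HP-MPI's implementation): pinning a process to one core with the Linux affinity API, which is also what taskset uses; a launcher would pick the core per rank from the machine's numbering table.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Pin the calling process to a single core (Linux-specific). */
static int pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return sched_setaffinity(0, sizeof(set), &set); /* 0 = this process */
}

int main(void)
{
    if (pin_to_core(2) != 0)        /* core 2 is an arbitrary example */
        perror("sched_setaffinity");
    else
        printf("pinned to core 2\n");
    return 0;
}
```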
srimks
New Contributor II

The Intel Xeon 5450 package contains two dies with two cores each, i.e., four cores per processor; your dual-socket machines therefore have 8 cores in total. Run "less /proc/cpuinfo" to see the details.
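A small helper for that check (my sketch; it counts the standard "processor" and "physical id" fields that x86 Linux kernels print in /proc/cpuinfo):

```c
#include <stdio.h>
#include <string.h>

/* Count logical CPUs and distinct physical packages from /proc/cpuinfo. */
int main(void)
{
    FILE *f = fopen("/proc/cpuinfo", "r");
    if (!f) { perror("/proc/cpuinfo"); return 1; }

    char line[256];
    int logical = 0, packages = 0, seen[64] = {0};

    while (fgets(line, sizeof line, f)) {
        if (strncmp(line, "processor", 9) == 0) {
            logical++;
        } else if (strncmp(line, "physical id", 11) == 0) {
            const char *colon = strchr(line, ':');
            int id;
            if (colon && sscanf(colon + 1, "%d", &id) == 1 &&
                id >= 0 && id < 64 && !seen[id]) {
                seen[id] = 1;
                packages++;
            }
        }
    }
    fclose(f);

    printf("%d logical CPUs across %d physical packages\n", logical, packages);
    return 0;
}
```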

You wrote: "I can't analyse the code because it is proprietary, and I don't have access to the input data either. I think the volume is shared across all the machines."

Intel VTune can analyze binary code; please check the Getting Started Guide. This would give an impression of the program flow. To analyze in terms of MPI, you have to install VTune on selected nodes (from your question, it seems you have 8 nodes/machines). At a minimum, you can check the L2 cache behaviour and then draw conclusions.

You asked whether memory access could be the bottleneck. Yes: the old Intel Xeon processors use an FSB, while the new processor (Nehalem) uses QPI, which is better than the old FSB design. Apart from QPI, Nehalem adds SMT: each core supports 2 hardware threads, so thread-level parallelism can exploit each core's instruction-level parallelism more efficiently than on the older Xeons, which have no SMT. With all of these new features, Nehalem should give better performance and scalability than the old Intel Xeon 5xxx processors. You can try comparing them for better understanding.

Regarding "the run time with 4 threads is the same as with 6 or 8 threads": try analyzing thread and MPI behaviour using the Intel Trace Analyzer & Collector. Note that it supports HP-MPI, Intel MPI and MVAPICH, but not OpenMPI.

Regarding "We are now trying to learn how to use VTune to identify the problem. Could it be a saturation problem?": if you mean bus saturation, try running event-based sampling (EBS) with the "Advanced Performance Tuning" events. You can check for bus saturation using the following EBS events (a utilization ratio built from them is sketched after the list):

  1. CPU_CLK_UNHALTED.CORE
  2. L2_LINES_IN.SELF.ANY
  3. INST_RETIRED.ANY
  4. MEM_LOAD_RETIRED.L2_LINE_MISS
  5. L2_LINES_IN.SELF.PREFETCH
  6. BUS_DRDY_CLOCKS.ALL_AGENTS
  7. BUS_TRANS_ANY.ALL_AGENTS
  8. CPU_CLK_UNHALTED.BUS
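For reference, the bus-utilization ratio VTune derives from these events is, as best I recall (treat the exact form as an assumption to verify against your VTune version's documentation):

$$\text{Bus Utilization} = \frac{\mathrm{BUS\_TRANS\_ANY.ALL\_AGENTS} \times 2}{\mathrm{CPU\_CLK\_UNHALTED.BUS}} \times 100\%$$

Values approaching 100% indicate a saturated front-side bus.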

I think by now you should have a fairly clear picture of what you can try.

~BR

Mukkaysh Srivastav

smallrabbit
Beginner
Mr Reed.
Do all those features (especially the integrated memory controller) really boost Core i7 performance notably over the previous Core generation? And can you tell me whether all those features are also present in the mobile i7 line? These features sound like they could bring a bigger performance gain over the previous generation than we were used to in the past.

www.pcterritory.net


TimP
Honored Contributor III
3 channels of DDR3-1333 (or even 1066) with unified cache is quite a boost over 1 channel of DDR2 with split cache.
The Core 2 mobile CPU will continue as top of that line for a while longer, pending development of 32nm CPUs.
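As rough peak-bandwidth arithmetic (my numbers, assuming 64-bit channels and DDR2-667 for the older single channel, both of which are assumptions here):

$$3 \times 1333\,\mathrm{MT/s} \times 8\,\mathrm{B} \approx 32\,\mathrm{GB/s} \quad \text{vs.} \quad 1 \times 667\,\mathrm{MT/s} \times 8\,\mathrm{B} \approx 5.3\,\mathrm{GB/s},$$

roughly a 6x gap in theoretical peak per socket.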
robert-reed
Valued Contributor II
Your mileage may vary, depending on the nature of your algorithms, but the integrated memory controller should cut the number of propagation delays involved in accessing memory, so memory operations on the Intel Core i7 should run faster, as a lot of users have already discovered. That should be true of the mobile version as well, though different incarnations of the architecture may vary in the number of Intel QPI ports, etc., that are available.
zbestzeus
Beginner
I think there is a serious issue with the new quad-core Xeon design. I've got two of the new W5580 Xeons, and no matter what I do they are half the speed of a Core i7 965 I have. The i7 machine has less RAM, a less capable motherboard, and the same graphics card. We also have two of the 5400s and they are very slow. To put it in perspective, when I run 3DMark Vantage the Core i7 gets a 40,000 score; the dual W5580s get 15,000, and that's strictly on the CPU benchmark with the same graphics card. I thought maybe it was an error in 3DMark, so I tried SPECviewperf 10, and the Core i7 got double the performance on every test; it even handled threads better than the 8-core machine. I tried enabling and disabling Hyper-Threading: no difference. I tried both Windows XP 64 and Vista 64: no difference. I was thinking it was a Windows issue, but apparently it happens on Linux too. These CPUs look like powerhouses on paper, but they absolutely suck when two are paired together. From what I see, a single one works perfectly fine.

http://www.reghardware.co.uk/2009/04/01/review_cpu_intel_xeon_5500/page6.html

Check this out: the dual W5580s are slower than a single one.

Does anybody have a rational explanation?
TimP
Honored Contributor III
Was the benchmark written to depend on the BIOS core-numbering order of the Xeon 5400 series dual quad-cores, so that it uses the worst possible sequence on the newer one?
I read in the ExtremeTech review that the object of your benchmark is to predict the graphics performance of future games on the platforms for which the benchmark was written, not, as you have done, to evaluate CPU performance on a platform introduced after it was written.
I don't know how you can relate the situation on Windows, where you presumably have drivers for the video card and DirectX, to Linux, where there is no DirectX and you are lucky if there is even VESA support for a fancy graphics card. Or maybe you didn't have the right driver for your card.
If your point is that the Xeon 5500 series wasn't designed to compete in the gaming business with the Core i7, I guess you can define that as "a serious issue."
If you go looking for ways to write a benchmark which runs slower on a dual socket than a single socket machine, you'll surely find them.
zbestzeus
Beginner
Not sure you entirely understood my post. This isn't a driver issue; everything is up to date, and I'm comparing across the same OS. The only differences are the CPU and the motherboard; the RAM, graphics and HDD are of the same performance. Regardless of whether the CPU is an i7 or a Xeon, they should perform about the same in a game, since they use the same design; the Xeon should in fact be better than the i7, since it is a more refined part. Granted, the 3DMark benchmark utility does not strictly test the CPU, but if both computers use the same graphics card and RAM and the only difference is the CPU, the obvious culprit for the difference should be the CPU. And we're not talking a small difference; we're talking a HUGE difference.

Now, my second test, which you didn't look at, was SPECviewperf 10; take a look at their website: http://www.spec.org/benchmarks.html. This application is the standard for testing how well a workstation performs for video editing. The Core i7 computer had double the performance in every test. A dual Xeon W5580 obviously has more processing power than a single Core i7, but the fact remains that in all tests performed it is slower. We have taken out the extra processor, so it runs a single Xeon W5580, and the computer performs slightly better in benchmarks. This makes no logical sense; it seems like there is an error somewhere in the design when using dual processors.

I would assume the only people who need dual CPUs are people doing video editing or running servers. The person who started this thread said that when he used 16 four-core machines the performance was double that of 8 eight-core machines. Here we have the same problem I am experiencing: when using 8 cores, it's as if only 4 cores are being used. If neither video editors nor servers are able to use the dual CPUs' performance, I feel this is a SERIOUS problem.
tiagosazevedo
Beginner

Hi everybody,

I ran some tests on the fluid dynamics problem using VTune, and I got the results below:

All rows are for the process solver-hpmpi.exe. Derived ratios:

| Config (machines/threads) | Bus Utilization | CPI | DTLB Miss Rate | Data Bus Utilization | ITLB Miss Rate | L1D Miss Perf. Impact | L1D Miss Rate | L2 Demand Miss Rate | L2 Miss Rate | L2 Modified Lines Eviction Rate | Locked Ops Impact | TLB Miss Penalty |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2M, 6 threads (3/M) | 45.56% | 0.832 | 0.000 | 19.64% | 0.000 | 13.83% | 0.014 | 0.001 | 0.006 | 0.002 | 0.11% | 0.55% |
| 2M, 8 threads (4/M) | 51.02% | 0.843 | 0.000 | 20.64% | 0.000 | 12.42% | 0.013 | 0.001 | 0.005 | 0.002 | 0.11% | 0.55% |
| 2M, 12 threads (6/M) | 61.96% | 0.876 | 0.000 | 23.36% | 0.000 | 8.95% | 0.010 | 0.001 | 0.004 | 0.002 | 0.10% | 0.63% |
| 2M, 16 threads (8/M) | 64.79% | 0.960 | 0.000 | 25.85% | 0.000 | 7.13% | 0.009 | 0.001 | 0.004 | 0.001 | 0.09% | 0.67% |
| 2M, 24 threads (12/M) | 2.81% | 1.591 | 0.004 | 1.95% | 0.003 | 0.93% | 0.002 | 0.000 | 0.000 | 0.000 | 1.94% | 4.69% |
| 1M, 3 threads | 49.46% | 0.829 | 0.000 | 21.01% | 0.000 | 13.73% | 0.014 | 0.001 | 0.006 | 0.002 | 0.10% | 0.59% |
| 1M, 4 threads | 56.72% | 0.910 | 0.000 | 22.77% | 0.000 | 12.77% | 0.015 | 0.001 | 0.006 | 0.002 | 0.06% | 0.60% |
| 1M, 6 threads | 65.39% | 0.976 | 0.000 | 26.33% | 0.000 | 8.59% | 0.010 | 0.001 | 0.005 | 0.002 | 0.06% | 0.66% |
| 1M, 8 threads | 75.74% | 1.254 | 0.000 | 29.86% | 0.000 | 7.65% | 0.012 | 0.001 | 0.005 | 0.002 | 0.33% | 0.79% |
| 1M, 12 threads | 20.77% | 1.628 | 0.005 | 1.19% | 0.003 | 2.65% | 0.005 | 0.000 | 0.003 | 0.001 | 3.96% | 3.51% |

Raw event sample percentages for the same runs:

| Config | CPU_CLK_UNHALTED.CORE | INST_RETIRED.ANY | DTLB_MISSES.ANY | ITLB_MISS_RETIRED | PAGE_WALKS.CYCLES | BUS_TRANS_ANY.ALL_AGENTS | CPU_CLK_UNHALTED.BUS | BUS_DRDY_CLOCKS.ALL_AGENTS | L1D_REPL | L2_LINES_IN.SELF.DEMAND | L2_LINES_IN.SELF.ANY | L2_M_LINES_OUT.SELF.ANY | L1D_CACHE_LOCK.MESI | L1D_CACHE_LOCK_DURATION |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2M, 6 threads (3/M) | 96.86% | 99.11% | 93.50% | 73.84% | 92.57% | 39.06% | 96.89% | 41.40% | 99.49% | 97.18% | 99.43% | 99.56% | 63.92% | 42.37% |
| 2M, 8 threads (4/M) | 97.79% | 99.32% | 94.19% | 75.00% | 93.97% | 49.89% | 97.82% | 49.84% | 99.60% | 98.06% | 99.41% | 99.61% | 71.27% | 51.38% |
| 2M, 12 threads (6/M) | 98.38% | 99.28% | 92.44% | 39.29% | 94.37% | 75.10% | 98.73% | 74.83% | 99.64% | 99.54% | 99.58% | 99.67% | 65.74% | 61.45% |
| 2M, 16 threads (8/M) | 99.30% | 99.63% | 96.13% | 77.57% | 94.32% | 99.38% | 99.34% | 99.39% | 99.64% | 98.94% | 99.49% | 99.49% | 83.01% | 71.16% |
| 2M, 24 threads (12/M) | 99.51% | 99.38% | 99.77% | 99.97% | 99.80% | 95.37% | 99.55% | 97.25% | 97.25% | 87.96% | 95.43% | 97.46% | 99.22% | 94.78% |
| 1M, 3 threads | 96.97% | 99.16% | 93.98% | 70.59% | 93.31% | 38.99% | 97.06% | 41.48% | 99.50% | 98.21% | 99.44% | 99.63% | 63.35% | 43.33% |
| 1M, 4 threads | 97.85% | 99.32% | 93.46% | 75.33% | 95.44% | 49.86% | 49.86% | 97.79% | 99.60% | 97.85% | 99.50% | 99.46% | 48.03% | 42.09% |
| 1M, 6 threads | 98.96% | 99.54% | 96.77% | 75.82% | 95.74% | 75.19% | 98.93% | 75.09% | 99.63% | 99.08% | 99.60% | 99.59% | 59.28% | 64.92% |
| 1M, 8 threads | 99.44% | 99.57% | 97.07% | 82.87% | 95.98% | 99.56% | 99.46% | 99.58% | 99.71% | 99.32% | 99.71% | 99.64% | 97.13% | 93.32% |
| 1M, 12 threads | 99.60% | 99.40% | 99.79% | 99.95% | 99.54% | 99.27% | 99.58% | 96.75% | 56.73% | 27.64% | 99.16% | 99.40% | 95.11% | 77.19% |

Analysing the results, I think there is no cache eviction nor excessive interlock contention, and it seems the bus is not overloaded either.
I cannot figure out why the bus utilization decreases when I use more than 8 threads per machine.

Can anyone explain these numbers ?

Thanks!

