Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Is Intel Xeon Quad Core a real quad core processor?

tiagosazevedo
Beginner
Hi,

We have some dual Intel Xeon quad-core 5450 machines (8 cores per machine) running CentOS 4.5 for HPC. These machines have InfiniBand, so the network isn't a bottleneck. We ran several tests with fluid dynamics software using 64 cores (8 machines with 8 cores each, or 16 machines with 4 cores each) and found that with 4 cores per machine the run time is better (2x faster) than with 8 cores per machine. Does anybody know why this happens?

Best regards
1 Solution
srimks
New Contributor II
Quoting - tiagosazevedo

I agree with Jim that you should try to do some analysis of the code, but it's not as clear to me that the problem you're seeing is due to L2 saturation. How big a fluid volume are you using for your calculations? (Will it even fit in L2?) Is the volume shared across all eight or sixteen machines? If two cores per bus are close to saturation, assuming optimal scheduling of threads to cores, then running four cores per bus might well halve performance.

The Intel Core™ i7 processor offers an integrated memory controller per socket, the Intel QuickPath Interconnect, and faster memory, all of which might benefit a memory-bound application.


Thanks for the answer Mr. Reed,

I can't analyse the code because it is proprietary, and I don't have access to the input data either. I think the volume is shared across all the machines.

We have been running other tests with a second app, but this time the performance with 4 threads matches the performance with 6 or 8 threads. Summarizing, we have 2 apps:


App 1 - Fluid Dynamics:

- we use HP-MPI 2.3
- the run with 4 threads is 2x faster than with 8 threads


App 2 - Liquid Migration:

- we use OpenMPI 1.2.6
- the run time with 4 threads is the same as with 6 or 8 threads


We are now trying to learn how to use VTune to identify the problem. Could it be a saturation problem? A limitation of the Core 2 based Xeon 5450?


Thanks

An Intel Xeon 5400 series processor has 2 dies with 2 cores per die, i.e. 4 cores per package, so your dual-socket 5450 machine has 8 cores in total. Run "less /proc/cpuinfo" to see the details.

You wrote: "I can't analyse the code because it is proprietary, and I don't have access to the input data either. I think the volume is shared across all the machines."

Intel VTune can analyze binary code; please check the Getting Started Guide. This would give you an impression of the program flow. To analyze the MPI side, you have to install VTune on selected nodes (from your question, it seems you have 8 nodes/machines). At the least, you can check the L2 cache behaviour and then draw conclusions.

You wrote: "could memory access be the bottleneck?" Yes: the older Intel Xeon processors use an FSB, while the new processor (Nehalem) uses QPI, which is better than the old FSB approach. On top of QPI, Nehalem adds SMT, with each core supporting 2 threads, so TLP can exploit ILP efficiently; on the older Intel Xeon processors, TLP could not exploit ILP well. With all these new features, Nehalem should give better performance and scalability than the older Intel Xeon 5xxx processors. You can try comparing the two for better understanding.

You wrote: "the performance with 4 threads matches the performance with 6 or 8 threads". Try analyzing the thread & MPI behaviour using the Intel Trace Analyzer & Collector. Note that it supports HP-MPI, Intel MPI & MVAPICH, but not OpenMPI.

You wrote: "We are now trying to learn how to use VTune to identify the problem. Could it be a saturation problem?" If you mean bus saturation, try running EBS (event-based sampling) with the "Advanced Performance Tuning" events enabled and check for bus saturation with the following EBS events:

  1. CPU_CLK_UNHALTED.CORE
  2. L2_LINES_IN.SELF.ANY
  3. INST_RETIRED.ANY
  4. MEM_LOAD_RETIRED.L2_LINE_MISS
  5. L2_LINES_IN.SELF.PREFETCH
  6. BUS_DRDY_CLOCKS.ALL_AGENTS
  7. BUS_TRANS_ANY.ALL_AGENTS
  8. CPU_CLK_UNHALTED.BUS

I think by now you should have a fairly clear picture of what you can try.

~BR

Mukkaysh Srivastav

View solution in original post

jimdempseyatthecove
Honored Contributor III

The X5400 has two L2 caches: two of the cores share one L2, and the other two cores share the other. If each core could saturate the L2 cache to which it is attached, you would expect 1x (the same speed). However, you are seeing 2x. There are two things that could account for this: a) cache eviction, b) excessively long interlock contention (for spinlocks or other atomic operations).

For case a) (cache eviction), see if you can reduce the immediate working set (partition your data into smaller pieces). For this processor (with 4 cores running) the immediate data set should not exceed 3MB (half of either of the two 6MB L2 caches).
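For illustration (the solver's kernels and data layout aren't visible here, so the kernel and tile size below are assumptions), a blocked traversal along these lines keeps each pass's working set inside that 3MB budget:

#include <stddef.h>

/* Illustrative cache-blocking sketch. TILE is ~3MB of doubles: half of one
   6MB L2, i.e. the per-core budget when two cores share that cache. */
enum { TILE = (3 * 1024 * 1024) / sizeof(double) };

static void smooth(double *x, size_t lo, size_t hi) /* stand-in kernel */
{
    for (size_t i = lo + 1; i + 1 < hi; ++i)
        x[i] = 0.25 * x[i - 1] + 0.5 * x[i] + 0.25 * x[i + 1];
}

void two_passes_blocked(double *x, size_t n)
{
    for (size_t lo = 0; lo < n; lo += TILE) {
        size_t hi = lo + TILE < n ? lo + TILE : n;
        smooth(x, lo, hi);  /* tile is pulled into L2 here...           */
        smooth(x, lo, hi);  /* ...and reused while it is still resident */
    }
}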

For case b), see if you can reduce the number of interlocked operations (generally by partitioning the work and/or using additional temporaries as reduction variables).
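A sketch of the reduction-variable idea (illustrative OpenMP C, not taken from the application; compile with the compiler's OpenMP switch): each thread accumulates into a private temporary, so the interlocked combine happens once per thread instead of once per element.

/* Contrast: one shared, interlocked accumulator vs. per-thread temporaries. */
double sum_interlocked(const double *x, long n)
{
    double total = 0.0;
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        #pragma omp atomic          /* every element contends on 'total' */
        total += x[i];
    }
    return total;
}

double sum_reduced(const double *x, long n)
{
    double total = 0.0;
    #pragma omp parallel for reduction(+:total) /* private partial sums */
    for (long i = 0; i < n; ++i)
        total += x[i];
    return total;                   /* one combine per thread at the end */
}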

Jim Dempsey
tiagosazevedo
Beginner
Hello Jim, thanks for the answer. We'll try to reconfigure our job and launch it again. Meanwhile I have more questions: could memory access be the bottleneck? I have read that the motherboard has only two 1333 MHz FSBs, one for each processor; could this degrade performance? Could the new Intel Nehalem microarchitecture improve performance?

Thanks in advance!

Quoting - jimdempseyatthecove

The X5400 has two L2 caches: two of the cores share one L2, and the other two cores share the other. If each core could saturate the L2 cache to which it is attached, you would expect 1x (the same speed). However, you are seeing 2x. There are two things that could account for this: a) cache eviction, b) excessively long interlock contention (for spinlocks or other atomic operations).

For case a) (cache eviction), see if you can reduce the immediate working set (partition your data into smaller pieces). For this processor (with 4 cores running) the immediate data set should not exceed 3MB (half of either of the two 6MB L2 caches).

For case b), see if you can reduce the number of interlocked operations (generally by partitioning the work and/or using additional temporaries as reduction variables).

Jim Dempsey

jimdempseyatthecove
Honored Contributor III

I suggest you examine the cause of the performance issue using TProf and/or VTune and/or PTU.

Check whether your application has an auto-tuning capability whereby it figures out how large the L2 cache is and then tunes itself for that cache size under the assumption of one core per L2 cache. That was the case for older processor models (and some of the next-gen processors as well), but it is not the case for your X5400 series. This should be a relatively easy thing to check.
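The kind of logic to look for might resemble this sketch (hypothetical, not the application's actual code): it sizes the working set from the L2 and silently assumes it has the whole cache to itself.

#include <unistd.h>

/* Hypothetical auto-tuner of the sort described above. */
long tuned_working_set_bytes(void)
{
    long l2 = sysconf(_SC_LEVEL2_CACHE_SIZE);   /* glibc; may return 0 or -1 */
    if (l2 <= 0)
        l2 = 6L * 1024 * 1024;                  /* fall back to 6MB (X5450)  */
    return l2;  /* BUG on X5400: two cores share this L2; divide by 2 */
}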

If there are hard-wired parameters then, if possible, try halving a parameter that reduces the immediate working set by a factor of 2.

If this were an FSB problem, my guess is the application performance would plateau. This is not the case as you reported. In your first post you stated to the effect that performance dropped by a factor of 2. This would seem to indicate cache eviction (which would cause you to encounter higher FSB activity).

While the new processor _may_ speed things along, it would not fix an underlying problem that, if identified and fixed now, would improve the 4-core run performance by a factor of 4 or so (giving you a net 2x over 2 cores). This performance gain, once attained, will carry forward to some degree on the newer architecture.

Jim Dempsey
robert-reed
Valued Contributor II
Quoting - jimdempseyatthecove

I suggest you examine the cause of the performance issue using TProf and/or VTune and/or PTU.

Check whether your application has an auto-tuning capability whereby it figures out how large the L2 cache is and then tunes itself for that cache size under the assumption of one core per L2 cache. That was the case for older processor models (and some of the next-gen processors as well), but it is not the case for your X5400 series. This should be a relatively easy thing to check.

I agree with Jim that you should try to do some analysis of the code, but it's not as clear to me that the problem you're seeing is due to L2 saturation. How big a fluid volume are you using for your calculations? (Will it even fit in L2?) Is the volume shared across all eight or sixteen machines? If two cores per bus are close to saturation, assuming optimal scheduling of threads to cores, then running four cores per bus might well halve performance.

The Intel Core™ i7 processor offers an integrated memory controller per socket, the Intel QuickPath Interconnect, and faster memory, all of which might benefit a memory-bound application.

tiagosazevedo
Beginner

Quoting - robert-reed

I agree with Jim that you should try to do some analysis of the code, but it's not as clear to me that the problem you're seeing is due to L2 saturation. How big a fluid volume are you using for your calculations? (Will it even fit in L2?) Is the volume shared across all eight or sixteen machines? If two cores per bus are close to saturation, assuming optimal scheduling of threads to cores, then running four cores per bus might well halve performance.

The Intel Core™ i7 processor offers an integrated memory controller per socket, the Intel QuickPath Interconnect, and faster memory, all of which might benefit a memory-bound application.


Thanks for the answer Mr. Reed,

I can't analyse the code because it is proprietary, and I don't have access to the input data either. I think the volume is shared across all the machines.

We have been running other tests with a second app, but this time the performance with 4 threads matches the performance with 6 or 8 threads. Summarizing, we have 2 apps:


App 1 - Fluid Dynamics:

- we use HP-MPI 2.3
- the run with 4 threads is 2x faster than with 8 threads


App 2 - Liquid Migration:

- we use OpenMPI 1.2.6
- the run time with 4 threads is the same as with 6 or 8 threads


We are now trying to learn how to use VTune to identify the problem. Could it be a saturation problem? A limitation of the Core 2 based Xeon 5450?


Thanks
TimP
Honored Contributor III
In line with the earlier suggestion that your application is cache hungry, VTune could verify that hypothesis by showing you a lower cache hit rate as you increase the number of threads. You could call it a limitation of the Xeon 5400 series compared with the Xeon 5500 series, as the newer CPU supports 3 memory channels and can satisfy out-of-cache accesses with correspondingly greater speed.
With HP-MPI, of course, you should be using the -cpu_bind MAP_CPU table to unscramble the core numbering. If you let the affinity float, you won't get repeatable results under VTune, but you also won't see the best or the worst case. Intel MPI has built-in understanding of the core numbering sequence for Intel CPUs. If OpenMPI expects you to use taskset explicitly, again you will need to specify the core number table.
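If the MPI in use won't do the pinning, each rank can also pin itself at startup. A minimal Linux sketch; the map[] table here is hypothetical and must be rebuilt from your boards' BIOS numbering (physical id / core id in /proc/cpuinfo):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Pin the calling process to one core chosen by its MPI rank.
   The rank-to-core map below is only an example; BIOS numbering varies. */
int pin_to_core(int rank)
{
    static const int map[8] = { 0, 2, 4, 6, 1, 3, 5, 7 }; /* hypothetical */
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(map[rank % 8], &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return -1;
    }
    return 0;
}

Called right after MPI_Init with the node-local rank, this makes runs repeatable under VTune regardless of which MPI is underneath.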
srimks
New Contributor II
Quoting - tiagosazevedo

I agree with Jim that you should try to do some analysis of the code, but it's not as clear to me that the problem you're seeing is due to L2 saturation. How big a fluid volume are you using for your calculations? (Will it even fit in L2?) Is the volume shared across all eight or sixteen machines? If two cores per bus are close to saturation, assuming optimal scheduling of threads to cores, then running four cores per bus might well halve performance.

The Intel Core™ i7 processor offers an integrated memory controller per socket, the Intel QuickPath Interconnect, and faster memory, all of which might benefit a memory-bound application.


Thanks for the answer Mr. Reed,

I can't analyse the code because it is proprietary, and I don't have access to the input data either. I think the volume is shared across all the machines.

We have been running other tests with a second app, but this time the performance with 4 threads matches the performance with 6 or 8 threads. Summarizing, we have 2 apps:


App 1 - Fluid Dynamics:

- we use HP-MPI 2.3
- the run with 4 threads is 2x faster than with 8 threads


App 2 - Liquid Migration:

- we use OpenMPI 1.2.6
- the run time with 4 threads is the same as with 6 or 8 threads


We are now trying to learn how to use VTune to identify the problem. Could it be a saturation problem? A limitation of the Core 2 based Xeon 5450?


Thanks

An Intel Xeon 5400 series processor has 2 dies with 2 cores per die, i.e. 4 cores per package, so your dual-socket 5450 machine has 8 cores in total. Run "less /proc/cpuinfo" to see the details.
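If paging through the file gets tedious, the "physical id" and "core id" fields are the ones that show the socket and core layout; a rough sketch (assuming the usual Linux field order, where "physical id" precedes "core id"):

#include <stdio.h>

/* Print the logical-CPU -> socket/core mapping from /proc/cpuinfo.
   On a dual Xeon 5450 box this should list 8 CPUs across 2 physical ids. */
int main(void)
{
    FILE *f = fopen("/proc/cpuinfo", "r");
    char line[256];
    int cpu = -1, phys = -1, core;

    if (!f) { perror("/proc/cpuinfo"); return 1; }
    while (fgets(line, sizeof line, f)) {
        if (sscanf(line, "processor : %d", &cpu) == 1) continue;
        if (sscanf(line, "physical id : %d", &phys) == 1) continue;
        if (sscanf(line, "core id : %d", &core) == 1)
            printf("cpu %d -> socket %d, core %d\n", cpu, phys, core);
    }
    fclose(f);
    return 0;
}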

You wrote: "I can't analyse the code because it is proprietary, and I don't have access to the input data either. I think the volume is shared across all the machines."

Intel VTune can analyze binary code; please check the Getting Started Guide. This would give you an impression of the program flow. To analyze the MPI side, you have to install VTune on selected nodes (from your question, it seems you have 8 nodes/machines). At the least, you can check the L2 cache behaviour and then draw conclusions.

You wrote: "could memory access be the bottleneck?" Yes: the older Intel Xeon processors use an FSB, while the new processor (Nehalem) uses QPI, which is better than the old FSB approach. On top of QPI, Nehalem adds SMT, with each core supporting 2 threads, so TLP can exploit ILP efficiently; on the older Intel Xeon processors, TLP could not exploit ILP well. With all these new features, Nehalem should give better performance and scalability than the older Intel Xeon 5xxx processors. You can try comparing the two for better understanding.

You wrote: "the performance with 4 threads matches the performance with 6 or 8 threads". Try analyzing the thread & MPI behaviour using the Intel Trace Analyzer & Collector. Note that it supports HP-MPI, Intel MPI & MVAPICH, but not OpenMPI.

You wrote: "We are now trying to learn how to use VTune to identify the problem. Could it be a saturation problem?" If you mean bus saturation, try running EBS (event-based sampling) with the "Advanced Performance Tuning" events enabled and check for bus saturation with the following EBS events (a sketch of the derived ratio formulas follows the list):

  1. CPU_CLK_UNHALTED.CORE
  2. L2_LINES_IN.SELF.ANY
  3. INST_RETIRED.ANY
  4. MEM_LOAD_RETIRED.L2_LINE_MISS
  5. L2_LINES_IN.SELF.PREFETCH
  6. BUS_DRDY_CLOCKS.ALL_AGENTS
  7. BUS_TRANS_ANY.ALL_AGENTS
  8. CPU_CLK_UNHALTED.BUS
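Once you have the counts, the bus-utilization ratios are simple functions of them. These are the Core 2 era formulas as I recall them; verify against your VTune version's ratio documentation before relying on them:

/* Ratio definitions (from memory; check the VTune docs for your version). */
double bus_utilization(double bus_trans_any, double clk_bus)
{
    return 100.0 * bus_trans_any * 2.0 / clk_bus; /* ~2 bus clocks/transaction */
}

double data_bus_utilization(double bus_drdy_clocks, double clk_bus)
{
    return 100.0 * bus_drdy_clocks / clk_bus;
}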

I think by now you should have a fairly clear picture of what you can try.

~BR

Mukkaysh Srivastav

smallrabbit
Beginner
Mr Reed.
Do all those features (especially the integrated memory controller) really boost i7 performance notably over the previous Core generation? And can you tell me whether all those features are also present in the mobile i7 CPU line? These features sound like they could bring more performance over the previous generation than we were used to in the past.

TimP
Honored Contributor III
3 channels of DDR3-1333 (or even 1066) with unified cache is quite a boost over 1 channel of DDR2 with split cache.
The Core 2 mobile CPU will continue as top of that line for a while longer, pending development of 32nm CPUs.
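To put rough numbers on it: 3 channels × 1333 MT/s × 8 bytes is about 32 GB/s peak, versus about 6.4 GB/s for one channel of DDR2-800 (the DDR2 speed here is an assumed figure for the comparison), so roughly a 5x raw bandwidth advantage before the cache differences are even counted.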
robert-reed
Valued Contributor II
Quoting - smallrabbit
Do all those features (especially the integrated memory controller) really boost i7 performance notably over the previous Core generation? And can you tell me whether all those features are also present in the mobile i7 CPU line? These features sound like they could bring more performance over the previous generation than we were used to in the past.
Your mileage may vary depending on the nature of your algorithms, but the integrated memory controller should cut the number of propagation delays involved in accessing memory, so memory operations on the Intel Core i7 should run faster, as a lot of users have already discovered. That should be true of the mobile version as well, though the different incarnations of the architecture may vary in the number of Intel QPI ports, etc., that are available.
zbestzeus
Beginner
I think there is a serious issue with the new quad-core Xeon design. I've got two of the new W5580 Xeons, and no matter what I do they are half the speed of a Core i7 965 I have. The i7 has less RAM, not as nice a motherboard, and the same graphics card. We also have two of the 5400s and they are so slow! To put it in perspective, when I run 3DMark Vantage the Core i7 gets a 40,000 ranking, while the dual W5580s get a 15,000 ranking, and that's strictly on the CPU benchmark with the same graphics card. I thought maybe it was an error in 3DMark, so I tried SPECviewperf 10, and the Core i7 got double the performance on every test; it even handled threads better than the 8-core setup! I tried enabling and disabling Hyper-Threading: no difference. I tried both Windows XP 64 and Vista 64: no difference. I was thinking it was a Windows issue, but apparently it's happening on Linux too. These CPUs look like powerhouses on paper, but they absolutely suck when two are paired together. From what I see, a single one works perfectly fine.
http://www.reghardware.co.uk/2009/04/01/review_cpu_intel_xeon_5500/page6.html
Check this out: the dual W5580s are slower than a single one.

Does anybody have a rational explanation?
TimP
Honored Contributor III
Was the benchmark written so as to depend on the BIOS numbering order of the Xeon 5400 series dual quad-cores, and so does it use the worst possible sequence on the newer one?
I read in the ExtremeTech review that the object of your benchmark is to predict graphics performance of future games on the platforms for which the benchmark is written, not, as you have done, to evaluate CPU performance on a platform introduced after it was written.
I don't know how you could relate the situation on Windows, where you presumably have drivers for the video card and DirectX, to Linux, where there is no DirectX and you are lucky if there is even VESA support for a fancy graphics card. Or maybe you didn't have the right driver for your card.
If your point is that the Xeon 5500 series wasn't designed to compete in the gaming business with the Core i7, I guess you can define that as "a serious issue."
If you go looking for ways to write a benchmark which runs slower on a dual socket than a single socket machine, you'll surely find them.
zbestzeus
Beginner
Not sure if you entirely understood my post: this isn't a driver issue, everything is up to date, and I am comparing across the same OS. The only difference is the CPU and the motherboard; the RAM, graphics, and HDD are of the same performance. Regardless of whether the CPU is an i7 or a Xeon, they should perform about the same in a game since they use the same design. The Xeon should in fact be better than the i7, since it is a more refined design. Granted, the 3DMark benchmark utility does not strictly test the CPU, but if both computers use the same graphics card and RAM and the only difference is the CPU, the obvious culprit for the difference should be the CPU. We're not talking a small difference, we're talking a HUGE difference. Now, my second test, which you didn't take a look at, was SPECviewperf 10; take a look at their website: http://www.spec.org/benchmarks.html. This application is the standard for testing how well a workstation performs for video editing. The Core i7 computer has double the performance in every test. A dual Xeon W5580 obviously has more processing power than a single Core i7, but the fact remains that in all the tests performed it is slower. We have taken out the extra processor, leaving a single Xeon W5580, and the computer performs slightly better in benchmarks. This makes no logical sense; it seems like there is an error somewhere in the design when using dual processors. I would assume the only people who need dual CPUs are people doing video editing or running a server. The person who started this thread said that when he used 16 four-core machines the performance was double that of 8 eight-core machines. Here we have the problem I am experiencing: when using 8 cores, it's as if only 4 cores are being used. If both video editors and servers are not able to use the dual CPUs' performance, I feel this is a SERIOUS problem.
tiagosazevedo
Beginner

Hi everybody,

I ran some tests on the fluid dynamics problem using VTune and got the results below:

All runs are solver-hpmpi.exe. Derived ratios:

Config                   | Bus Util | CPI   | DTLB Miss | Data Bus Util | ITLB Miss | L1D Miss Impact | L1D Miss Rate | L2 Demand Miss | L2 Miss | L2 Mod. Evict | Locked Ops | TLB Penalty
2 mach, 6 thr (3/mach)   | 45.56%   | 0.832 | 0.000     | 19.64%        | 0.000     | 13.83%          | 0.014         | 0.001          | 0.006   | 0.002         | 0.11%      | 0.55%
2 mach, 8 thr (4/mach)   | 51.02%   | 0.843 | 0.000     | 20.64%        | 0.000     | 12.42%          | 0.013         | 0.001          | 0.005   | 0.002         | 0.11%      | 0.55%
2 mach, 12 thr (6/mach)  | 61.96%   | 0.876 | 0.000     | 23.36%        | 0.000     | 8.95%           | 0.010         | 0.001          | 0.004   | 0.002         | 0.10%      | 0.63%
2 mach, 16 thr (8/mach)  | 64.79%   | 0.960 | 0.000     | 25.85%        | 0.000     | 7.13%           | 0.009         | 0.001          | 0.004   | 0.001         | 0.09%      | 0.67%
2 mach, 24 thr (12/mach) | 2.81%    | 1.591 | 0.004     | 1.95%         | 0.003     | 0.93%           | 0.002         | 0.000          | 0.000   | 0.000         | 1.94%      | 4.69%
1 mach, 3 thr            | 49.46%   | 0.829 | 0.000     | 21.01%        | 0.000     | 13.73%          | 0.014         | 0.001          | 0.006   | 0.002         | 0.10%      | 0.59%
1 mach, 4 thr            | 56.72%   | 0.910 | 0.000     | 22.77%        | 0.000     | 12.77%          | 0.015         | 0.001          | 0.006   | 0.002         | 0.06%      | 0.60%
1 mach, 6 thr            | 65.39%   | 0.976 | 0.000     | 26.33%        | 0.000     | 8.59%           | 0.010         | 0.001          | 0.005   | 0.002         | 0.06%      | 0.66%
1 mach, 8 thr            | 75.74%   | 1.254 | 0.000     | 29.86%        | 0.000     | 7.65%           | 0.012         | 0.001          | 0.005   | 0.002         | 0.33%      | 0.79%
1 mach, 12 thr           | 20.77%   | 1.628 | 0.005     | 1.19%         | 0.003     | 2.65%           | 0.005         | 0.000          | 0.003   | 0.001         | 3.96%      | 3.51%

Per-event sample share for solver-hpmpi.exe (%):

Config                   | CLK.CORE | INST_RET | DTLB  | ITLB_RET | PG_WALKS | BUS_ANY | CLK.BUS | BUS_DRDY | L1D_REPL | L2_IN.DEM | L2_IN.ANY | L2_M_OUT | LOCK.MESI | LOCK_DUR
2 mach, 6 thr (3/mach)   | 96.86    | 99.11    | 93.50 | 73.84    | 92.57    | 39.06   | 96.89   | 41.40    | 99.49    | 97.18     | 99.43     | 99.56    | 63.92     | 42.37
2 mach, 8 thr (4/mach)   | 97.79    | 99.32    | 94.19 | 75.00    | 93.97    | 49.89   | 97.82   | 49.84    | 99.60    | 98.06     | 99.41     | 99.61    | 71.27     | 51.38
2 mach, 12 thr (6/mach)  | 98.38    | 99.28    | 92.44 | 39.29    | 94.37    | 75.10   | 98.73   | 74.83    | 99.64    | 99.54     | 99.58     | 99.67    | 65.74     | 61.45
2 mach, 16 thr (8/mach)  | 99.30    | 99.63    | 96.13 | 77.57    | 94.32    | 99.38   | 99.34   | 99.39    | 99.64    | 98.94     | 99.49     | 99.49    | 83.01     | 71.16
2 mach, 24 thr (12/mach) | 99.51    | 99.38    | 99.77 | 99.97    | 99.80    | 95.37   | 99.55   | 97.25    | 97.25    | 87.96     | 95.43     | 97.46    | 99.22     | 94.78
1 mach, 3 thr            | 96.97    | 99.16    | 93.98 | 70.59    | 93.31    | 38.99   | 97.06   | 41.48    | 99.50    | 98.21     | 99.44     | 99.63    | 63.35     | 43.33
1 mach, 4 thr            | 97.85    | 99.32    | 93.46 | 75.33    | 95.44    | 49.86   | 49.86   | 97.79    | 99.60    | 97.85     | 99.50     | 99.46    | 48.03     | 42.09
1 mach, 6 thr            | 98.96    | 99.54    | 96.77 | 75.82    | 95.74    | 75.19   | 98.93   | 75.09    | 99.63    | 99.08     | 99.60     | 99.59    | 59.28     | 64.92
1 mach, 8 thr            | 99.44    | 99.57    | 97.07 | 82.87    | 95.98    | 99.56   | 99.46   | 99.58    | 99.71    | 99.32     | 99.71     | 99.64    | 97.13     | 93.32
1 mach, 12 thr           | 99.60    | 99.40    | 99.79 | 99.95    | 99.54    | 99.27   | 99.58   | 96.75    | 56.73    | 27.64     | 99.16     | 99.40    | 95.11     | 77.19

(Key: CLK.CORE = CPU_CLK_UNHALTED.CORE, INST_RET = INST_RETIRED.ANY, DTLB = DTLB_MISSES.ANY, ITLB_RET = ITLB_MISS_RETIRED, PG_WALKS = PAGE_WALKS.CYCLES, BUS_ANY = BUS_TRANS_ANY.ALL_AGENTS, CLK.BUS = CPU_CLK_UNHALTED.BUS, BUS_DRDY = BUS_DRDY_CLOCKS.ALL_AGENTS, L2_IN.DEM = L2_LINES_IN.SELF.DEMAND, L2_IN.ANY = L2_LINES_IN.SELF.ANY, L2_M_OUT = L2_M_LINES_OUT.SELF.ANY, LOCK.MESI = L1D_CACHE_LOCK.MESI, LOCK_DUR = L1D_CACHE_LOCK_DURATION. The derived-ratio columns are VTune's Bus Utilization, Clocks per Instructions Retired (CPI), DTLB Miss Rate, Data Bus Utilization, ITLB Miss Rate, L1 Data Cache Miss Performance Impact, L1 Data Cache Miss Rate, L2 Cache Demand Miss Rate, L2 Cache Miss Rate, L2 Modified Lines Eviction Rate, Locked Operations Impact, and TLB Miss Penalty.)

Analysing the results, I think there is no cache eviction or excessive interlock contention, and it seems the bus is not overloaded either.
I cannot figure out why the bus utilization decreases when I use more than 8 threads per machine.

Can anyone explain these numbers?

Thanks!

Quoting - zbestzeus
Not sure if you entirely understood my post: this isn't a driver issue, everything is up to date, and I am comparing across the same OS. The only difference is the CPU and the motherboard; the RAM, graphics, and HDD are of the same performance. Regardless of whether the CPU is an i7 or a Xeon, they should perform about the same in a game since they use the same design. The Xeon should in fact be better than the i7, since it is a more refined design. Granted, the 3DMark benchmark utility does not strictly test the CPU, but if both computers use the same graphics card and RAM and the only difference is the CPU, the obvious culprit for the difference should be the CPU. We're not talking a small difference, we're talking a HUGE difference. Now, my second test, which you didn't take a look at, was SPECviewperf 10; take a look at their website: http://www.spec.org/benchmarks.html. This application is the standard for testing how well a workstation performs for video editing. The Core i7 computer has double the performance in every test. A dual Xeon W5580 obviously has more processing power than a single Core i7, but the fact remains that in all the tests performed it is slower. We have taken out the extra processor, leaving a single Xeon W5580, and the computer performs slightly better in benchmarks. This makes no logical sense; it seems like there is an error somewhere in the design when using dual processors. I would assume the only people who need dual CPUs are people doing video editing or running a server. The person who started this thread said that when he used 16 four-core machines the performance was double that of 8 eight-core machines. Here we have the problem I am experiencing: when using 8 cores, it's as if only 4 cores are being used. If both video editors and servers are not able to use the dual CPUs' performance, I feel this is a SERIOUS problem.
