Intel® ISA Extensions

How many MMX/SSE units in Core-2 Quad

murzik
Beginner
1,796 Views

I have a powerful HP computer with a Q9550 (Core 2 Quad CPU). It seems that there is only one MMX/SSE unit shared between all 4 cores.

The reason I think so is the following. I am running a simple program that uses SSE2.

  • Running 1 thread achieves 300MB/s.
  • Running 2 threads achieves 150MB/s per thread.
  • Running 4 threads achieves 75MB/s per thread.

My laptop with a T7250 (Core 2 Duo CPU) exhibits similar behavior.

Is it true that Core-2 CPUs contain only one MMX/SSE unit?

Thanks!

0 Kudos
31 Replies
gabest
Beginner
1,319 Views
I don't think all the cores could share a single execution unit :) Quads are actually two dual cores. It is more likely you are limited by the system memory or the cache. You could upload your code if it is simple enough.
0 Kudos
gol
Beginner
1,319 Views

Yes, it's probably the shared cache.

But I wonder which is worse: a shared cache that brings performance down, or a per-core cache that would still bring performance down because the OS's scheduler places your threads on random cores, so they may not always run on the same core and find their cached data there.

I suspect my quad is 2 duals sharing 2 caches or something, because I get pretty weird results when I set thread affinity to specific cores (a lot better when it's on cores 1 & 2 or 3 & 4 than on cores 1 & 3 or 2 & 3).

I saw that Vista supports NUMA; maybe this is a solution?

0 Kudos
TimP
Honored Contributor III
1,319 Views
Yes, Core 2 Quad has 2 cores on each L2 cache. I have observed a 30% loss in performance when not setting affinity correctly for an MPI funnelled application on Core 2 Quad. And yes, each core has its own register set, in case that was the subject of your first post.
In a normal OpenMP application, it's important to get the physical ordering right, so that pairs of threads operating on contiguous data share the same cache. At least, if you write a benchmark (intentionally or not) that measures TLB stalls or false sharing, you must account for the mapping you choose, or for the variability, should you leave the scheduling to Windows.
Availability of affinity tools is one of the advantages in using the Intel OpenMP library, which is available for VC9 as well as Intel compilers.
Improved scheduling hasn't been accepted for Vista, and is still in the proposal stage for Windows 7.

0 Kudos
murzik
Beginner
1,319 Views

Thank you for the offer to examine the code. While creating a simple test I have realized that my performance was bound to memory accesses. So this solves the problem. Thanks again!

0 Kudos
gol
Beginner
1,319 Views

TimP wrote:

Availability of affinity tools is one of the advantages in using the Intel OpenMP library, which is available for VC9 as well as Intel compilers.

But you still have to know the details of the system's caches in the first place, no? I mean, is there a way to find out, other than by checking the CPU ID? Otherwise you can only optimize for the CPUs you know, and it won't be future-proof.
Or is there really a way to detect shared caches through specific flags?

0 Kudos
TimP
Honored Contributor III
1,319 Views
The Intel environment variables employ several strategies, including cpuid, to determine the cache topology, and provide diagnostic options to show you what decision was made, assuming the BIOS is correct.
You also have the option of specifying a mapping of OpenMP threads to logical processors; as you say, this requires you to determine those details yourself, and they may change even with BIOS changes on the same platform.
0 Kudos
lxguy
Beginner
1,319 Views
It's true that those details will change with BIOS changes on the same platform.
0 Kudos
levicki
Valued Contributor I
1,319 Views

Most likely your program is poorly optimized and the cores are competing for memory bandwidth. You should consider different data layout or different algorithm.

0 Kudos
Shiv_Inside
Beginner
1,319 Views

Hello,

I just wanted to confirm the conclusion of this topic. Does each core on a quad-core processor have its own SSE unit (four SSE units in total)? If so, performance should go up by 4x compared to sequential code, don't you think?

Also, if you could post some references on how to spawn threads on each core and control them, that would be awesome!

I'm a newbie, sorry if I have made any unreasonable assumptions.

Thanks a lot in advance!

0 Kudos
jimdempseyatthecove
Honored Contributor III
1,319 Views

Each core has one SSE unit.

In the original post

1T: 300 MB/s per thread
2T: 150 MB/s per thread
4T: 75 MB/s per thread

This indicates the test program is memory bound.

>> If so, the performance is supposed to go up by 4 times compared to sequential code, don't you think ?


Only when the four execution units can be fed when hungry (kept busy). As in the original post, an application might not be able to keep the execution units busy due to memory and/or cache latency issues. As Ivan indicated, improved algorithms can reduce the number of memory and LLC accesses and thus improve overall performance. One usually observes linear scaling only when none of the memory systems (RAM, L3, L2, L1) reaches saturation as the number of threads increases (assuming no oversubscription and no preemption). In some cases, well-written multi-threaded algorithms can observe super-linear scaling. This occurs when additional threads can take advantage of cache fills already performed by a different thread, hiding the RAM/L3(/L2) latencies.

Jim Dempsey

0 Kudos
Sukruth_H_Intel
Employee
1,319 Views

Hi Shiva,

             Just to add my suggestion on load balancing, where the worker threads are created and assigned to cores automatically: you can have a look at "Cilk Plus" and its array notation, which improve vectorization and do automatic load balancing.

https://www.cilkplus.org/

Regards,

Sukruth H V

0 Kudos
Bernard
Valued Contributor I
1,319 Views

As Jim said, each core has one vector SSE unit, which is probably composed of a floating-point adder and multiplier. I think that the same unit also contains an integer adder and multiplier, which are hardwired to different execution ports. On newer microarchitectures branch logic was added to the execution stack, but it is probably not tightly coupled to the arithmetic units.

I could be wrong in my assumption that the integer part of the SSE unit is used to calculate memory addresses.

0 Kudos
McCalpinJohn
Honored Contributor III
1,319 Views

The most detailed microarchitecture/implementation descriptions that I have seen for a broad range of processors are at Agner.org, in the documents called "microarchitecture.pdf" and "instruction_tables.pdf".   Based on a combination of vendor documentation and very careful microbenchmarking, the former describes the microarchitecture, while the latter shows how each instruction maps to the various execution pipelines.  I think that you have to look at the code at this level of detail to compute the minimum execution time of a piece of code accurately.  Of course even these tables only document the most common case(s) -- any implementation is going to have "corner cases" for which extra stalls occur in instructions that access memory (and sometimes in other classes of instructions as well).

For the Q9550 (Core 2 Quad) processor, Wikipedia says that this is made of two "Wolfdale" parts in one package.  Agner Fog's "microarchitecture.pdf" includes this in the chapter titled "Core 2 and Nehalem pipeline", while the "instruction_tables.pdf" file includes a chapter on Wolfdale that seems pretty complete.

In this particular case, the microarchitecture notes state that the Core 2 has separate functional units for integer multiplication and floating-point multiplication, with integer multiplication instructions issued on port 1 and floating-point multiplication instructions issued on port 0.  Memory reads (both scalar and aligned SSE) are issued to port 2, which is used only by memory read instructions, so there are no conflicts between these reads and any arithmetic instructions.  On the other hand, unaligned SSE read instructions issue to port 0, port 5, and twice to port 2, so they are capable of interfering with the many other instructions that need to issue on ports 0 and 5.

But certainly the short answer is that on the Q9550 processor none of the execution units are shared across cores, so the theoretical peak performance is linear in the number of cores used.   Each of the dual-core chips inside the package shares its 6 MiB L2 across the two cores, so contention can begin when both cores on the same chip execute memory accesses that miss their private L1 caches.  The two dual-core chips share a single Front-Side Bus, so contention between cores on different chips can occur when two or more cores miss in the L2 cache.

0 Kudos
Bernard
Valued Contributor I
1,319 Views

So it seems that the Q9550 has different execution units for floating-point arithmetic and for integer arithmetic, each wired to a different port. I wonder if legacy x87 floating-point arithmetic is executed by a different unit?

From the description provided by John, it seems that memory address calculations are performed by a separate integer unit, thus not conflicting with the arithmetic integer unit.

0 Kudos
Christian_M_2
Beginner
1,319 Views

iliyapolak wrote:

So it seems that the Q9550 has different execution units for floating-point arithmetic and for integer arithmetic, each wired to a different port. I wonder if legacy x87 floating-point arithmetic is executed by a different unit?

From the description provided by John, it seems that memory address calculations are performed by a separate integer unit, thus not conflicting with the arithmetic integer unit.

Regarding x87, I think I can make a statement. I assume it is a completely different unit. x87 internally works (also on new hardware) with an 80-bit data representation, not single or double precision. Therefore I assume the hardware cannot be reused for legacy x87.

0 Kudos
Bernard
Valued Contributor I
1,319 Views

Christian M. wrote:

iliyapolak wrote:

So it seems that the Q9550 has different execution units for floating-point arithmetic and for integer arithmetic, each wired to a different port. I wonder if legacy x87 floating-point arithmetic is executed by a different unit?

From the description provided by John, it seems that memory address calculations are performed by a separate integer unit, thus not conflicting with the arithmetic integer unit.

Regarding x87, I think I can make a statement. I assume it is a completely different unit. x87 internally works (also on new hardware) with an 80-bit data representation, not single or double precision. Therefore I assume the hardware cannot be reused for legacy x87.

I can only suppose that the FP execution stack of the Haswell CPU contains the legacy x87 circuitry, although this is not stated in the link pasted below.

http://www.realworldtech.com/haswell-cpu/4/

0 Kudos
Christian_M_2
Beginner
1,319 Views

I think we won't get a clear statement on this. Isn't x87 marked as 'outdated'? So new reviews will not concentrate on it.

BTW, the posted link is great, you get quite good information.

0 Kudos
Bernard
Valued Contributor I
1,319 Views

>>>I think we won't get a clear statement on this. Isn't x87 marked as 'outdated'? So new reviews will not concentrate on it>>>

It is outdated, but it must be kept for compatibility and for scalar FP calculations with higher precision.

I suppose that x87 forms a part of the FP execution stack which deals with scalar values.

0 Kudos
TimP
Honored Contributor III
1,204 Views

Agner Fog has been keeping his tables up to date:

http://www.agner.org/optimize/instruction_tables.pdf

so maybe you could find some comparisons of the early Core 2 Quad vs. current CPU generations.

I'm still mystified as to what you were driving at; were you expecting that x87 instructions could execute in parallel with SSE instructions without requiring the same resources?   I think we have a reasonable guarantee that there is no sharing of program-accessible registers, but it seems clear they do share micro-op execution pipelines.  As to register sharing between instruction modes, that was tried when MMX was introduced, and abandoned when compatible CPUs came on the market with independent register sets.  My personal, barely educated guess would be that most independent coding for x87 instructions would reside in microcode ROM and not in dedicated circuitry.

In my experience, compilers have given up attempting to cope with register pressure in 32-bit mode by using both x87 and SIMD registers.  Communication between the register sets is impossibly slow, and the design of the Windows 64-bit ABI seemed to exclude attempts to do that in x64. The shortage of integer registers is an even worse bottleneck in 32-bit mode.  Intel compilers have dropped the support for the combined x87 and SSE mode that was required for the P-III, and now support the P-III and Athlon32 only in x87 mode; that stuff was already obsolescent when the Core 2 came out.  But I'm going out on a limb in guessing you might have something like this in mind.

0 Kudos