Memory Order Machine Clear Issues

jrzhou · 04-27-2004

Can anyone explain how (and why) the pipeline is cleared due to memory ordering issues? Any example?

Thanks.

ClayB · 04-29-2004

jrzhou -

What architecture are you asking about? Is this with regard to Pentium, Xeon, Itanium, or something else?

-- clay

jrzhou · 04-30-2004

I am talking about Pentium 4 CPUs with Hyper-Threading. (I guess it also applies to Xeon CPUs with Hyper-Threading.)

ClayB · 05-10-2004

jrzhou -

I've consulted with some architecture experts within Intel and they've given me several reasons that might cause excessive pipeline clearing.

The first is false sharing. If two threads access different variables that happen to sit in the same cache line, you have false sharing. When one thread modifies its variable, the whole cache line becomes "dirty", and the updated line has to be propagated to each physical or logical processor that holds it in local cache, even though the threads never touch each other's data.

The second case is when the processor detects the chance of a memory order violation. Since such a violation would result in incorrect program execution, the hardware needs to make sure the correct memory order is maintained. How this is handled is specific to the hardware implementation, but every Intel processor is built to detect the possibility and to take steps to guarantee correct memory ordering. There weren't many details about this, but I'm assuming that out-of-order execution can lead to memory being accessed in an incorrect order.

The third possibility mentioned had to do with self-modifying code, but I'm hoping that we don't need to deal with that.

Are any of the above helpful?

-- clay

Henry_G_Intel · 05-10-2004

The "Developing Multithreaded Applications: A Platform Consistent Approach" document on Intel Developer Services has a section describing false sharing and how to detect it with VTune. If you're interested, see "Avoiding and Identifying False Sharing Among Threads with the VTune Performance Analyzer" in Chapter 2.

Henry
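To make the false sharing case above a bit more concrete, here is a minimal C sketch of the usual padding remedy. It is an illustration only, not taken from the document Henry mentions: the struct names, the same_line helper, and the 64-byte CACHE_LINE value are all assumptions, and the right line size depends on your processor.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof, size_t */

/* Assumption: 64-byte cache lines. Adjust CACHE_LINE for your processor. */
#define CACHE_LINE 64

/* Two per-thread counters packed together: both will normally
   fall in the same cache line, so updates from two threads collide. */
struct packed_counters {
    long thread0_count;
    long thread1_count;
};

/* One counter padded out so that it occupies a full line by itself. */
struct padded_counter {
    long count;
    char pad[CACHE_LINE - sizeof(long)];
};

static_assert(sizeof(struct padded_counter) == CACHE_LINE,
              "padding must fill exactly one cache line");

/* Each thread's counter now starts CACHE_LINE bytes after the other's. */
struct padded_counters {
    struct padded_counter thread0;
    struct padded_counter thread1;
};

/* Hypothetical helper: returns 1 when two byte offsets, measured from
   an object that starts on a line boundary, land in the same line. */
int same_line(size_t off_a, size_t off_b)
{
    return off_a / CACHE_LINE == off_b / CACHE_LINE;
}
```

The padding only separates the counters if the enclosing object itself starts on a cache line boundary, so in real code the allocation would also need to be aligned accordingly.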

TimP · 05-11-2004

SSE vectorization could produce an overlap between an aligned store and a following un-aligned load, with effects analogous to false sharing, but within a single thread. Future Intel compilers will look for this, and take action to remove the conflict from generated code:

FORALL(I=5:N)

A[5:N] = ...

B[1:N] = A[4:N] + B[1:N]

The compiler vectorizes by use of parallel SSE operations. The un-aligned loads of segments of A[] have to pick up data stored in both the previous and the current iterations of the parallelized loop.

TimP · 05-11-2004

SSE vectorization could produce an overlap between an aligned store and a following un-aligned load, with effects analogous to false sharing, but within a single thread. Future Intel compilers will look for this, and take action to remove the conflict from generated code:

DO I = 5,N
  A[I] = ...
  B[I] = A[I-1] + B[I]
END DO

The compiler vectorizes by use of parallel SSE operations. The un-aligned loads of segments of A[] have to pick up data stored in both the previous and the current iterations of the parallelized loop. The parallel code is slower than serial code, because the memory order hardware has to delay the load until the preceding store has gone to memory. In this simple case, that is corrected by "distributing" (splitting) the loop, letting all of A[] go to memory before reading the values back, without incurring the memory order clear for each load.
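As a rough illustration of the loop distribution described above, here is a C sketch of the same pattern before and after splitting. This is hand-written and is not what the compiler emits: the function names are made up, src[] is a hypothetical stand-in for the elided right-hand side of the store to A, and the arrays are assumed not to overlap one another.

```c
#include <stddef.h>

/* Fused form: each iteration stores a[i] and then immediately reads
   a[i-1]. Once vectorized, the shifted (un-aligned) load of a[]
   overlaps the aligned store that was issued just before it.
   Assumes a, b and src do not alias. */
void store_then_load_fused(float *a, float *b, const float *src, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        a[i] = src[i];            /* stand-in for "A[I] = ..." */
        b[i] = a[i - 1] + b[i];
    }
}

/* Distributed form: the first loop finishes every store to a[]
   before the second loop starts reading a[] back. */
void store_then_load_distributed(float *a, float *b, const float *src, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        a[i] = src[i];
    }
    for (size_t i = 1; i < n; i++) {
        b[i] = a[i - 1] + b[i];
    }
}
```

Both versions compute the same values, because a[i-1] is always written before iteration i reads it; only the distance between each store and the load that depends on it changes.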