Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Memory Order Machine Clear Issues

jrzhou
Beginner
Can anyone explain how (and why) the pipeline is cleared due to memory ordering issues? Any example?

Thanks.
ClayB
New Contributor I

jrzhou -

What architecture are you asking about? Is this with regards to Pentium, Xeon, Itanium, or something else?

-- clay

jrzhou
Beginner
I am talking about Hyper-Threading Pentium 4 CPUs. (I guess it also applies to Hyper-Threading Xeon CPUs.)
ClayB
New Contributor I

jrzhou -

I've consulted with some architecture experts within Intel, and they've given me several situations that might cause excessive pipeline clearing.

The first is false sharing. If two threads access different data that happen to lie within the same cache line, you have false sharing. When one thread modifies its variable, every other physical or logical processor holding that line in its local cache has its copy invalidated and must re-fetch the line, even though the threads never touch the same variable.
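As a minimal sketch of that pattern (my own C/pthreads illustration; the 64-byte line size, the counter names, and the iteration count are assumptions, not anything from this thread):

/* Two threads increment separate counters that share one cache
 * line, so every write invalidates the other processor's copy. */
#include <pthread.h>

#define ITERS 100000000L

struct counters {
    long a;               /* written only by thread 1               */
    long b;               /* written only by thread 2 -- same line! */
};
/* Padded alternative, one counter per assumed 64-byte line:
 *   struct padded { long v; char pad[64 - sizeof(long)]; };        */

static struct counters shared;

static void *bump_a(void *arg)
{
    (void)arg;
    for (long i = 0; i < ITERS; i++)
        shared.a++;
    return NULL;
}

static void *bump_b(void *arg)
{
    (void)arg;
    for (long i = 0; i < ITERS; i++)
        shared.b++;
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump_a, NULL);
    pthread_create(&t2, NULL, bump_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}

Padding each counter out to its own line, as sketched in the comment, removes the ping-ponging without changing what the threads compute.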

The second case is when the processor detects the chance of a memory order violation. Since such a violation would result in incorrect program execution, the hardware needs to make sure the correct memory order is maintained. How this problem is handled is specific to the hardware implementation, but every Intel processor is built to detect the possibility and takes steps to guarantee correct memory ordering. There weren't many details about this, but I'm assuming that out-of-order execution can cause a load to be performed earlier than program order allows, and when another processor's store lands in that window, the pipeline is cleared and the load is re-executed so the correct order is observed.
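As a rough sketch of where that can happen between two threads on hyper-threaded (or separate) processors (my own C/pthreads illustration; the names are made up, volatile stands in for proper synchronization, and whether a clear actually fires depends on timing):

/* Flag/data handoff: the consumer's load of data can execute
 * speculatively before its last read of flag retires.  If the
 * producer's store to data arrives in that window, the CPU must
 * flush the pipeline and re-execute the load to preserve the
 * x86 memory order -- a memory order machine clear.            */
#include <pthread.h>
#include <stdio.h>

static volatile int data = 0;   /* payload written by the producer */
static volatile int flag = 0;   /* "data is ready" indicator       */

static void *producer(void *arg)
{
    (void)arg;
    data = 42;                  /* store the payload first...      */
    flag = 1;                   /* ...then publish it              */
    return NULL;
}

static void *consumer(void *arg)
{
    (void)arg;
    while (flag == 0)           /* spin until the flag is observed */
        ;
    printf("data = %d\n", data);
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}

Real code should of course use atomics or locks here; the point of the sketch is only to show the kind of cross-thread store/load timing that the memory order hardware has to guard against.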

The third possibility mentioned had to do with self-modifying code, but I'm hoping that we don't need to deal with that.

Are any of the above helpful?

-- clay

Henry_G_Intel
Employee
The "Developing Multithreaded Applications: A Platform Consistent Approach" document on Intel Developer Services has a section describing false sharing and how to detect it with VTune. If you're interested, see "Avoiding and Identifying False Sharing Among Threads with the VTune Performance Analyzer" in Chapter 2.
Henry
TimP
Honored Contributor III

SSE vectorization could produce an overlap between an aligned store and a following unaligned load, with effects analogous to false sharing, but within a single thread. Future Intel compilers will look for this and take action to remove the conflict from the generated code:

FORALL (I = 5:N)
   A(I) = ...
   B(I) = A(I-1) + B(I)
END FORALL

The compiler vectorizes by use of parallel SSE operations. The unaligned loads of segments of A have to pick up data stored in both the previous and the current iterations of the parallelized loop.

TimP
Honored Contributor III

SSE vectorization could produce an overlap between an aligned store and a following unaligned load, with effects analogous to false sharing, but within a single thread. Future Intel compilers will look for this and take action to remove the conflict from the generated code:

DO I = 5, N
   A(I) = ...
   B(I) = A(I-1) + B(I)
END DO

The compiler vectorizes by use of parallel SSE operations. The unaligned loads of segments of A have to pick up data stored in both the previous and the current iterations of the parallelized loop. The parallel code is slower than serial code because the memory order hardware has to delay the load until the preceding store has gone to memory. In this simple case, that is corrected by "distributing" (splitting) the loop, letting all of A go to memory before reading it back, without incurring the memory order clear for each load.
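To make the "distributing" fix concrete, here is the same idea rendered in C (my own illustration; the array names follow the example above, and the right-hand side is only a stand-in for whatever the real loop computes):

#include <stddef.h>

/* Fused form: once vectorized, the unaligned load of a[i-1] overlaps
 * an aligned store to a[] from the same or the previous vector
 * iteration, so each load waits on (or is cleared by) the store.    */
void fused(double *a, double *b, size_t n)
{
    for (size_t i = 5; i < n; i++) {
        a[i] = i * 0.5;              /* stand-in computation */
        b[i] = a[i - 1] + b[i];
    }
}

/* Distributed form: every store to a[] retires in the first loop,
 * so the second loop's loads never conflict with in-flight stores.  */
void distributed(double *a, double *b, size_t n)
{
    for (size_t i = 5; i < n; i++)
        a[i] = i * 0.5;              /* stand-in computation */
    for (size_t i = 5; i < n; i++)
        b[i] = a[i - 1] + b[i];
}

Splitting the loop lets every store to a[] complete before the first overlapping load issues, which is essentially the transformation described above.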
